Predicting Emotions with BERT - まさおのデータサイエンティストへの道

0. Table of Contents

0. Table of Contents
1. What I Want to Do
2. BERT (Bidirectional Encoder Representations from Transformers)
- 2.1. BERT Pre-training
  - Masked Language Model
  - Next Sentence Prediction
- 2.2. BERT Fine-tuning
3. The Transformers Library
4. Dataset
5. Building an Emotion Analysis Model with Pre-trained BERT (Fine-tuning)
6. References

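1. What I Want to Do

While studying LLMs (Large Language Models) such as the recently popular ChatGPT, I learned that the pre-trained models distributed by Hugging Face can be fine-tuned and applied to individual tasks with little effort. In this post I fine-tune a pre-trained BERT (Bidirectional Encoder Representations from Transformers) to build a model that classifies human emotions.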

2. BERT (Bidirectional Encoder Representations from Transformers)

BERT was announced by Google in 2018 and brought remarkable performance gains on NLP tasks. It is an NLP model that uses only the encoder part of the Transformer model. For details, see the BERT paper below.

https://arxiv.org/pdf/1810.04805.pdf

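2.1. BERT Pre-training

When we use an LLM, we almost never train the model from scratch. BERT is first trained on a large text corpus so that it learns general-purpose language patterns (pre-training), and is then tuned for each individual task.
For pre-training, BERT combines two objectives, Masked Language Model and Next Sentence Prediction, to learn from a large amount of data.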

Masked Language Model

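Part of the input text is masked, and the model predicts the masked tokens from the text before and after them. This trains the model to fill in a masked position from the context on both sides.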
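To make the masking step concrete, here is a minimal sketch in plain Python. It is a simplification: the BERT paper selects 15% of the tokens and replaces most (not all) of them with [MASK], whereas this toy version always uses [MASK]. The function name `mask_tokens` and the example sentence are my own illustration, not part of any library.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace roughly mask_prob of the tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    that serves as the prediction targets.
    """
    rng = random.Random(seed)
    # Always mask at least one token so there is something to predict
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, targets

tokens = "i felt great joy when my brother arrived at the airport".split()
masked, targets = mask_tokens(tokens)
print(masked)   # the input the model would see
print(targets)  # the tokens the model would have to recover
```

During pre-training, the loss is computed only at the masked positions, so the model has to rely on the unmasked tokens around them.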

Next Sentence Prediction

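Two sentences are given as input, and the model predicts how they are related, specifically whether the second sentence actually follows the first in the original text. This trains the model to recognize the relationship between two sentences.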
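The following is a rough sketch of how training pairs for this objective could be built. It is an assumption-laden toy: the negative sentence is drawn from the same small list, whereas real pre-training draws it from a large corpus. `make_nsp_pairs` and the sample sentences are hypothetical names and data for illustration only.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, label) triples from consecutive sentences.

    label is "IsNext" when sentence_b really follows sentence_a, and "NotNext"
    when sentence_b was swapped for a different sentence.
    """
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # Pick any sentence except the current one and its true successor
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), "NotNext"))
    return pairs

sentences = [
    "My brother came home after five years.",
    "I met him at the airport.",
    "I felt great joy.",
    "We talked all night.",
]
pairs = make_nsp_pairs(sentences)
for a, b, label in pairs:
    print(label, "|", a, "->", b)
```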

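2.2. BERT Fine-tuning

To solve an individual task with BERT, a task-specific model is built by attaching new components, such as a classifier, to BERT according to the task. As a result of pre-training, the output side yields text vectors that are dynamically weighted according to the context learned from large amounts of text. Fine-tuning adjusts the model, and with it these vectors, so that they work better for the task at hand. Because the parameters obtained in pre-training are used as the initial values for fine-tuning, a high-performing model can be obtained even with a relatively small amount of training data.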
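As a small illustration of what "attaching a classifier" means, the sketch below applies a linear layer to a sentence vector and produces one score (logit) per class. The 4-dimensional vector and all weight values are made up for readability; BERT-base produces 768-dimensional vectors and the real weights are learned during fine-tuning.

```python
def linear_head(pooled, weights, bias):
    """Compute one logit per class: sum_i pooled[i] * weights[c][i] + bias[c]."""
    return [
        sum(x * w for x, w in zip(pooled, class_weights)) + b
        for class_weights, b in zip(weights, bias)
    ]

# Hypothetical 4-dimensional sentence vector (BERT-base uses 768 dimensions)
pooled = [0.5, -1.0, 0.25, 2.0]
# One weight row per class: index 0 = anger, index 1 = joy (made-up values)
weights = [[0.1, 0.4, 0.0, -0.3],
           [-0.2, 0.1, 0.8, 0.5]]
bias = [0.0, 0.1]

logits = linear_head(pooled, weights, bias)
print(logits)  # the class with the larger logit is the prediction
```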

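3. The Transformers Library

In this post I use the Transformers library provided by Hugging Face, a company based in the United States. It offers models for a wide range of natural language processing tasks such as text summarization, classification, and generation, and its pre-trained models make it easy to fine-tune for an individual task. For details, see the link below.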

huggingface.co

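4. Dataset

I use the following data to build the emotion analysis model.

www.kaggle.com

The data consists of texts and emotion labels.
There are seven emotion labels ('anger', 'disgust', 'fear', 'guilt', 'joy', 'sadness', 'shame'), but this time I build a binary classification model that decides whether a text is 'joy' or 'anger'.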

5. Building an Emotion Analysis Model with Pre-trained BERT (Fine-tuning)

Everything is run on Google Colaboratory.

5.1. Installing the Libraries

!pip install transformers==4.26.0
!pip install datasets

First, install Transformers, the Hugging Face library that gives access to LLMs such as BERT. In addition, install Hugging Face's datasets module. With this module, the text data and labels loaded as a pandas.DataFrame are converted into the dataset format (Dataset) used when training BERT.

huggingface.co

5.2. Importing the BERT Model and Tokenizer

from transformers import BertForSequenceClassification, BertTokenizerFast

# Load the BERT model for text (sequence) classification
id2label = {0: "anger", 1: "joy"}
label2id = {"anger": 0, "joy": 1}
sc_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2, id2label=id2label, label2id=label2id)
# Load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
print(sc_model)

The transformers library provides pre-trained BERT models as well as model classes for fine-tuning on top of pre-trained BERT. Here I use BertForSequenceClassification, the model class for text classification.

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)

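Looking at the network, it consists of an embeddings layer that vectorizes the tokens and embeds position information, an encoder layer made of 12 stacked Transformer encoder blocks that each apply multi-head self-attention, and finally a pooler followed by a classifier layer that converts the encoder output into a two-class output.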
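As a sanity check on the printed architecture, the parameter count can be reproduced by hand from the layer shapes. The sketch below only uses the sizes shown in the printout (each Linear has a bias, each LayerNorm has a weight and a bias); the helper names are mine.

```python
def linear(n_in, n_out):
    """Parameters of a Linear layer with bias."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Parameters of a LayerNorm with elementwise affine (weight + bias)."""
    return 2 * n

hidden, intermediate = 768, 3072

# word, position and token-type embeddings plus their LayerNorm
embeddings = 30522 * hidden + 512 * hidden + 2 * hidden + layer_norm(hidden)

# query, key, value and the attention output dense, plus LayerNorm
attention = 4 * linear(hidden, hidden) + layer_norm(hidden)
# intermediate and output dense layers, plus LayerNorm
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm(hidden)
encoder = 12 * (attention + feed_forward)

pooler = linear(hidden, hidden)
classifier = linear(hidden, 2)

total = embeddings + encoder + pooler + classifier
print(total)
```

The total agrees with the "Number of trainable parameters" value reported in the training log later in this post. Of these, only the 1,538 classifier parameters are new; everything else starts from the pre-trained weights.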

5.3. Loading and Preparing the Data

import pandas as pd
import numpy as np

file = './drive/MyDrive/Emotions.xlsx'

# The emotion label is stored in the index (first column of the sheet)
data = pd.read_excel(file, index_col=0)
# Keep only the rows labeled 'anger' or 'joy'
data = data.loc[['anger', 'joy']]

# Split the data, stratified by label
from sklearn.model_selection import train_test_split
train_data, eval_data = train_test_split(data, test_size=0.2, random_state=1995, stratify=data.index)

# Copy the label from the index into a regular column
train_data['label'] = train_data.index
train_data = train_data.reset_index()
eval_data['label'] = eval_data.index
eval_data = eval_data.reset_index()

# Drop the old index column left over from reset_index
train_data = train_data.drop(['Field1'], axis=1)
eval_data = eval_data.drop(['Field1'], axis=1)

# Convert the labels to integers (sorted order: anger -> 0, joy -> 1)
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
train_data['label'] = class_le.fit_transform(train_data['label'])
# Reuse the mapping fitted on the training labels
eval_data['label'] = class_le.transform(eval_data['label'])

The data is loaded as a pandas.DataFrame, split into training and evaluation data, and the labels are converted to integers. As mentioned above, only the examples labeled anger or joy are used.
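One detail worth checking is that the integer labels line up with the id2label mapping given to the model. LabelEncoder assigns integers in sorted order of the class names, which the following plain-Python sketch mimics. `fit_label_mapping` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

train_labels = ["joy", "anger", "joy", "anger", "anger"]
mapping = fit_label_mapping(train_labels)
encoded = [mapping[label] for label in train_labels]
print(mapping)
print(encoded)
```

Because 'anger' sorts before 'joy', anger becomes 0 and joy becomes 1, which is the same assignment as id2label = {0: "anger", 1: "joy"}.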

from datasets import Dataset, DatasetDict
train_dataset = Dataset.from_pandas(train_data)
eval_dataset = Dataset.from_pandas(eval_data)
print(train_dataset)

Dataset.from_pandas from the datasets module converts the pandas.DataFrame data into Dataset format.

Dataset({
    features: ['SIT', 'label'],
    num_rows: 1752
})

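Next, the tokenizer converts the tokens in each text into unique integer IDs. The map method of a Dataset applies the preprocessing function to every example, and it can also process examples in batches.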

def preprocess_text_classification(example):

  # truncation=True makes the truncation to max_length explicit
  encoded_example = tokenizer(example['SIT'], max_length=256, truncation=True)
  encoded_example['labels'] = example['label']

  return encoded_example

encoded_train_dataset = train_dataset.map(preprocess_text_classification, remove_columns=train_dataset.column_names)
encoded_eval_dataset = eval_dataset.map(preprocess_text_classification, remove_columns=eval_dataset.column_names)

After passing through the tokenizer, the data has the columns 'input_ids', 'token_type_ids', 'attention_mask', and 'labels', as shown below.

encoded_train_dataset

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 1752
})

Let's look at the example at index 0.

print(encoded_train_dataset[0])

{'input_ids': [101, 1045, 2018, 2025, 2464, 2026, 2567, 2005, 2274, 2086, 2004, 2002, 2001, 2025, 1999, 3577, 1012, 1037, 2043, 2002, 3369, 2012, 1996, 3199, 1010, 1045, 2371, 2307, 6569, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': 1}

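input_ids holds the integer IDs that the tokens in the text were converted to. token_type_ids is all 0 here. For tasks that take two sentences, such as predicting whether two sentences are consistent with each other, the tokens of the same sentence must share the same value so that the model can tell the sentences apart. attention_mask indicates which positions the model should attend to: 1 marks a real token and 0 marks a position to ignore, such as padding. Here every value is 1, so nothing is masked out.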
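For comparison, here is a sketch of what these fields look like for a two-sentence input, following BERT's [CLS] A [SEP] B [SEP] layout. It works on already-split word lists rather than calling the real tokenizer, and `build_pair_inputs` is a hypothetical helper for illustration.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] with segment IDs and an attention mask."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding here, so every position is a real token
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask

tokens, token_type_ids, attention_mask = build_pair_inputs(
    ["i", "was", "late"], ["she", "was", "angry"]
)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

In this post every input is a single sentence, which is why token_type_ids contains only 0.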

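# Pad each batch so that all sequences have the same length
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Finally, I instantiate the DataCollatorWithPadding class, which makes the tokenized texts the same length. A neural network needs inputs of uniform size within a batch, so shorter examples are padded with 0 (the ID of BERT's padding token) until they match the longest example in the batch. The instance created here is passed to the Trainer later.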
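The padding step itself is simple enough to sketch in plain Python. This is only an illustration of the idea, not the library's implementation; `pad_batch` is a hypothetical helper and the token IDs in the example are arbitrary.

```python
def pad_batch(batch_input_ids, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 for real tokens, 0 for the padding that should be ignored
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 1045, 2371, 6569, 102], [101, 6569, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps the inputs short when a batch happens to contain only short texts.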

5.4. Defining the BERT Model

Training hyperparameters and related settings are specified when instantiating the TrainingArguments class.

# Training settings
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = './results',
    logging_dir = './logs',
    num_train_epochs = 1,
    per_device_train_batch_size = 8,
    per_device_eval_batch_size= 32,
    warmup_steps=500,
    weight_decay = 0.01,
    evaluation_strategy='steps'
)
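To get a feel for how these settings interact, the sketch below computes the number of optimizer steps and the learning rate under a linear warmup schedule. It assumes the Trainer defaults of a linear schedule and a base learning rate of 5e-5, neither of which is set explicitly in this post, so treat the numbers as an estimate. Both function names are my own.

```python
import math

def total_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs

def linear_warmup_lr(step, base_lr, warmup_steps, num_training_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - warmup_steps)

# 1752 training examples, batch size 8, 1 epoch
steps = total_steps(num_examples=1752, batch_size=8, epochs=1)
print(steps)

# base_lr=5e-5 is an assumed default, not a value taken from the post
final_lr = linear_warmup_lr(steps, base_lr=5e-5, warmup_steps=500, num_training_steps=steps)
print(final_lr)
```

Under these assumptions the run ends before the 500 warmup steps are over, so the learning rate is still ramping up when training stops. With a dataset of this size, a smaller warmup_steps value may be worth trying.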

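5.5. Training the BERT Model (Fine-tuning) and Checking Accuracy

I first define the following function for evaluating classification accuracy.

# Evaluation function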
from sklearn.metrics import accuracy_score

def compute_metrics(result):

  labels = result.label_ids
  preds = result.predictions.argmax(-1)
  acc = accuracy_score(labels, preds)

  return {'accuracy': acc}
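What this function does can be traced with a few made-up logits in plain Python: take the index of the largest logit in each row as the prediction and count how often it matches the label. The values below are invented for illustration.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_from_logits(logits, labels):
    """Share of rows whose largest logit is at the index of the true label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

# Made-up logits: column 0 = anger, column 1 = joy
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 1.5]]
labels = [0, 1, 1, 1]
acc = accuracy_from_logits(logits, labels)
print(acc)
```

Here the third row is predicted as anger although its label is joy, so three of the four predictions are correct.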

Now it is time to fine-tune the pre-trained BERT model. The Trainer class makes training straightforward.

from transformers import Trainer

# Instantiate the Trainer class
trainer = Trainer(
    model=sc_model,
    args = training_args,
    compute_metrics=compute_metrics,
    train_dataset=encoded_train_dataset,
    eval_dataset = encoded_eval_dataset,
    data_collator=data_collator,
)

# Train
trainer.train()

A log like the following is displayed during training.

***** Running training *****
  Num examples = 1752
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 219
  Number of trainable parameters = 109483778
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 [219/219 00:29, Epoch 1/1]
Step    Training Loss   Validation Loss


Training completed. Do not forget to share your model on huggingface.co/models =)


TrainOutput(global_step=219, training_loss=0.43091018885782323, metrics={'train_runtime': 32.6082, 'train_samples_per_second': 53.729, 'train_steps_per_second': 6.716, 'total_flos': 47150323342560.0, 'train_loss': 0.43091018885782323, 'epoch': 1.0})

With training done, I check the accuracy on the validation data.

trainer.evaluate()

***** Running Evaluation *****
  Num examples = 438
  Batch size = 32
 [14/14 01:37]
{'eval_loss': 0.13717880845069885,
 'eval_accuracy': 0.95662100456621,
 'eval_runtime': 1.4407,
 'eval_samples_per_second': 304.009,
 'eval_steps_per_second': 9.717,
 'epoch': 1.0}

The accuracy came out to about 95.7% (eval_accuracy = 0.9566).

# Save the fine-tuned model
import os
!mkdir -p drive/MyDrive/emotion-bert
!cp -r results drive/MyDrive/emotion-bert

base_path = './drive/MyDrive/emotion-bert/'
model_path = base_path + 'model/'

# Create the destination directory if it does not exist yet
os.makedirs(model_path, exist_ok=True)

sc_model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

With Google Drive mounted, the fine-tuned model is saved there.
The model can be loaded back as follows.

# Load the model
loaded_model = BertForSequenceClassification.from_pretrained(model_path)
loaded_tokenizer = BertTokenizerFast.from_pretrained(model_path)

5.6. Checking the Predictions

With pipeline, you pass in text and get back a predicted label and a confidence score. I instantiate it with the fine-tuned BERT model and tokenizer.

from transformers import pipeline
sc_pipeline = pipeline("sentiment-analysis", model=loaded_model, tokenizer=loaded_tokenizer)

For example, given a text from the validation data,

sc_pipeline(eval_dataset['SIT'][1])

the predicted label and confidence score are returned as follows.

[{'label': 'joy', 'score': 0.9818763136863708}]
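The score can be understood as the largest class probability after a softmax over the model's logits. The sketch below reproduces that conversion in plain Python under the assumption that a softmax is applied, which is my understanding of the default for a multi-class classification head. The logits are hypothetical, not values from the fine-tuned model.

```python
import math

id2label = {0: "anger", 1: "joy"}

def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}

prediction = to_prediction([-1.5, 2.5])  # hypothetical logits
print(prediction)
```

With only two classes, a score close to 0.5 means the two logits are nearly equal, so the model has little preference either way.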

Now let's use this pipeline to build a binary-classification confusion matrix for the validation data.

# Store the predicted label, true label, and confidence for each validation example in results
results = []
for i in eval_data.index:
  model_pred = sc_pipeline(eval_data['SIT'][i])[0]

  true_label = id2label[eval_data['label'][i]]

  results.append(
      {
          "id" : i,
          "pred_prob": model_pred['score'],
          "pred_label" : model_pred['label'],
          "true_label" : true_label,
      }
  )

Next, create the confusion matrix.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Compute the confusion matrix (stored as cm so the imported function is not shadowed)
cm = confusion_matrix(
    y_true=[result['true_label'] for result in results],
    y_pred=[result['pred_label'] for result in results],
    labels=['anger', 'joy']
)

# Draw the confusion matrix as an image
ConfusionMatrixDisplay(
    cm,
    display_labels=['anger', 'joy']
).plot()
plt.show()
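For readers who want to see what the matrix contains, here is a plain-Python sketch of the counting, using the same convention as scikit-learn (rows are true labels, columns are predicted labels). The label lists are invented for illustration and `count_confusion` is a hypothetical helper.

```python
def count_confusion(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

y_true = ["anger", "anger", "joy", "joy", "joy"]
y_pred = ["anger", "joy", "joy", "joy", "anger"]
matrix = count_confusion(y_true, y_pred, labels=["anger", "joy"])
for label, row in zip(["anger", "joy"], matrix):
    print(label, row)
```

The diagonal holds the correct predictions, and the off-diagonal cells show which class is mistaken for which.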

Finally, let's look at the examples the model got wrong.

# Check the examples that were predicted incorrectly

failed_results = [res for res in results if res['pred_label'] != res['true_label']]

for result in failed_results:

  text = eval_data['SIT'][result['id']]
  print(f"Text:{text}")
  print(f"Predicted:{result['pred_label']}")
  print(f"Actual:{result['true_label']}")
  print(f"Predicted probability:{result['pred_prob']:.4f}")
  print("-------------------------------------")

The misclassified examples are shown below.

Text:Found money on the road and returned it to the owner through á
police.
Predicted:anger
Actual:joy
Predicted probability:0.9824
-------------------------------------
Text:I was stood up for a date function by someone who I really cared á
for.
Predicted:joy
Actual:anger
Predicted probability:0.8022
-------------------------------------
Text:[ No description.]
Predicted:joy
Actual:anger
Predicted probability:0.8959
-------------------------------------
Text:In a conversation my boyfriend expressed definite and quite á
pretentious opinions and he took up an attitude towards a theory á
which he himself had never known. His information was from á
fortuitous sources.
Predicted:joy
Actual:anger
Predicted probability:0.9674
-------------------------------------
Text:Two years ago, somebody I like very much wanted to give up his á
studies. I tried to make him understand the importance of what he á
was going to do, not only of the difficulty to find a job but also á
because he will decrease his culture etc. This person
Predicted:anger
Actual:joy
Predicted probability:0.8549
-------------------------------------
Text:Blank.
Predicted:anger
Actual:joy
Predicted probability:0.8191
-------------------------------------
Text:My little niece, who is very talkative, suddenly became very á
naughty and began wetting her pants.  She did it one afternoon.
Predicted:joy
Actual:anger
Predicted probability:0.9705
-------------------------------------
Text:This week I was phoned by an old friend with whom I lost contact á
a few years ago.
Predicted:anger
Actual:joy
Predicted probability:0.5844
-------------------------------------
Text:When I found out that the guy I was dating at a particular time á
had a steady relationship going on with someone else for a long á
time.
Predicted:joy
Actual:anger
Predicted probability:0.9733
-------------------------------------
Text:The stories about the way my grandmother treated my mother.
Predicted:joy
Actual:anger
Predicted probability:0.9004
-------------------------------------
Text:The sight of a man who ran amok (fighting) at a dance.
Predicted:joy
Actual:anger
Predicted probability:0.7689
-------------------------------------
Text:Discussing psychology with my friends before the lecture.
Predicted:anger
Actual:joy
Predicted probability:0.7103
-------------------------------------
Text:A month ago when one of my fellow workers got a promotion over á
me. It was just a small promotion but recognition was involved.
Predicted:joy
Actual:anger
Predicted probability:0.9798
-------------------------------------
Text:I got my driving licence after they had frightened me with it's á
difficulty.
Predicted:anger
Actual:joy
Predicted probability:0.9134
-------------------------------------
Text:A member of a religious sect tried to convert me, using really á
evil tricks to persuade me.  After he had left, I was anxious and á
angry for a long time. After the event, I was alone.
Predicted:joy
Actual:anger
Predicted probability:0.7611
-------------------------------------
Text:Being asked to go out by someone I care.
Predicted:anger
Actual:joy
Predicted probability:0.9723
-------------------------------------
Text:When my mother kept me in leading-strings.
Predicted:joy
Actual:anger
Predicted probability:0.9706
-------------------------------------
Text:A father helping his kid to fight other kids.
Predicted:joy
Actual:anger
Predicted probability:0.9776
-------------------------------------

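Let's pick one of them and take a closer look.

Text:This week I was phoned by an old friend with whom I lost contact á
a few years ago.
Predicted:anger
Actual:joy
Predicted probability:0.5844

This example is happy news ("an old friend I had lost contact with phoned me"), yet the model predicts 'anger'. The confidence is low, though, which suggests the model was somewhat unsure.
The sentence contains the negative word 'lost', so let's see how the prediction changes when that word is removed.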

sc_pipeline("This week I was phoned by an old friend with whom I contact á a few years ago.")

The result is 'joy'!

[{'label': 'joy', 'score': 0.9774478673934937}]

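This shows that the word 'lost' was what pushed the model to the wrong prediction. BERT learns from context, but a single word can apparently still have a large effect on the prediction.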
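The same experiment can be repeated for every word instead of one hand-picked word. The sketch below removes each word in turn and records which removals change the predicted label. To keep it runnable without the model, `toy_classify` is a hypothetical stand-in that simply reacts to the word 'lost'; it is not the fine-tuned BERT.

```python
def leave_one_out(text, classify):
    """Return (removed_word, prediction) for every word removed in turn."""
    words = text.split()
    results = []
    for i, word in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        results.append((word, classify(ablated)))
    return results

def toy_classify(text):
    """Hypothetical stand-in for the real model: 'anger' if the text contains 'lost'."""
    return "anger" if "lost" in text.split() else "joy"

text = "an old friend with whom I lost contact"
baseline = toy_classify(text)
# Words whose removal flips the prediction
flips = [word for word, pred in leave_one_out(text, toy_classify) if pred != baseline]
print(baseline, flips)
```

To run this against the real model, one option would be to pass `lambda t: sc_pipeline(t)[0]['label']` as `classify`. Comparing the confidence scores as well as the labels would give a finer picture of how much each word matters.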

6. References

BERT実践入門 PyTorch + Google Colaboratoryで学ぶあたらしい自然言語処理技術

Author: 我妻幸長
Publisher: 翔泳社

大規模言語モデル入門

Authors: 山田育矢, 鈴木正敏, 山田康輔, 李凌寒
Publisher: 技術評論社