Study notes on Transformers for machine learning

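I am studying natural language processing.

So far I have worked through BoW, TF-IDF, Word2Vec, and Doc2Vec, and I am now studying transformers.

I do not understand the theory at all, so these notes cover only what is needed for implementation.

I recently bought the book 「機械学習エンジニアのためのTransformers」 (the Japanese edition of "Natural Language Processing with Transformers"), so I am recording the points where I got stuck while trying its code.

I am a machine learning beginner and get stuck on elementary things, but I want to learn them one at a time.

I am working in Google Colab.

Each item shows both the code that raised the error and the corrected code.

The first code listing in each item is quoted from the book. (It is the code that raised an error in my environment.)

「機械学習エンジニアのためのTransformers」 explains transformers, the library developed by Hugging Face, and is written by its developers.

The official documentation is in English, but this book is in Japanese.

The translation is easy to read, and even as a complete beginner I have been able to keep going so far.

My goal is to study with this book and then apply what I learn to a different classification task.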

Chapter 1
1. No module named 'transformers'
2. TypeError: _sanitize_parameters() got an unexpected keyword argument 'aggregation_starategy'
Chapter 2

Chapter 1

No module named 'transformers'

This occurred when running text classification with the pipeline function.

from transformers import pipeline
classifier = pipeline("text-classification")
text = "..."  # placeholder: any text you want to classify

Installing the library with pip install transformers fixed it.

!pip install transformers
from transformers import pipeline
classifier = pipeline("text-classification")
text = "..."  # placeholder: any text you want to classify
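
The same "No module named ..." error comes up again later for other libraries, so it can help to check what is importable before running a cell. The following is a minimal sketch using only the standard library; the helper name missing_modules and the list of module names are my own choices for illustration, not something from the book.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        # find_spec returns None when no top-level module of that name is found
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Hypothetical list: libraries that these notes end up needing
needed = ["transformers", "datasets", "pandas"]
for name in missing_modules(needed):
    print(f"{name} is not installed; in Colab, try: !pip install {name}")
```

Note that the name used with pip is not always the same as the name used with import, so the printed command is only a first guess.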

TypeError: _sanitize_parameters() got an unexpected keyword argument 'aggregation_starategy'

This occurred during named entity recognition with the pipeline function.

ner_tagger = pipeline("ner", aggregation_starategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

It was just a typo.

Line 1: "aggregation_starategy" → "aggregation_strategy"

import pandas as pd
from transformers import pipeline
ner_tagger = pipeline("ner", aggregation_strategy="simple")  # keyword spelled correctly
outputs = ner_tagger(text)
pd.DataFrame(outputs)
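
A misspelled keyword like this is easy to miss by eye. As a rough sketch, the standard library's difflib can suggest the closest known name. The helper suggest_name and the list of known keywords are assumptions for illustration and are not part of transformers.

```python
import difflib


def suggest_name(wrong, known):
    """Return the entry of `known` most similar to `wrong`, or None if nothing is close."""
    matches = difflib.get_close_matches(wrong, known, n=1, cutoff=0.6)
    return matches[0] if matches else None


# Illustrative list only; check the pipeline documentation for the real keywords
known_keywords = ["aggregation_strategy", "ignore_labels", "stride"]
print(suggest_name("aggregation_starategy", known_keywords))
```

The same helper could be pointed at a list of model names to catch a misspelled checkpoint name.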

Chapter 2

ModuleNotFoundError: No module named 'datasets'

This occurred when downloading a dataset from the Hugging Face Hub.

from datasets import list_datasets

I installed datasets with pip.

!pip install datasets
from datasets import list_datasets

OSError: ditilbert-base-uncased is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models' If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and pass use_auth_token=True.

This occurred when calling the from_pretrained() method of the AutoModel class to use a pretrained model.

from transformers import AutoModel
model_ckpt = "ditilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

It was just a typo.

Wrong: model_ckpt = "ditilbert-base-uncased"

Correct: model_ckpt = "distilbert-base-uncased"

import torch
from transformers import AutoModel
model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

ModuleNotFoundError: No module named 'umap'

This occurred when using umap to project the features onto 2D vectors for visualization.

from umap import UMAP

I tried installing it, but then an error occurred at the import on line 2.

Because the import fails, line 4 does not work either.

!pip install umap
from umap import UMAP
# (omitted)
mapper = UMAP(n_components=2, metric="cosine").fit_(x_scaled)

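Changing the install, the import, and the fit call as follows made it work. The UMAP class is provided by the umap-learn package rather than by the package named umap, and the method is fit, not fit_.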

!pip install umap
!pip install umap-learn
import umap.umap_ as umap
# Initialize UMAP and fit
mapper = umap.UMAP(n_components=2, metric="cosine").fit(x_scaled)
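
The confusion here comes from the install name and the import name being different. As a hedged sketch, importlib.metadata can report which installed distribution provides a given import name. packages_distributions needs Python 3.10 or later, and the helper name providers is my own.

```python
import importlib.metadata


def providers(import_name):
    """Return the installed distributions that provide the top-level module `import_name`."""
    # packages_distributions() exists on Python 3.10+; use an empty mapping otherwise
    lookup = getattr(importlib.metadata, "packages_distributions", None)
    mapping = lookup() if lookup is not None else {}
    return sorted(set(mapping.get(import_name, [])))


# An empty list means no installed distribution provides the module;
# with umap-learn installed, its name should appear in the result.
print(providers("umap"))
```

If more than one distribution is listed for the same import name, that may be a sign that two packages are competing for it.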

ValueError: You need to pass a valid token or login by using huggingface-cli login

When training the model, I was told that I need to log in to the Hugging Face Hub.

from transformers import Trainer
trainer = Trainer(model=model, args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=emotions_encoded["train"],
                  eval_dataset=emotions_encoded["validation"],
                  tokenizer=tokenizer)

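It seems that logging in to Hugging Face is required here. This is presumably because training_args was created with push_to_hub=True, as in the book, which makes Trainer upload the model to the Hub.

Go to Hugging Face and sign up for a provisional registration. A verification email is sent to the address you registered, so click the link in it to verify. After that, you can create an access token from your account page.

There are two kinds of access token, read and write; creating a write token seems to be the correct choice.

Note: I tried a read token first, but it did not work.

The fix is below. I added the login command as the first line. After running it, I was prompted for a token, and pasting the token created on Hugging Face made it work.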

!huggingface-cli login
from transformers import Trainer

trainer = Trainer(model=model, args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=emotions_encoded["train"],
                  eval_dataset=emotions_encoded["validation"],
                  tokenizer=tokenizer)
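
Pasting the token by hand works, but the token can also be read from an environment variable and passed to huggingface_hub.login. The following is a sketch under assumptions: the variable name HF_TOKEN as used here and the helper names read_token and login_if_possible are my own choices, and the login call is only reached when a token is present.

```python
import os


def read_token(env=None, var_name="HF_TOKEN"):
    """Return the token stored in `var_name`, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    return token or None


def login_if_possible(token):
    """Log in to the Hugging Face Hub when a token is available; return False otherwise."""
    if token is None:
        print("No token found; create a write token and set it first.")
        return False
    # Imported lazily so read_token works without huggingface_hub installed
    from huggingface_hub import login
    login(token=token)
    return True


token = read_token()
print("token found" if token else "token not set")
# To actually log in, call: login_if_possible(token)
```

Keeping the token out of the notebook itself also avoids sharing it by accident when the notebook is published.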
