Preparing the Dataset


import pandas as pd
import numpy as np

Japanese Realistic Textual Entailment Corpus

Download the dataset. For the dataset's license, check the dataset information.

Clone the dataset repository.

$ git clone https://github.com/megagonlabs/jrte-corpus

Loading the data

pn_df = pd.read_table(
    "jrte-corpus/data/pn.tsv",
    header=None,
    names=["id", "label", "text", "judges", "usage"]
)

pn_df.head()

	id	label	text	judges	usage
0	pn17q00001	0	出張でお世話になりました。	{"0": 3}	test
1	pn17q00002	0	朝食は普通でした。	{"0": 3}	test
2	pn17q00003	1	また是非行きたいです。	{"1": 3}	test
3	pn17q00004	1	また利用したいと思えるホテルでした。	{"1": 3}	test
4	pn17q00005	1	駅から近くて便利でした。	{"0": 1, "1": 2}	test

Converting the labels

pn_df["label"] = pn_df["label"].map({1: "positive", 0: "neutral", -1: "negative"})

pn_df.head()

	id	label	text	judges	usage
0	pn17q00001	neutral	出張でお世話になりました。	{"0": 3}	test
1	pn17q00002	neutral	朝食は普通でした。	{"0": 3}	test
2	pn17q00003	positive	また是非行きたいです。	{"1": 3}	test
3	pn17q00004	positive	また利用したいと思えるホテルでした。	{"1": 3}	test
4	pn17q00005	positive	駅から近くて便利でした。	{"0": 1, "1": 2}	test
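
`Series.map` with a dict returns NaN for any value that is not a key of the dict, so an unexpected label value would silently become missing. The following is a minimal sketch of one way to check for that, using made-up labels rather than the corpus itself; `labels`, `label_names`, and `n_unmapped` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Made-up label values standing in for the corpus column.
# The value 2 is deliberately absent from the mapping below.
labels = pd.Series([1, 0, -1, 0, 2])

label_names = {1: "positive", 0: "neutral", -1: "negative"}
mapped = labels.map(label_names)

# Values missing from the dict become NaN, so counting NaN
# reveals labels that the mapping did not cover.
n_unmapped = int(mapped.isna().sum())

print(mapped.value_counts(dropna=False))
print("unmapped:", n_unmapped)
```

Running the same `isna().sum()` check on `pn_df["label"]` after the conversion would show whether every row received a label name.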

Saving the data

# Create the output directory
import pathlib

pathlib.Path("input").mkdir(exist_ok=True)

save_columns = ["label", "text", "judges"]
pn_df[save_columns].to_csv("input/pn.csv", index=False)

For use in later model training, also save only the samples on which all judges agree.

import json

# A sample is unanimous when its "judges" JSON has exactly one label key.
judges_all_same = pn_df["judges"] \
    .apply(lambda x: len(json.loads(x).keys()) == 1)

pn_df[judges_all_same] \
    .reset_index(drop=True)[save_columns] \
    .to_csv("input/pn_same_judge.csv", index=False)
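
The filter above treats each `judges` value as a JSON object that maps a label to its number of votes, and keeps a row only when that object has a single key. Here is a small self-contained sketch of the same idea on made-up rows; `toy_df` and `is_unanimous` are hypothetical names introduced for illustration.

```python
import json

import pandas as pd

# Made-up rows in the same shape as the "judges" column:
# a JSON object mapping each label to the number of votes it received.
toy_df = pd.DataFrame({
    "text": ["a", "b", "c"],
    "judges": ['{"0": 3}', '{"0": 1, "1": 2}', '{"-1": 3}'],
})


def is_unanimous(judges_json: str) -> bool:
    """Return True when every judge voted for the same label."""
    return len(json.loads(judges_json)) == 1


mask = toy_df["judges"].apply(is_unanimous)
unanimous_df = toy_df[mask].reset_index(drop=True)

print(unanimous_df)
```

The second row is dropped because its votes are split between two labels.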

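Creating training data for a binary classification model

So that the data can be used right away as example training data for a binary classification model, also save the result of preprocessing the text.

# Frame the task as detecting negative reviews among positive, neutral, and negative:
# set negative to 1 and the others (positive, neutral) to 0.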
pn_df["label_num"] = pn_df["label"].map({"positive": 0, "neutral": 0, "negative": 1})
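
As a point of comparison, the same binary target can also be written as a boolean comparison. This is a sketch on made-up labels, not the notebook's method. The two forms agree when every label is one of the three known names, but they differ on missing or unexpected labels: the dict-based `map` yields NaN, while the comparison yields 0.

```python
import pandas as pd

# Made-up label names standing in for pn_df["label"].
labels = pd.Series(["positive", "neutral", "negative", "neutral"])

# Dict-based mapping, as in the notebook: unknown labels would become NaN.
via_map = labels.map({"positive": 0, "neutral": 0, "negative": 1})

# Alternative: comparison against "negative"; unknown labels would become 0.
via_compare = (labels == "negative").astype(int)

print(via_map.tolist())
print(via_compare.tolist())
```

Keeping the dict-based form makes a gap in the label set visible as NaN instead of hiding it in the 0 class.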

import spacy

nlp = spacy.load("ja_core_news_md")

def tokenize(text):
    # Return the lemma (base form) of each token in the text.
    return [token.lemma_ for token in nlp(text)]


pn_df["tokens"] = pn_df["text"].apply(lambda x: " ".join(tokenize(x)))

pn_df.head()

	id	label	text	judges	usage	label_num	tokens
0	pn17q00001	neutral	出張でお世話になりました。	{"0": 3}	test	0	出張でお世話になるますた。
1	pn17q00002	neutral	朝食は普通でした。	{"0": 3}	test	0	朝食は普通ですた。
2	pn17q00003	positive	また是非行きたいです。	{"1": 3}	test	0	また是非行くたいです。
3	pn17q00004	positive	また利用したいと思えるホテルでした。	{"1": 3}	test	0	また利用するたいと思えるホテルですた。
4	pn17q00005	positive	駅から近くて便利でした。	{"0": 1, "1": 2}	test	0	駅から近いて便利ですた。

save_columns = ["text", "label_num", "tokens"]

pn_df[judges_all_same] \
    .reset_index(drop=True)[save_columns] \
    .to_csv("input/pn_same_judge_preprocessed.csv", index=False)

df = pd.read_csv("input/pn_same_judge_preprocessed.csv")
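
Reading the file back is a quick way to confirm that the saved columns survive the round trip. The sketch below shows that check on made-up rows written to a temporary directory, so it does not touch the `input` directory; `toy_df`, `csv_path`, and the file name are illustrative assumptions.

```python
import pathlib
import tempfile

import pandas as pd

# Made-up rows with the same column names as the preprocessed file.
toy_df = pd.DataFrame({
    "text": ["good room", "noisy room"],
    "label_num": [0, 1],
    "tokens": ["good room", "noisy room"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = pathlib.Path(tmp_dir) / "toy_preprocessed.csv"
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    toy_df.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

print(loaded.columns.tolist())
print(loaded["label_num"].tolist())
```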
