{ "cells": [ { "cell_type": "markdown", "id": "6fbb04ce", "metadata": {}, "source": [ "# Exploratory Data Analysis\n", "\n", "In exploratory data analysis, we inspect the data to find features that help distinguish the target classes." ] }, { "cell_type": "code", "execution_count": 1, "id": "a58387a9-1be7-48b8-9d25-ed6246232a04", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "4dc15a91-88e5-4f61-8f18-c1fc2e802ce2", "metadata": {}, "source": [ "Load the data used in this notebook." ] }, { "cell_type": "code", "execution_count": 2, "id": "4833a43a-15fa-45c7-a156-372ca3afb9cb", "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"input/pn.csv\")" ] }, { "cell_type": "markdown", "id": "989443c0-923c-45e3-b777-30c10a6cb60e", "metadata": {}, "source": [ "## Checking sentence length" ] }, { "cell_type": "markdown", "id": "d398edd4", "metadata": {}, "source": [ "For text data, length is one possible feature.\n", "Check whether text length differs from class to class." ] }, { "cell_type": "code", "execution_count": 3, "id": "a04887a2-805d-402f-b532-10a78668d08e", "metadata": {}, "outputs": [], "source": [ "data[\"text_len\"] = data[\"text\"].apply(len)" ] }, { "cell_type": "code", "execution_count": 4, "id": "e53ba1c6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "           count       mean        std  min   25%   50%   75%    max\n", "label                                                               \n", "negative   818.0  24.127139  20.240862  3.0   9.0  14.0  34.0  122.0\n", "neutral   1329.0  15.981189  13.939535  3.0   8.0  12.0  17.0  131.0\n", "positive  3406.0  18.971814  14.750587  3.0  11.0  14.0  20.0  132.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby(\"label\")[\"text_len\"].describe(include=[np.number])" ] }, { "cell_type": "markdown", "id": "b980b4ba", "metadata": {}, "source": [ "You can also compute similar statistics with `agg`, as shown below.\n", "Note that this version reports the median but not the 25% and 75% percentiles." ] }, { "cell_type": "code", "execution_count": 5, "id": "aaeb54ac", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "           text_len                               \n", "               mean        std  amin  amax  median\n", "label                                             \n", "negative  24.127139  20.240862     3   122    14.0\n", "neutral   15.981189  13.939535     3   131    12.0\n", "positive  18.971814  14.750587     3   132    14.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby(\"label\").agg({\"text_len\": [np.mean, np.std, np.min, np.max, np.median]})" ] }, { "cell_type": "markdown", "id": "5063edca", "metadata": {}, "source": [ "Comparing positive and negative, the medians are equal, but the standard deviation is larger for negative,\n", "so the negative text lengths appear to be less concentrated around a single value.\n", "Let's plot the distributions to check." ] }, { "cell_type": "code", "execution_count": 6, "id": "33813885", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# Overlay the text-length histograms of the positive and negative classes\n", "data.query('label == \"positive\"')[\"text_len\"].plot(label=\"positive\", kind=\"hist\", bins=30, alpha=0.8)\n", "data.query('label == \"negative\"')[\"text_len\"].plot(label=\"negative\", kind=\"hist\", bins=30, alpha=0.8)\n", "\n", "plt.xlabel(\"text length\")\n", "plt.legend()" ] }, { "cell_type": "markdown", "id": "2eb1bba8", "metadata": {}, "source": [ "## Relationship between classes and words\n", "\n", "Words that are characteristic of a class are likely to be useful for classification.\n", "In this step, we count word frequencies per class and check whether any words are characteristic of a class.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "8145cbcc", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-05-25 08:13:47.440165: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n", "2022-05-25 08:13:47.440275: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n" ] } ], "source": [ "# Define the tokenizer\n", "import spacy\n", "\n", "nlp = spacy.load(\"ja_core_news_md\")\n", "\n", "def tokenize(text):\n", "    # Return the lemma (dictionary form) of every token\n", "    return [t.lemma_ for t in nlp(text)]" ] }, { "cell_type": "markdown", "id": "5e140d3d", "metadata": {}, "source": [ "Let's check the words that appear in the positive class.\n", "[collections.Counter](https://docs.python.org/ja/3/library/collections.html#collections.Counter) is convenient for counting words." ] }, { "cell_type": "code", "execution_count": 8, "id": "4e792837", "metadata": {}, "outputs": [], "source": [ "# Define a function that lists the most frequent words for a given label.\n", "\n", "from collections import Counter\n", "\n", "\n", "def check_words(label, most_common=10):\n", "    samples = data.query(f'label == \"{label}\"')\n", "\n", "    # Count lemmas over all texts with this label\n", "    cnt = Counter()\n", "    for words in samples[\"text\"].apply(tokenize):\n", "        cnt.update(words)\n", "\n", "    total = sum(cnt.values())\n", "\n", "    # Collect (word, count, relative frequency) tuples\n", "    res = []\n", "\n", "    for word, count in cnt.most_common(most_common):\n", "        res.append((word, count, round(count / total, 4)))\n", "\n", "    return res" ] }, { "cell_type": "code", "execution_count": 9, "id": "a6010ddf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('。', 3149, 0.0821),\n", " ('た', 2565, 0.0669),\n", " ('です', 1823, 0.0475),\n", " ('ます', 1396, 0.0364),\n", " ('の', 1384, 0.0361),\n", " ('も', 1345, 0.0351),\n", " ('が', 971, 0.0253),\n", " ('て', 956, 0.0249),\n", " ('だ', 909, 0.0237),\n", " ('、', 875, 0.0228),\n", " ('は', 829, 0.0216),\n", " ('に', 811, 0.0211),\n", " ('する', 740, 0.0193),\n", " ('で', 485, 0.0126),\n", " ('良い', 477, 0.0124),\n", " ('と', 475, 0.0124),\n", " ('お', 452, 0.0118),\n", " ('を', 366, 0.0095),\n", " ('部屋', 336, 0.0088),\n", " ('ある', 334, 0.0087),\n", " ('とても', 270, 0.007),\n", " ('また', 253, 0.0066),\n", " ('美味しい', 247, 0.0064),\n", " ('利用', 227, 0.0059),\n", " ('!', 223, 0.0058),\n", " ('いる', 221, 0.0058),\n", " ('たい', 212, 0.0055),\n", " ('満足', 206, 0.0054),\n", " ('ホテル', 204, 0.0053),\n", " ('風呂', 201, 0.0052),\n", " ('朝食', 199, 0.0052),\n", " ('思う', 196, 0.0051),\n", " ('できる', 170, 0.0044),\n", " ('から', 158, 0.0041),\n", " ('最高', 146, 0.0038),\n", " ('広い', 130, 0.0034),\n", " ('ない', 129, 0.0034),\n", " ('対応', 126, 0.0033),\n", " ('こと', 125, 0.0033),\n", " ('大', 119, 0.0031),\n", " ('なる', 113, 0.0029),\n", " ('綺麗', 113, 0.0029),\n", " ('よい', 112, 0.0029),\n", " ('行く', 109, 0.0028),\n", " ('いただく', 107, 0.0028),\n", " ('温泉', 104, 0.0027),\n", " ('方', 100, 0.0026),\n", " ('接客', 94, 0.0025),\n", " ('清潔', 90, 0.0023),\n", " ('気持ち', 87, 0.0023)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "positive_words = check_words(\"positive\", 50)\n", "positive_words" ] }, { "cell_type": "code", "execution_count": 10, "id": "5cd81f88", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('。', 727, 0.059),\n", " ('の', 640, 0.0519),\n", " 
('が', 638, 0.0518),\n", " ('た', 603, 0.0489),\n", " ('、', 423, 0.0343),\n", " ('です', 365, 0.0296),\n", " ('は', 344, 0.0279),\n", " ('に', 322, 0.0261),\n", " ('だ', 304, 0.0247),\n", " ('ます', 294, 0.0239),\n", " ('て', 291, 0.0236),\n", " ('と', 241, 0.0196),\n", " ('する', 217, 0.0176),\n", " ('も', 159, 0.0129),\n", " ('ない', 157, 0.0127),\n", " ('を', 133, 0.0108),\n", " ('で', 128, 0.0104),\n", " ('ある', 117, 0.0095),\n", " ('か', 112, 0.0091),\n", " ('残念', 109, 0.0088),\n", " ('.', 107, 0.0087),\n", " ('部屋', 103, 0.0084),\n", " ('お', 88, 0.0071),\n", " ('いる', 84, 0.0068),\n", " ('ぬ', 84, 0.0068),\n", " ('風呂', 65, 0.0053),\n", " ('なる', 62, 0.005),\n", " ('少し', 61, 0.005),\n", " ('思う', 60, 0.0049),\n", " ('ただ', 52, 0.0042),\n", " ('?', 50, 0.0041),\n", " ('朝食', 46, 0.0037),\n", " ('時', 44, 0.0036),\n", " ('から', 43, 0.0035),\n", " ('れる', 43, 0.0035),\n", " ('狭い', 40, 0.0032),\n", " ('てる', 37, 0.003),\n", " ('気', 37, 0.003),\n", " ('こと', 36, 0.0029),\n", " ('良い', 35, 0.0028),\n", " ('な', 32, 0.0026),\n", " ('方', 31, 0.0025),\n", " ('悪い', 30, 0.0024),\n", " ('ば', 30, 0.0024),\n", " ('・', 30, 0.0024),\n", " ('ちょっと', 29, 0.0024),\n", " ('ホテル', 27, 0.0022),\n", " ('しまう', 25, 0.002),\n", " ('だけ', 25, 0.002),\n", " ('ね', 24, 0.0019)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "negative_words = check_words(\"negative\", 50)\n", "negative_words" ] }, { "cell_type": "markdown", "id": "41fa08ec", "metadata": {}, "source": [ "The results show that\n", "words with positive connotations such as 「美味しい」 (delicious), 「利用」 (use), 「満足」 (satisfied), 「綺麗」 (beautiful, clean), and 「清潔」 (clean) are characteristic of the positive label,\n", "while words with negative connotations such as 「残念」 (disappointing), 「狭い」 (cramped), and 「悪い」 (bad) are characteristic of the negative label.\n", "\n", "On the other hand, many tokens are common to both classes, such as the punctuation mark 「。」 and the particle 「が」.\n", "Such shared tokens are good candidates for a stop word list." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", 
"pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 5 }