{ "cells": [ { "cell_type": "markdown", "id": "6fbb04ce", "metadata": {}, "source": [ "# 探索的データ分析\n", "\n", "探索的データ分析では分類対象のクラスに寄与する特徴量を見つける目的でデータをチェックします。" ] }, { "cell_type": "code", "execution_count": 1, "id": "a58387a9-1be7-48b8-9d25-ed6246232a04", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "4dc15a91-88e5-4f61-8f18-c1fc2e802ce2", "metadata": {}, "source": [ "利用するデータをロードします。" ] }, { "cell_type": "code", "execution_count": 2, "id": "4833a43a-15fa-45c7-a156-372ca3afb9cb", "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"input/pn.csv\")" ] }, { "cell_type": "markdown", "id": "989443c0-923c-45e3-b777-30c10a6cb60e", "metadata": {}, "source": [ "## 文の長さのチェック" ] }, { "cell_type": "markdown", "id": "d398edd4", "metadata": {}, "source": [ "テキストデータでは長さが特徴の一つになります。\n", "クラスごとにテキストの長さに特徴がないかを確認します。" ] }, { "cell_type": "code", "execution_count": 3, "id": "a04887a2-805d-402f-b532-10a78668d08e", "metadata": {}, "outputs": [], "source": [ "data[\"text_len\"] = data[\"text\"].apply(lambda x: len(x))" ] }, { "cell_type": "code", "execution_count": 4, "id": "e53ba1c6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanstdmin25%50%75%max
label
negative818.024.12713920.2408623.09.014.034.0122.0
neutral1329.015.98118913.9395353.08.012.017.0131.0
positive3406.018.97181414.7505873.011.014.020.0132.0
\n", "
" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "label \n", "negative 818.0 24.127139 20.240862 3.0 9.0 14.0 34.0 122.0\n", "neutral 1329.0 15.981189 13.939535 3.0 8.0 12.0 17.0 131.0\n", "positive 3406.0 18.971814 14.750587 3.0 11.0 14.0 20.0 132.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby(\"label\")[\"text_len\"].describe(include=[np.number])" ] }, { "cell_type": "markdown", "id": "b980b4ba", "metadata": {}, "source": [ "なお、`agg`を使って次のように調べることができます。\n", "この場合はパーセンタイルは出ないことに注意します。" ] }, { "cell_type": "code", "execution_count": 5, "id": "aaeb54ac", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
text_len
meanstdaminamaxmedian
label
negative24.12713920.240862312214.0
neutral15.98118913.939535313112.0
positive18.97181414.750587313214.0
\n", "
" ], "text/plain": [ " text_len \n", " mean std amin amax median\n", "label \n", "negative 24.127139 20.240862 3 122 14.0\n", "neutral 15.981189 13.939535 3 131 12.0\n", "positive 18.971814 14.750587 3 132 14.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby(\"label\").agg({\"text_len\": [np.mean, np.std, np.min, np.max, np.median]})" ] }, { "cell_type": "markdown", "id": "5063edca", "metadata": {}, "source": [ "positiveとnegativeを比較したとき、中央値は等しいですが、標準偏差がnegativeの方が大きく、\n", "テキストの長さが一点には集中していなさそうです。\n", "描画して確かめてみましょう。" ] }, { "cell_type": "code", "execution_count": 6, "id": "33813885", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYsAAAEGCAYAAACUzrmNAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAc8UlEQVR4nO3dfZxWZb3v8c83IAfSRBE9ChqjmUCAaMPDPmR6xPAhFU1UCrcQlPryIdyWRaZb9zZPlB4fM82njRoqBCLkaVeGmMeOiqCIIBpoiIOobBRSARX97T/WNTTqDOsG5n5ivu/Xa16z1rXWve7fvXT4vq5rrftaigjMzMw25VPlLsDMzCqfw8LMzHI5LMzMLJfDwszMcjkszMwsV9tyF1AMu+yyS3Tr1q3cZZiZVZW5c+f+V0R0bmrbNhkW3bp1Y86cOeUuw8ysqkh6qbltHoYyM7NcDgszM8vlsDAzs1zb5DULM9v2vf/++9TX17N+/fpyl1J1ampq6Nq1K+3atSv4NQ4LM6tK9fX17LDDDnTr1g1J5S6nakQEq1ator6+ntra2oJf52EoM6tK69evp1OnTg6KzSSJTp06bXaPzGFhZlXLQbFltuS8OSzMzCyXr1mY2TbhmOseadHj/facL7fo8Zpz44030qFDB0499VQmTJjAkCFD2GOPPQD49re/zXnnnUfPnj1LUsumOCxKoND/iUv1P6eZVY4zzjhj4/KECRPo1avXxrC45ZZbylXWJ3gYysxsCy1dupTu3bszYsQIevTowbBhw1i7di0zZ87kgAMOoHfv3owePZp3330XgHHjxtGzZ0/69OnD97//fQAuueQSrrjiCqZMmcKcOXMYMWIEffv2Zd26dRxyyCHMmTOHG2+8kfPPP3/j+06YMIGzzz4bgF//+tf079+fvn37cvrpp/PBBx8U5bM6LMzMtsLzzz/PmWeeyaJFi/jsZz/LlVdeyahRo5g0aRLPPPMMGzZs4IYbbmDVqlVMmzaNhQsXMn/+fC688MKPHGfYsGHU1dUxceJE5s2bR/v27TduO+GEE5g2bdrG9UmTJjF8+HAWLVrEpEmT+Mtf/sK8efNo06YNEydOLMrndFiYmW2FPffck0GDBgFwyimnMHPmTGpra/nCF74AwMiRI3n44YfZcccdqampYcyYMdx777106NCh4Pfo3Lkze++9N4899hirVq3iueeeY9CgQcycOZO5c+fSr18/+vbty8yZM3nxxReL8jl9zcLMbCt8/DbUjh07smrVqk/s17ZtW2bPns3MmTOZMmUKv/jFL3jwwQcLfp/hw4czefJkunfvzvHHH48kIoKRI0fy05/+dKs/Rx73LMzMtsKyZct49NFHAbjrrruoq6tj6dKlLFmyBIA777yTgw8+mLfffps1a9Zw1FFHcdVVV/H0009/4lg77LADb731VpPvc/zxxzN9+nTuvvtuhg8fDsDgwYOZMmUKr7/+OgBvvPEGL73U7CzjW8U9CzPbJpTrbsL99tuP66+/ntGjR9OzZ0+uvfZaBg4cyIknnsiGDRvo168fZ5xxBm+88QZDhw5l/fr1RARXXnnlJ441atQozjjjDNq3b78xgBrstNNO9OjRg2effZb+/fsD0LNnT37yk58wZMgQPvzwQ9q1a8f111/P5z73uRb/nIqIFj9oudXV1UUlPfzIt86atbxFixbRo0ePstawdOlSjj76aBYsWFDWOrZEU+dP0tyIqGtqfw9DmZlZLoeFmdkW6tatW1X2KraEw8LMzHIVLSwk3SbpdUkLGrXtLOkBSYvT751SuyRdK2mJpPmSDmz0mpFp/8WSRharXjMza14xexYTgCM+1jYOmBkR+wIz0zrAkcC+6ec04AbIwgW4GBgA9AcubggYMzMrnaKFRUQ8DLzxseahwO1p+XbguEbtd0TmMaCjpN2Bw4EHIuKNiHgTeIBPBpCZmRVZqb9nsVtErEjLrwK7peUuwMuN9qtPbc21f4Kk08h6Jey1114tWLKZVYVfHdyyxzv9zy17vC2wevVq7rrrLs4880wAXnnlFb773e8yZcqUktdStgvckX3Bo8W+5BERN0VEXUTUde7cuaUOa2ZWNqtXr+aXv/zlxvU99tijLEEBpQ+L19LwEun366l9ObBno/26prbm2s3Mym7p0qX06NGD73znO3zxi19kyJAhrFu3jptvvpl+/fqx//77c8IJJ7B27VoAXnjhBQYOHEjv3r258MIL2X777QF4++23GTx4MAceeCC9e/dm+vTpQDal+QsvvEDfvn05//zzWbp0Kb169QJg4MCBLFy4cGMtDdOZv/POO4wePZr+/ftzwAEHbDzW1ip1WMwAGu5oGglMb9R+aroraiCwJg1X/QEYImmndGF7SGozM6sIixcv5qyzzmLhwoV07NiRqVOn8vWvf50nnniCp59+mh49enDrrbcCMHbsWMaOHcszzzxD165dNx6jpqaGadOm8eSTTzJr1iy+973vERGMHz+effbZh3nz5nH55Zd/5H1PPvlkJk+eDMCKFStYsWIFdXV1XHbZZRx66KHMnj2bWbNmcf755/POO+9s9ecs5q2zdwOPAvtJqpc0BhgPfFXSYuCwtA7wO+BFYAlwM3AmQES8AVwKPJF+/j21mZlVhNraWvr27QvAl770JZYuXcqCBQs46KCD6N27NxMnTtzYA3j00Uc58cQTAfjmN7+58RgRwQUXXECfPn047LDDWL58Oa+99tom3/ekk07aOCQ1efJkhg0bBsAf//hHxo8fT9++fTnkkENYv349y5Yt2+rPWbQL3BHxjWY2DW5i3wDOauY4twG3tWBpZmYtZrvtttu43KZNG9atW8eoUaO477772H///ZkwYQIPPfTQJo8xceJEVq5cydy5c2nXrh3dunVj/fr1m3xNly5d6NSpE/Pnz2fSpEnceOONQBY8U6dOZb/99tvqz9aYv8FtZtbC3nrrLXbffXfef//9jzy5buDAgUydOhWAe+65Z2P7mjVr2HXXXWnXrh2zZs3aOM34pqYsh2wo6uc//zlr1qyhT58+ABx++OFcd911NEwS+9RTT7XIZ/IU5Wa2baiAW10bXHrppQwYMIDOnTszYMCAjf/gX3311ZxyyilcdtllHHHEEey4444AjBgxgmOOOYbevXtTV1dH9+7dAejUqRODBg2iV69eHHnkkZx11kcHYIYNG8bYsWO56KKLNrZddNFFnHvuufTp04cPP/yQ2tpa7r///q3+TJ6ivAQ8RblZy6uEKco319q1a2nfvj2SuOeee7j77rtb7G6lzbW5U5S7Z2FmViJz587l7LPPJiLo2LEjt91WPZdjHRZmZiVy0EEHNfk41WrgC9xmVrW2xWH0UtiS8+awMLOqVFNTw6pVqxwYmykiWLVqFTU1NZv1Og9DmVlV6tq1K/X19axcubLcpVSdmpqaj3yDvBAOCzOrSu3ataO2trbcZbQaHoYyM7NcDgszM8vlsDAzs1wOCzMzy+WwMDOzXA4LMzPL5bAwM7NcDgszM8vlsDAzs1wOCzMzy+WwMDOzXA4LMzPL5bAwM7NcDgszM8vlKcq3wjHXPVLuEszMSsI9CzMzy+WwMDOzXA4LMzPL5bAwM7NcDgszM8tVlrCQ9C+SFkpaIOluSTWSaiU9LmmJpEmSPp323S6tL0nbu5WjZjOz1qzkYSGpC/BdoC4iegFtgOHAz4CrIuLzwJvAmPSSMcCbqf2qtJ+ZmZVQuYah2gLtJbUFOgArgEOBKWn77cBxaXloWidtHyxJpSvVzMxKHhYRsRy4AlhGFhJrgLnA6ojYkHarB7qk5S7Ay+m1G9L+nUpZs5lZa1eOYaidyHoLtcAewGeAI1rguKdJmiNpzsqVK7f2cGZm1kg5hqEOA/4WESsj4n3gXmAQ0DENSwF0BZan5eXAngBp+47Aqo8fNCJuioi6iKjr3LlzsT+DmVmrUo6wWAYMlNQhXXsYDDwLzAKGpX1GAtPT8oy0Ttr+YERECes1M2v1ynHN4nGyC9VPAs+kGm4CfgicJ2kJ2TWJW9NLbgU6pfbzgHGlrtnMrLUry6yzEXExcPHHml8E+jex73rgxFLUZWZmTfM3uM3MLJfDwszMcjkszMwsl8PCzMxyOSzMzCyXw8LMzHI5LMzMLJfDwszMcjkszMwsl8PCzMxyOSzMzCyXw8LMzHI5LMzMLJfDwszMcjkszMwsl8PCzMxyOSzMzCxXQWEhqXexCzEzs8pVaM/il5JmSzpT0o5FrcjMzCpOQWEREQcBI4A9gbmS7pL01aJWZmZmFaPgaxYRsRi4EPghcDBwraTnJH29WMWZmVllKPSaRR9JVwGLgEOBYyKiR1q+qoj1mZlZBWhb4H7XAbcAF0TEuobGiHhF0oVFqczMzCpGoWHxNWBdRHwAIOlTQE1ErI2IO4tWnZmZVYRCr1n8CWjfaL1DajMzs1ag0LCoiYi3G1bScofilGRmZpWm0LB4R9KBDSuSvgSs28T+Zma2DSn0msW5wG8kvQII+B/AycUqyszMKktBYRERT0jqDuyXmp6PiPeLV5aZmVWSQnsWAP2Abuk1B0oiIu4oSlVmZlZRCv1S3p3AFcCXyUKjH1C3pW8qqaOkKekb4Isk/ZOknSU9IGlx+r1T2leSrpW0RNL8xtdOzMysNArtWdQBPSMiWuh9rwF+HxHDJH2a7M6qC4CZETFe0jhgHNnUIkcC+6afAcAN6beZmZVIoXdDLSC7qL3V0qy1XwFuBYiI9yJiNTAUuD3tdjtwXFoeCtwRmceAjpJ2b4lazMysMIX2LHYBnpU0G3i3oTEijt2C96wFVgL/IWl/YC4wFtgtIlakfV4FdkvLXYCXG72+PrWtaNSGpNOA0wD22muvLSjLzMyaU2hYXNLC73kgcE5EPC7pGrIhp40iIiRt1pBXRNwE3ARQV1fXUsNlZmZG4c+z+DOwFGiXlp8AntzC96wH6iPi8bQ+hSw8XmsYXkq/X0/bl5M9R6NB19RmZmYlUujdUN8h+0f9V6mpC3DflrxhRLwKvCyp4Tsbg4FngRnAyNQ2EpielmcAp6a7ogYCaxoNV5mZWQkUOgx1FtAfeByyByFJ2nUr3vccYGK6E+pF4FtkwTVZ0hjgJeCktO/vgKOAJcDatK+ZmZVQoWHxbkS8JwkASW2BLb4uEBHzaPp7GoOb2DfIwsrMzMqk0Ftn/yzpAqB9evb2b4DfFq8sMzOrJIWGxTiy212fAU4nGxryE/LMzFqJQicS/BC4Of2YmVkrU1BYSPobTVyjiIi9W7wiMzOrOJszN1SDGuBEYOeWL8fMzCpRoV/KW9XoZ3lEXA18rbilmZlZpSh0GKrxtOCfIutpbM6zMMzMrIoV+g/+/2m0vIFs6o+Tmt7VzMy2NYXeDfW/il2ImZlVrkKHoc7b1PaIuLJlyjEzs0q0OXdD9SOb1A/gGGA2sLgYRZmZWWUpNCy6AgdGxFsAki4B/m9EnFKswszMrHIUGha7Ae81Wn+PfzzJzlrIMdc9UtB+vz3ny0WuxMzsowoNizuA2ZKmpfXj+Mfzss3MbBtX6N1Ql0n6T+Cg1PStiHiqeGWZmVklKXTWWYAOwN8j4hqgXlJtkWoyM7MKU+hjVS8Gfgj8KDW1A35drKLMzKyyFNqzOB44FngHICJeAXYoVlFmZlZZCg2L99LjTQNA0meKV5KZmVWaQsNisqRfAR0lfQf4E34QkplZq5F7N5QkAZOA7sDfgf2Af42IB4pcm5mZVYjcsIiIkPS7iOgNOCDMzFqhQoehnpTUr6iVmJlZxSr0G9wDgFMkLSW7I0pknY4+xSrMzMwqxybDQtJeEbEMOLxE9ZiZWQXK61ncRzbb7EuSpkbECSWoyczMKkzeNQs1Wt67mIWYmVnlyguLaGbZzMxakbxhqP0l/Z2sh9E+LcM/LnB/tqjVmZlZRdhkWEREm1IVYmZmlWtzpihvUZLaSHpK0v1pvVbS45KWSJok6dOpfbu0viRt71aums3MWquyhQUwFljUaP1nwFUR8XngTWBMah8DvJnar0r7mZlZCZUlLCR1Bb4G3JLWBRwKTEm73E726FaAofzjEa5TgMFpfzMzK5Fy9SyuBn4AfJjWOwGrI2JDWq8HuqTlLsDLAGn7mrT/R0g6TdIcSXNWrlxZxNLNzFqfkoeFpKOB1yNibkseNyJuioi6iKjr3LlzSx7azKzVK3RuqJY0CDhW0lFADfBZ4BqyZ2W0Tb2HrsDytP9yYE+y5363BXYEVpW+bDOz1qvkPYuI+FFEdI2IbsBw4MGIGAHMAoal3UYC09PyjLRO2v5gemqfmZmVSDnvhvq4HwLnSVpCdk3i1tR+K9AptZ8HjCtTfWZmrVY5hqE2ioiHgIfS8otA/yb2WQ+cWNLCzMzsIyqpZ2FmZhXKYWFmZrkcFmZmlsthYWZmuRwWZmaWy2FhZma5HBZmZpbLYWFmZrkcFmZmlsthYWZmuRwWZmaWy2FhZma5HBZmZpbLYWFmZrkcFmZmlsthYWZmuRwWZmaWy2FhZma5HBZmZpbLYWFmZrkcFmZmlsthYWZmuRwWZmaWy2FhZma5HBZmZpbLYWFmZrkcFmZmlsthYWZmuRwWZmaWq+RhIWlPSbMkPStpoaSxqX1nSQ9IWpx+75TaJelaSUskzZd0YKlrNjNr7dqW4T03AN+LiCcl7QDMlfQAMAqYGRHjJY0DxgE/BI4E9k0/A4Ab0u/y+tXBXLn67U3ucl7Ha0pUjJlZcZW8ZxERKyLiybT8FrAI6AIMBW5Pu90OHJeWhwJ3ROYxoKOk3UtbtZlZ61bWaxaSugEHAI8Du0XEirTpVWC3tNwFeLnRy+pT28ePdZqkOZLmrFy5snhFm5m1QmULC0nbA1OBcyPi7423RUQAsTnHi4ibIqIuIuo6d+7cgpWamVlZwkJSO7KgmBgR96bm1xqGl9Lv11P7cmDPRi/vmtrMzKxEynE3lIBbgUURcWWjTTOAkWl5JDC9Ufup6a6ogcCaRsNVZmZWAuW4G2oQ8M/AM5LmpbYLgPHAZEljgJeAk9K23wFHAUuAtcC3il3gMdc9krtP3p1QZmbbkpKHRUQ8AqiZzYOb2D+As4palJmZbZK/wW1mZrkcFmZmlsthYWZmuRwWZmaWy2FhZma5HBZmZpbLYWFmZrnK8aW8VuPK1WML2s9TmZtZpXNYVKFCvmEO8NtzvlzkSsystfAwlJmZ5XLPwgruqYB7K2atlXsWZmaWy2FhZma5PAy1Dduc4SUzs01xz8LMzHI5LMzMLJeHoWyz+DseZq2TexZmZpbLYWFmZrkcFmZmlsvXLJpQ6ASA1ryWvrbhayVm5eWehZmZ5XLPwsrKXxw0qw7uWZiZWS73LCpAIddI/IAkMysnh0WVaMmL7g4eM9tcDotWqKWCpxJDx3dNmRWHw8K2WDU/Y7wYF9YdQLYtc1hY0ZW6J+NrQGYtz2FhVaPSvyzpITDbllVNWEg6ArgGaAPcEhHjy1ySVbFyBs/iS5vftiU9HoePlUJVhIWkNsD1wFeBeuAJSTMi4tnyVmZWfu7RWClURVgA/YElEfEigKR7gKGAw8K2KcXs8TTVoymkJ9MSNe276/bZwul/3uR+x1z3SIveOOGAbDmKiHLXkEvSMOCIiPh2Wv9nYEBEnN1on9OA09LqfsDzmzjkLsB/FancUqjm+l17ebj28qmm+j8XEZ2b2lAtPYtcEXETcFMh+0qaExF1RS6paKq5ftdeHq69fKq9/gbVMjfUcmDPRutdU5uZmZVAtYTFE8C+kmolfRoYDswoc01mZq1GVQxDRcQGSWcDfyC7dfa2iFi4FYcsaLiqglVz/a69PFx7+VR7/UCVXOA2M7PyqpZhKDMzKyOHhZmZ5Wp1YSHpCEnPS1oiaVy569kUSXtKmiXpWUkLJY1N7TtLekDS4vR7p3LX2hxJbSQ9Jen+tF4r6fF0/ielGxYqjqSOkqZIek7SIkn/VGXn/V/S/zMLJN0tqaZSz72k2yS9LmlBo7Ymz7Uy16bPMF/SgeWrvNnaL0//38yXNE1Sx0bbfpRqf17S4WUpegu1qrBoNG3IkUBP4BuSepa3qk3aAHwvInoCA4GzUr3jgJkRsS8wM61XqrHAokbrPwOuiojPA28CY8pSVb5rgN9HRHdgf7LPUBXnXVIX4LtAXUT0IrspZDiVe+4nAEd8rK25c30ksG/6OQ24oUQ1NmcCn6z9AaBXRPQB/gr8CCD97Q4Hvphe88v0b1JVaFVhQaNpQyLiPaBh2pCKFBErIuLJtPwW2T9YXchqvj3tdjtwXFkKzCGpK/A14Ja0LuBQYErapSJrl7Qj8BXgVoCIeC8iVlMl5z1pC7SX1BboAKygQs99RDwMvPGx5ubO9VDgjsg8BnSUtHtJCm1CU7VHxB8jYkNafYzse2GQ1X5PRLwbEX8DlpD9m1QVWltYdAFebrRen9oqnqRuwAHA48BuEbEibXoV2K1cdeW4GvgB8GFa7wSsbvSHVKnnvxZYCfxHGkK7RdJnqJLzHhHLgSuAZWQhsQaYS3Wc+wbNnetq+xseDfxnWq622j+itYVFVZK0PTAVODci/t54W2T3Plfc/c+SjgZej4i55a5lC7QFDgRuiIgDgHf42JBTpZ53gDS+P5Qs9PYAPsMnh0qqRiWf602R9GOyoeSJ5a6lJbS2sKi6aUMktSMLiokRcW9qfq2h651+v16u+jZhEHCspKVkw32Hkl0H6JiGRqByz389UB8Rj6f1KWThUQ3nHeAw4G8RsTIi3gfuJfvvUQ3nvkFz57oq/oYljQKOBkbEP77MVhW1N6e1hUVVTRuSxvhvBRZFxJWNNs0ARqblkcD0UteWJyJ+FBFdI6Ib2Xl+MCJGALOAYWm3Sq39VeBlSfulpsFk0+FX/HlPlgEDJXVI/w811F/x576R5s71DODUdFfUQGBNo+GqiqDsQW0/AI6NiLWNNs0AhkvaTlIt2UX62eWocYtERKv6AY4iu0PhBeDH5a4np9Yvk3W/5wPz0s9RZGP/M4HFwJ+Anctda87nOAS4Py3vTfYHsgT4DbBduetrpua+wJx07u8Ddqqm8w78G/AcsAC4E9iuUs89cDfZtZX3yXp1Y5o714DI7mh8AXiG7I6vSqt9Cdm1iYa/2Rsb7f/jVPvzwJHlPveb8+PpPszMLFdrG4YyM7Mt4LAwM7NcDgszM8vlsDAzs1wOCzMzy+WwsFYvzTB75la8vq+ko5rZdkjDjLstSdJxjSfBlPSQpLqWfh+zBg4LM+gIbHFYkH0no8mwKKLjyGZONisJh4UZjAf2kTRP0uUAks6X9ER6JsG/pbbjJc1M3x7eXdJfJe0F/Dtwcnr9yc29iaTPpOcfzE4TFA5N7aMk3Svp9+n5DT9v9Jox6X1mS7pZ0i8k/U/gWODy9J77pN1PTPv9VdJBxTlV1lq1zd/FbJs3juz5A30BJA0hm4qhP9k3hmdI+kpETJN0AnAW2cR8F0fEMkn/SvZN4rNz3ufHZNOejE4PxJkt6U9pW1+yWYXfBZ6XdB3wAXAR2bxUbwEPAk9HxP+XNIPsW/FTUs0AbSOifxoSu5hsjiizFuGwMPukIennqbS+PVl4PAycQzaFxmMRcfcWHPdYSd9P6zXAXml5ZkSsAZD0LPA5YBfgzxHxRmr/DfCFTRy/YaLJuUC3zazNbJMcFmafJOCnEfGrJrZ1JXs+x26SPhURHzaxz6aOe0JEPP+RRmkAWY+iwQds2d9mwzG29PVmzfI1C7NsiGeHRut/AEan54ggqYukXdP03rcB3yB7auF5zby+OX8AzkkzwSLpgJz9nwAOlrRTeu8TNlGzWVE5LKzVi4hVwF8kLZB0eUT8EbgLeFTSM2TPs9gBuAD4fxHxCFlQfFtSD7Kpv3vmXeAGLgXaAfMlLUzrm6prOfC/yWaK/QuwlOypd5A9I+T8dKF8n6aPYNZyPOusWQWTtH1EvJ16FtOA2yJiWrnrstbHPQuzynaJpHlkF9X/RvZsDbOSc8/CzMxyuWdhZma5HBZmZpbLYWFmZrkcFmZmlsthYWZmuf4bfpuPm9RXv20AAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "data.query('label == \"positive\"')[\"text_len\"].plot(label=\"positive\", kind=\"hist\", bins=30, alpha=0.8)\n", "data.query('label == \"negative\"')[\"text_len\"].plot(label=\"nagative\", kind=\"hist\", bins=30, alpha=0.8)\n", "\n", "plt.xlabel(\"text length\")\n", "plt.legend()" ] }, { "cell_type": "markdown", "id": "2eb1bba8", "metadata": {}, "source": [ "## クラスと単語の関係\n", "\n", "クラスに特徴的な単語は分類で役立つ可能性が高いです。\n", "このステップでは、クラス毎に単語の頻度を確認して、クラスに特徴的な単語はないかを確認します。\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "8145cbcc", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-05-25 08:13:47.440165: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n", "2022-05-25 08:13:47.440275: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n" ] } ], "source": [ "# トークナイザを定義\n", "import spacy\n", "\n", "nlp = spacy.load(\"ja_core_news_md\")\n", "\n", "def tokenize(text):\n", " return [t.lemma_ for t in nlp(text)]" ] }, { "cell_type": "markdown", "id": "5e140d3d", "metadata": {}, "source": [ "positiveに出現する単語をチェックしてみましょう。\n", "単語のカウントには [collections.Counter](https://docs.python.org/ja/3/library/collections.html#collections.Counter) を使うと便利です。" ] }, { "cell_type": "code", "execution_count": 8, "id": "4e792837", "metadata": {}, "outputs": [], "source": [ "# label毎に出現する単語をチェックする関数を定義します。\n", "\n", "from collections import Counter\n", "\n", "\n", "def check_words(label, most_commont=10):\n", " samples = data.query(f'label == \"{label}\"')\n", "\n", " cnt = Counter()\n", " for words in samples[\"text\"].apply(lambda sent: tokenize(sent)):\n", " cnt.update(words)\n", " \n", " total = sum(cnt.values())\n", " \n", " res = []\n", " \n", " for word, count in cnt.most_common(most_commont):\n", " res.append((word, count, round(count/total, 4)))\n", "\n", " return res" ] }, { "cell_type": "code", "execution_count": 9, "id": "a6010ddf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('。', 3149, 0.0821),\n", " ('た', 2565, 0.0669),\n", " ('です', 1823, 0.0475),\n", " ('ます', 1396, 0.0364),\n", " ('の', 1384, 0.0361),\n", " ('も', 1345, 0.0351),\n", " ('が', 971, 0.0253),\n", " ('て', 956, 0.0249),\n", " ('だ', 909, 0.0237),\n", " ('、', 875, 0.0228),\n", " ('は', 829, 0.0216),\n", " ('に', 811, 0.0211),\n", " ('する', 740, 0.0193),\n", " ('で', 485, 0.0126),\n", " ('良い', 477, 0.0124),\n", " ('と', 475, 0.0124),\n", " ('お', 452, 0.0118),\n", " ('を', 366, 0.0095),\n", " ('部屋', 336, 0.0088),\n", " ('ある', 334, 0.0087),\n", " ('とても', 270, 0.007),\n", " ('また', 253, 0.0066),\n", " ('美味しい', 247, 0.0064),\n", " ('利用', 227, 0.0059),\n", " ('!', 223, 0.0058),\n", " ('いる', 221, 0.0058),\n", " ('たい', 212, 0.0055),\n", " ('満足', 206, 0.0054),\n", " ('ホテル', 204, 0.0053),\n", " ('風呂', 201, 0.0052),\n", " ('朝食', 199, 0.0052),\n", " ('思う', 196, 0.0051),\n", " ('できる', 170, 0.0044),\n", " ('から', 158, 0.0041),\n", " ('最高', 146, 0.0038),\n", " ('広い', 130, 0.0034),\n", " ('ない', 129, 0.0034),\n", " ('対応', 126, 0.0033),\n", " ('こと', 125, 0.0033),\n", " ('大', 119, 0.0031),\n", " ('なる', 113, 0.0029),\n", " ('綺麗', 113, 0.0029),\n", " ('よい', 112, 0.0029),\n", " ('行く', 109, 0.0028),\n", " ('いただく', 107, 0.0028),\n", " ('温泉', 104, 0.0027),\n", " ('方', 100, 0.0026),\n", " ('接客', 94, 0.0025),\n", " ('清潔', 90, 0.0023),\n", " ('気持ち', 87, 0.0023)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "positive_words = check_words(\"positive\", 50)\n", "positive_words" ] }, { "cell_type": "code", "execution_count": 10, "id": "5cd81f88", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('。', 727, 0.059),\n", " ('の', 640, 0.0519),\n", " ('が', 638, 0.0518),\n", " ('た', 603, 0.0489),\n", " ('、', 423, 0.0343),\n", " ('です', 365, 0.0296),\n", " ('は', 344, 0.0279),\n", " ('に', 322, 0.0261),\n", " ('だ', 304, 0.0247),\n", " ('ます', 294, 0.0239),\n", " ('て', 291, 0.0236),\n", " ('と', 241, 0.0196),\n", " ('する', 217, 0.0176),\n", " ('も', 159, 0.0129),\n", " ('ない', 157, 0.0127),\n", " ('を', 133, 0.0108),\n", " ('で', 128, 0.0104),\n", " ('ある', 117, 0.0095),\n", " ('か', 112, 0.0091),\n", " ('残念', 109, 0.0088),\n", " ('.', 107, 0.0087),\n", " ('部屋', 103, 0.0084),\n", " ('お', 88, 0.0071),\n", " ('いる', 84, 0.0068),\n", " ('ぬ', 84, 0.0068),\n", " ('風呂', 65, 0.0053),\n", " ('なる', 62, 0.005),\n", " ('少し', 61, 0.005),\n", " ('思う', 60, 0.0049),\n", " ('ただ', 52, 0.0042),\n", " ('?', 50, 0.0041),\n", " ('朝食', 46, 0.0037),\n", " ('時', 44, 0.0036),\n", " ('から', 43, 0.0035),\n", " ('れる', 43, 0.0035),\n", " ('狭い', 40, 0.0032),\n", " ('てる', 37, 0.003),\n", " ('気', 37, 0.003),\n", " ('こと', 36, 0.0029),\n", " ('良い', 35, 0.0028),\n", " ('な', 32, 0.0026),\n", " ('方', 31, 0.0025),\n", " ('悪い', 30, 0.0024),\n", " ('ば', 30, 0.0024),\n", " ('・', 30, 0.0024),\n", " ('ちょっと', 29, 0.0024),\n", " ('ホテル', 27, 0.0022),\n", " ('しまう', 25, 0.002),\n", " ('だけ', 25, 0.002),\n", " ('ね', 24, 0.0019)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "negative_words = check_words(\"negative\", 50)\n", "negative_words" ] }, { "cell_type": "markdown", "id": "41fa08ec", "metadata": {}, "source": [ "結果を見ると、\n", "positiveラベルでは「美味しい」「利用」「満足」「綺麗」「清潔」といったポジティブな単語が、\n", "negativeラベルでは「残念」「狭い」「悪い」といったネガティブな単語が特徴的にあらわれていることがわかります。\n", "\n", "一方で、句読点「。」や助詞「が」といった共通の単語も多く存在していることもわかります。\n", "このような共通の単語はストップワードとして定義するとよいでしょう。" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 5 }