{ "cells": [ { "cell_type": "markdown", "id": "6fbb04ce", "metadata": {}, "source": [ "# 概要把握" ] }, { "cell_type": "code", "execution_count": 1, "id": "a58387a9-1be7-48b8-9d25-ed6246232a04", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "4dc15a91-88e5-4f61-8f18-c1fc2e802ce2", "metadata": {}, "source": [ "## データのロード" ] }, { "cell_type": "code", "execution_count": 2, "id": "4833a43a-15fa-45c7-a156-372ca3afb9cb", "metadata": {}, "outputs": [], "source": [ "pn_df = pd.read_csv(\"input/pn.csv\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "880197f8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labeltextjudges
0neutral出張でお世話になりました。{\"0\": 3}
1neutral朝食は普通でした。{\"0\": 3}
2positiveまた是非行きたいです。{\"1\": 3}
3positiveまた利用したいと思えるホテルでした。{\"1\": 3}
4positive駅から近くて便利でした。{\"0\": 1, \"1\": 2}
\n", "
" ], "text/plain": [ " label text judges\n", "0 neutral 出張でお世話になりました。 {\"0\": 3}\n", "1 neutral 朝食は普通でした。 {\"0\": 3}\n", "2 positive また是非行きたいです。 {\"1\": 3}\n", "3 positive また利用したいと思えるホテルでした。 {\"1\": 3}\n", "4 positive 駅から近くて便利でした。 {\"0\": 1, \"1\": 2}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pn_df.head()" ] }, { "cell_type": "markdown", "id": "1eea0e3e", "metadata": {}, "source": [ "## データサイズを調べる" ] }, { "cell_type": "code", "execution_count": 4, "id": "e50fcf2e-3d72-4f12-b39c-2afb899c23e9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5553, 3)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pn_df.shape" ] }, { "cell_type": "markdown", "id": "3e350ec9", "metadata": {}, "source": [ "## カラムの内容を調べる" ] }, { "cell_type": "code", "execution_count": 5, "id": "d5225be4-e1b4-4f61-a7ae-d5ab0cc1e7ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 5553 entries, 0 to 5552\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 label 5553 non-null object\n", " 1 text 5553 non-null object\n", " 2 judges 5553 non-null object\n", "dtypes: object(3)\n", "memory usage: 130.3+ KB\n" ] } ], "source": [ "pn_df.info()" ] }, { "cell_type": "code", "execution_count": 6, "id": "ecd39f24", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countuniquetopfreq
label55533positive3406
text55535553出張でお世話になりました。1
judges555310{\"1\": 3}2835
\n", "
" ], "text/plain": [ " count unique top freq\n", "label 5553 3 positive 3406\n", "text 5553 5553 出張でお世話になりました。 1\n", "judges 5553 10 {\"1\": 3} 2835" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pn_df.describe(include=[object]).T" ] }, { "cell_type": "markdown", "id": "c4895db2", "metadata": {}, "source": [ "## データの欠損をチェック\n", "\n", "欠損値がないことはinfo()からもわかるが、念の為isnull()の結果をカウントして別途調べておくとないことを二重チェックできて安心できます。" ] }, { "cell_type": "code", "execution_count": 7, "id": "5890187f-43f7-46b1-9d7f-200aa7e39d00", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "label 0\n", "text 0\n", "judges 0\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pn_df.isnull().sum()" ] }, { "cell_type": "markdown", "id": "c91caf61", "metadata": {}, "source": [ "## データのユニーク性をチェック\n", "\n", "データがユニークでないと学習・検証・テストセットに被りが出る可能性があるため、ユニークであることを確認します。\n", "`shape` と `describe` メソッドの結果からもユニークであることがわかりますが、ここでは `nunique` メソッドでチェックします。" ] }, { "cell_type": "code", "execution_count": 8, "id": "6bcc0b56", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5553" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pn_df[\"text\"].nunique()" ] }, { "cell_type": "markdown", "id": "485b8e62", "metadata": {}, "source": [ "`value_count` の結果が1でないものを数えてもユニークであるかどうかを確認できます。" ] }, { "cell_type": "code", "execution_count": 9, "id": "f2e57570", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# value_countsで1ではない数を数えてもわかります。\n", "sum(pn_df[\"text\"].value_counts() != 1)" ] }, { "cell_type": "markdown", "id": "14c25378-76f4-4f1e-acf6-4e4a2e769542", "metadata": {}, "source": [ "## ラベルの分布の確認" ] }, { "cell_type": "markdown", "id": "df2a4608", "metadata": {}, "source": [ "`groupby` の結果を `describe` することでラベル分布を調べることができます。" ] }, { "cell_type": "code", "execution_count": 10, "id": "f4c339b8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textjudges
countuniquetopfreqcountuniquetopfreq
label
negative818818と感じてしまいました。18184{\"-1\": 3}602
neutral13291329出張でお世話になりました。113294{\"0\": 3}749
positive34063406また是非行きたいです。134064{\"1\": 3}2835
\n", "
" ], "text/plain": [ " text judges \n", " count unique top freq count unique top freq\n", "label \n", "negative 818 818 と感じてしまいました。 1 818 4 {\"-1\": 3} 602\n", "neutral 1329 1329 出張でお世話になりました。 1 1329 4 {\"0\": 3} 749\n", "positive 3406 3406 また是非行きたいです。 1 3406 4 {\"1\": 3} 2835" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pn_df.groupby(\"label\").describe(include=[object])" ] }, { "cell_type": "code", "execution_count": 11, "id": "f9cc8eac-59f6-4eec-995b-038b0143b76f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAEaCAYAAAD9iIezAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAWIUlEQVR4nO3df7RdZX3n8ffHgGjVShiuDE2goZqW4q9AbwFXu6b+WELAqdGptdBRo4tZ6czgVKurY3S5ho7IFGdVHV2jjHGZMXS0lPHHmFEqTSlTl3VQAiI/ZYiAQzIo0UBEHRmB7/xxntRjepN7b3JzdsLzfq111t37u/c553u8+Lk7z372PqkqJEl9eNzQDUiSJsfQl6SOGPqS1BFDX5I6YuhLUkcMfUnqyGFDN7A3Rx99dC1btmzoNiTpkHLdddd9p6qmZto2a+gneQLwBeCItv8nquqCJB8FfgPY2XZ9bVXdkCTA+4CzgR+2+vXttVYDb2/7v7OqNuztvZctW8bmzZtna1GSNCbJN/e0bS5H+g8BL6yq7yc5HPhikr9o2/6wqj6x2/5nAcvb4zTgEuC0JEcBFwDTQAHXJdlYVffP7+NIkvbVrGP6NfL9tnp4e+ztMt5VwKXtedcARyY5FjgT2FRVO1rQbwJW7l/7kqT5mNOJ3CSLktwA3McouL/cNl2U5MYk701yRKstAe4Ze/rWVttTfff3WpNkc5LN27dvn9+nkSTt1ZxCv6oeqaoVwFLg1CTPAt4KnAj8KnAU8JaFaKiq1lXVdFVNT03NeB5CkrSP5jVls6oeAK4GVlbVvW0I5yHgPwOntt22AceNPW1pq+2pLkmakFlDP8lUkiPb8hOBFwNfb+P0tNk6LwNubk/ZCLwmI6cDO6vqXuBK4Iwki5MsBs5oNUnShMxl9s6xwIYkixj9kbi8qj6b5K+TTAEBbgD+edv/CkbTNbcwmrL5OoCq2pHkQuDatt87qmrHgn0SSdKscjDfT396erqcpy9J85PkuqqanmnbQX1F7qQtW/u5oVs4oO6++CVDtyBpYN57R5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHZk19JM8IclXknwtyS1J/m2rn5Dky0m2JPnzJI9v9SPa+pa2fdnYa7211W9PcuYB+1SSpBnN5Uj/IeCFVfVcYAWwMsnpwLuA91bVM4D7gfPa/ucB97f6e9t+JDkJOAd4JrAS+GCSRQv4WSRJs5g19Gvk+2318PYo4IXAJ1p9A/CytryqrdO2vyhJWv2yqnqoqu4CtgCnLsSHkCTNzZzG9JMsSnIDcB+wCfgG8EBVPdx22QosactLgHsA2vadwD8Yr8/wHEnSBMwp9KvqkapaASxldHR+4oFqKMmaJJuTbN6+ffuBehtJ6tK8Zu9U1QPA1cDzgCOTHNY2LQW2teVtwHEAbftTge+O12d4zvh7rKuq6aqanpqamk97kqRZzGX2zlSSI9vyE4EXA7cxCv9XtN1WA59pyxvbOm37X1dVtfo5bXbPCcBy4CsL9DkkSXNw2Oy7cCywoc20eRxweVV9NsmtwGVJ3gl8FfhI2/8jwJ8m2QLsYDRjh6q6JcnlwK3Aw8D5VfXIwn4cSdLezBr6VXUjcPIM9TuZYfZNVf0I+O09vNZFwEXzb1OStBC8IleSOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHVk1tBPclySq5PcmuSWJG9o9T9Ksi3JDe1x9thz3ppkS5Lbk5w5Vl/ZaluSrD0wH0mStCeHzWGfh4E3V9X1SZ4CXJdkU9v23qr6k/Gdk5wEnAM8E/g54K+S/GLb/AHgxcBW4NokG6vq1oX4IJKk2c0a+lV1L3BvW34wyW3Akr08ZRVwWVU9BNyVZAtwatu2paruBEhyWdvX0JekCZnXmH6SZcDJwJdb6fVJbkyyPsniVlsC3DP2tK2ttqf67u+xJsnmJJu3b98+n/YkSbOYc+gneTLwSeCNVfU94BLg6cAKRv8SePdCNFRV66pquqqmp6amFuIlJUnNXMb0SXI4o8D/WFV9CqCqvj22/cPAZ9vqNuC4sacvbTX2UpckTcBcZu8E+AhwW1W9Z6x+7NhuLwdubssbgXOSHJHkBGA58BXgWmB5khOSPJ7Ryd6NC/MxJElzMZcj/V8DXg3clOSGVnsbcG6SFUABdwO/B1BVtyS5nNEJ2oeB86vqEYAkrweuBBYB66vqlgX7JJKkWc1l9s4Xgcyw6Yq9POci4KIZ6lfs7XmSpAPLK3IlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOjJr6Cc5LsnVSW5NckuSN7T6UUk2Jbmj/Vzc6kny/iRbktyY5JSx11rd9r8jyeoD97EkSTOZy5H+w8Cbq+ok4HTg/CQnAWuBq6pqOXBVWwc4C1jeHmuAS2D0RwK4ADgNOBW4YNcfCknSZMwa+lV1b1Vd35YfBG4DlgCrgA1ttw3Ay9ryKuDSGrkGODLJscCZwKaq2lFV9wObgJUL+WEkSXs3rzH9JMuAk4EvA8dU1b1t07eAY9ryEuCesadtbbU91Xd/jzVJNifZvH379vm0J0maxZxDP8mTgU8Cb6yq741vq6oCaiEaqqp1VTVdVdNTU1ML8ZKSpGZOoZ/kcEaB/7Gq+lQrf7sN29B+3tfq24Djxp6+tNX2VJckTchcZu8E+AhwW1W9Z2zTRmDXDJzVwGfG6q9ps3hOB3a2YaArgTOSLG4ncM9oNUnShBw2h31+DXg1cFOSG1rtbcDFwOVJzgO+CbyybbsCOBvYAvwQeB1AVe1IciFwbdvvHVW1YyE+hCRpbmYN/ar6IpA9bH7RDPsXcP4eXms9sH4+DUqSFo5X5EpSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1ZNbQT7I+yX1Jbh6r/VGSbUluaI+zx7a9NcmWJLcnOXOsvrLVtiRZu/AfRZI0m7kc6X8UWDlD/b1VtaI9rgBIchJwDvDM9pwPJlmUZBHwAeAs4CTg3LavJGmCDptth6r6QpJlc3y9VcBlVfUQcFeSLcCpbduWqroTIMllbd9b59+yJGlf7c+Y/uuT3NiGfxa32hLgnrF9trbanuqSpAna19C/BHg6sAK4F3j3QjWUZE2SzUk2b9++faFeVpLEPoZ+VX27qh6pqkeBD/OTIZxtwHFjuy5ttT3VZ3rtdVU1XVXTU1NT+9KeJGkP9in0kxw7tvpyYNfMno3AOUmOSHICsBz4CnAtsDzJCUkez+hk78Z9b1uStC9mPZGb5M+A5wNHJ9kKXAA8P8kKoIC7gd8DqKpbklzO6ATtw8D5VfVIe53XA1cCi4D1VXXLQn8YSdLezWX2zrkzlD+yl/0vAi6aoX4FcMW8upMkLSivyJWkjhj6ktQRQ1+SOmLoS1JHDH1J6siss3ekQ8WytZ8buoUD6u6LXzJ0C3oM8Ehfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktSRWUM/yfok9yW5eax2VJJNSe5oPxe3epK8P8mWJDcmOWXsOavb/nckWX1gPo4kaW/mcqT/UWDlbrW1wFVVtRy4qq0DnAUsb481wCUw+iMBXACcBpwKXLDrD4UkaXJmDf2q+gKwY7fyKmBDW94AvGysfmmNXAMcmeRY4ExgU1XtqKr7gU38/T8kkqQDbF/H9I+pqnvb8reAY9ryEuCesf22ttqe6n9PkjVJNifZvH379n1sT5I0k/0+kVtVBdQC9LLr9dZV1XRVTU9NTS3Uy0qS2PfQ/3YbtqH9vK/VtwHHje23tNX2VJckTdC+hv5GYNcMnNXAZ8bqr2mzeE4HdrZhoCuBM5Isbidwz2g1SdIEHTbbDkn+DHg+cHSSrYxm4VwMXJ7kPOCbwCvb7lcAZwNbgB8CrwOoqh1JLgSubfu9o6p2PzksSTrAZg39qjp3D5teNMO+BZy/h9dZD6yfV3eSpAXlFbmS1BFDX5I6YuhLUkcMfUnqiKEvSR2ZdfaOJE3CsrWfG7qFA+bui18ydAt/xyN9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOrJfoZ/k7iQ3JbkhyeZWOyrJpiR3tJ+LWz1J3p9kS5Ibk5yyEB9AkjR3C3Gk/4KqWlFV0219LXBVVS0HrmrrAGcBy9tjDXDJAry3JGkeDsTwzipgQ1veALxsrH5pjVwDHJnk2APw/pKkPdjf0C/gL5Ncl2RNqx1TVfe25W8Bx7TlJcA9Y8/d2mqSpAnZ3+/I/fWq2pbkacCmJF8f31hVlaTm84Ltj8cagOOPP34/25MkjduvI/2q2tZ+3gd8GjgV+PauYZv28762+zbguLGnL2213V9zXVVNV9X01NTU/rQnSdrNPod+kiclecquZeAM4GZgI7C67bYa+Exb3gi8ps3iOR3YOTYMJEmagP0Z3jkG+HSSXa/z8ar6fJJrgcuTnAd8E3hl2/8K4GxgC/BD4HX78d6SpH2wz6FfVXcCz52h/l3gRTPUCzh/X99PkrT/vCJXkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1ZOKhn2RlktuTbEmydtLvL0k9m2joJ1kEfAA4CzgJODfJSZPsQZJ6Nukj/VOBLVV1Z1X9P+AyYNWEe5Ckbh024fdbAtwztr4VOG18hyRrgDVt9ftJbp9Qb0M4GvjOpN4s75rUO3XD39+h67H+u/v5PW2YdOjPqqrWAeuG7mMSkmyuqumh+9C+8fd36Or5dzfp4Z1twHFj60tbTZI0AZMO/WuB5UlOSPJ44Bxg44R7kKRuTXR4p6oeTvJ64EpgEbC+qm6ZZA8HmS6GsR7D/P0durr93aWqhu5BkjQhXpErSR0x9CWpI4a+JHXE0JfUjSRPTPJLQ/cxJEN/wjLyqiT/pq0fn+TUofuSHuuS/CZwA/D5tr4iSXdTxp29M2FJLgEeBV5YVb+cZDHwl1X1qwO3pr1I8iAw0/9ZAlRV/eyEW9I8JbkOeCHwP6rq5Fa7qaqePWxnk3XQ3YahA6dV1SlJvgpQVfe3C9V0EKuqpwzdg/bbj6tqZ5LxWndHvYb+5P243WK6AJJMMTry1yEkydOAJ+xar6r/PWA7mptbkvwusCjJcuD3gS8N3NPEOaY/ee8HPg08LclFwBeBfzdsS5qrJC9NcgdwF/A3wN3AXwzalObqXwHPBB4CPg7sBN44ZENDcEx/AElOBF7EaDz4qqq6beCWNEdJvsZoXPivqurkJC8AXlVV5w3cmmaR5JSqun7oPobmkf6EJXk/cFRVfaCq/qOBf8j5cVV9F3hcksdV1dVAl7foPQS9O8ltSS5M8qyhmxmKoT951wFvT/KNJH+SxMA4tDyQ5MnAF4CPJXkf8IOBe9IcVNULgBcA24EPJbkpydsHbmviHN4ZSJKjgN9idHvp46tq+cAtaQ6SPAn4v4wOmP4p8FTgY+3oX4eIJM8G/jXwO1XV1ew5Z+8M5xnAiYy+1swhnkNAm3X12XbE+CiwYeCWNA9Jfhn4HUYHW98F/hx486BNDcDQn7Ak/x54OfANRv/RXVhVDwzalOakqh5J8miSp1bVzqH70bytZ/T/uTOr6v8M3cxQDP3J+wbwvKqa2Jcya0F9H7gpySbGxvKr6veHa0lzUVXPG7qHg4Fj+hOS5MSq+nqSU2ba7lSyQ0OS1TOUq6ounXgzmpMkl1fVK5PcxE9fgbvrFhrPGai1QXikPzlvAtYA755hWzGa+62D35FV9b7xQpI3DNWM5mTX7+cfD9rFQcIj/QlL8oSq+tFsNR2cklxfVafsVvvqrht46eCV5F1V9ZbZao91ztOfvJnu9dHd/T8ONUnOTfLfgROSbBx7XA3sGLo/zcmLZ6idNfEuBubwzoQk+YfAEuCJSU5mNJ4I8LPAzwzWmObqS8C9wNH89BDdg8CNg3SkOUnyL4B/CfxCkvHf1VOAvx2mq+E4vDMh7QTgaxldsr95bNODwEer6lND9CU91iV5KrAY+GNg7dimB6uqu3+lGfoTluS3quqTQ/ehfbPbl6k8Hjgc+IFfonLo6P222A7vTEiSV1XVfwGWJXnT7tur6j0DtKV5Gv8ylYy+jWMVcPpwHWmu2tclvgf4OeA+fnI1/DOH7GvSPJE7OU9qP5/MaCxx94cOMTXy34Azh+5Fc/JORn+g/1dVncDo9ubXDNvS5Dm8I81Dkn8ytvo4RudofsOrPQ9+STZX1XT7ToSTq+rRJF+rqucO3dskObwzYe3eO+9kdKfGzwPPAf6gDf3o4PebY8sPM/rmrFXDtKJ52v222PfR4W2xPdKfsCQ3VNWKJC9ndIXgm4Av9Ha0IU1auy32jxhNl+72ttge6U/erv/NXwL816raOTofqENBkl8ELgGOqapnJXkO8NKqeufArWkWVTV+VN/tbbE9kTt5n03ydeBXgKuSTDE6+tCh4cPAW4EfA1TVjYy+CEcHuSQPJvnebo97knw6yS8M3d+keKQ/YVW1to3r72z3Z/8BjgkfSn6mqr6y27/OHh6qGc3LfwC2Ah9nNMRzDvB04HpG99p//lCNTZKhP2FJDgdeBfyjFhx/A/ynQZvSfHwnydNpF2gleQWj2zPo4PfS3c6drWvn2N6S5G2DdTVhhv7kXcLoKs4PtvVXt9o/G6wjzcf5wDrgxCTbgLsYnRTUwe+HSV4JfKKtv4KfDK12M6PF2TsTNtO84B7nCh+qkhzBKCyWAUcB32N0ndY7huxLs2vj9u8Dnsco5K8B/gDYBvxKVX1xwPYmxiP9yXskydOr6hvwd/8hPjJwT5q7zwAPMBoH7vZ7Vg9FVXUnP32dxbguAh8M/SH8IXB1kjvb+jLgdcO1o3laWlUrh25C8+d02xGnbE7e3wIfAh5l9OUbHwL+56AdaT6+lOTZQzehfeJ0WzzSH8KljMaBL2zrvwv8KfDbg3Wk+fh14LVJ7gIeotMv1z5EOd0WQ38Iz6qqk8bWr05y62DdaL66+3q9xxCn22LoD+H6JKdX1TUASU7jp79JSwexqvrm0D1onzndFqdsTlyS24BfAnZ9W8/xwO2M/pnpMIF0gDjddsQj/clz5oc0DKfb4pG+pE4kubmqnjV0H0NzyqakXjjdFo/0JXWizZJ7BqMTuN1OtzX0JXUhyc/PVO9tRpahL0kdcUxfkjpi6EtSRwx9SeqIoS9JHTH0Jakj/x8Hn3iXewQayQAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pn_df[\"label\"].value_counts().plot(kind=\"bar\")" ] }, { "cell_type": "markdown", "id": "ff806b53-99e5-432a-96bd-8dec55c05830", "metadata": {}, "source": [ "## データの目視チェック\n", "\n", "データを目視でチェックします。以外と忘れられているステップかもしれませんが最も重要なステップの一つで、\n", "以下の問題点を洗い出します。\n", "\n", "* データ自体の問題\n", "* アノテーションの問題\n", "\n", "次のステップに進む前に入念にデータに問題がないかをチェックしましょう。" ] }, { "cell_type": "code", "execution_count": 12, "id": "c6817d8d-8b5c-4435-96d1-fcf25566c47c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labeltextjudges
1521positive枕も選べます。{\"1\": 3}
5010positiveおやじ3人で宿泊しましたが、いたるところに気配りがなされていて、のんびりできました。{\"1\": 3}
2363positive想像以上に快適でした。{\"1\": 3}
3896positive朝食大満足。{\"1\": 3}
744positiveまたきます。{\"0\": 1, \"1\": 2}
\n", "
" ], "text/plain": [ " label text judges\n", "1521 positive 枕も選べます。 {\"1\": 3}\n", "5010 positive おやじ3人で宿泊しましたが、いたるところに気配りがなされていて、のんびりできました。 {\"1\": 3}\n", "2363 positive 想像以上に快適でした。 {\"1\": 3}\n", "3896 positive 朝食大満足。 {\"1\": 3}\n", "744 positive またきます。 {\"0\": 1, \"1\": 2}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 繰り返し実行して中身をチェックします\n", "pn_df.query('label == \"positive\"').sample(n=5)" ] }, { "cell_type": "code", "execution_count": 13, "id": "418560b6-e42c-49ef-a60d-f4df2d88db37", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labeltextjudges
1619negativeアメニティが少ない。{\"-1\": 3}
528negative繁忙・ハイシーズンと平常時と比べて、料金差が大きい。{\"-1\": 3}
57negative今回は残念でした。{\"-1\": 3}
3794negativeフロントの対応も悪い。{\"-1\": 3}
3766negative露天風呂は小さめ。{\"-1\": 3}
\n", "
" ], "text/plain": [ " label text judges\n", "1619 negative アメニティが少ない。 {\"-1\": 3}\n", "528 negative 繁忙・ハイシーズンと平常時と比べて、料金差が大きい。 {\"-1\": 3}\n", "57 negative 今回は残念でした。 {\"-1\": 3}\n", "3794 negative フロントの対応も悪い。 {\"-1\": 3}\n", "3766 negative 露天風呂は小さめ。 {\"-1\": 3}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pn_df.query('label == \"negative\"').sample(n=5)" ] }, { "cell_type": "markdown", "id": "3b6cf55e", "metadata": {}, "source": [ "## Pandas Profilingを使う\n", "\n", "https://github.com/ydataai/pandas-profiling\n", "\n", "```{note}\n", "Pandas Profilingを使うには事前にパッケージをインストールしておく必要があります。\n", "\n", " $ pip install pandas-profiling[notebook]==3.2.0\n", " \n", "```" ] }, { "cell_type": "code", "execution_count": 14, "id": "98f44515", "metadata": {}, "outputs": [], "source": [ "from pandas_profiling import ProfileReport\n", "\n", "profile = ProfileReport(pn_df, title=\"Pandas Profiling Report\", explorative=True)" ] }, { "cell_type": "code", "execution_count": 15, "id": "e38978c9", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "766d5f6d975148a4a0a639b0c5954763", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Summarize dataset: 0%| | 0/5 [00:00