{ "cells": [ { "cell_type": "markdown", "id": "0c97d6cd", "metadata": {}, "source": [ "# ランダムフォレスト\n", "\n", "ランダムフォレストについては以下を参照してください。\n", "[https://scikit-learn.org/stable/modules/ensemble.html#forest](https://scikit-learn.org/stable/modules/ensemble.html#forest)" ] }, { "cell_type": "markdown", "id": "beaeafb3", "metadata": {}, "source": [ "**データとモジュールのロード**\n", "\n", "学習に使うデータをロードします。" ] }, { "cell_type": "code", "execution_count": 1, "id": "aa65ec88", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn import model_selection\n", "\n", "data = pd.read_csv(\"input/pn_same_judge_preprocessed.csv\")\n", "train, test = model_selection.train_test_split(data, test_size=0.1, random_state=0)" ] }, { "cell_type": "code", "execution_count": 2, "id": "a8500c05", "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.metrics import PrecisionRecallDisplay" ] }, { "cell_type": "markdown", "id": "0c69e19a", "metadata": {}, "source": [ "## 決定木\n", "\n", "ランダムフォレストは決定木の\n", "[バギング](https://ja.wikipedia.org/wiki/バギング)\n", "によりアンサンブル学習する手法なので、\n", "まずは決定木から始めましょう。\n", "\n", "決定木を学習するには\n", "[sklearn.tree.DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)\n", "を使います。\n", "ここでは正則化のために `max_depth`, `min_samples_leaf` パラメータを指定しています。" ] }, { "cell_type": "code", "execution_count": 3, "id": "d0497415", "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier" ] }, { "cell_type": "code", "execution_count": 4, "id": "817bb85b", "metadata": {}, "outputs": [], "source": [ "pipe_dt = Pipeline([\n", " (\"vect\", TfidfVectorizer(tokenizer=str.split)),\n", " (\"clf\", DecisionTreeClassifier(max_depth=2, min_samples_leaf=10, random_state=0)),\n", "])" ] }, { "cell_type": "code", "execution_count": 5, "id": "77ec26e9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
"Pipeline(steps=[('vect',\n", "                 TfidfVectorizer(tokenizer=<method 'split' of 'str' objects>)),\n", "                ('clf',\n", "                 DecisionTreeClassifier(max_depth=2, min_samples_leaf=10,\n", "                                        random_state=0))])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit the decision tree pipeline on the training split.\n", "# Column names \"text\" and \"label\" are an assumption about the preprocessed CSV.\n", "pipe_dt.fit(train[\"text\"], train[\"label\"])" ] },
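{ "cell_type": "markdown", "id": "added-dt-eval-md", "metadata": {}, "source": [ "**Evaluating the decision tree (sketch)**\n", "\n", "`PrecisionRecallDisplay` is imported above, so a natural next step is to draw a precision-recall curve for the fitted pipeline on the held-out split. The cell below is a minimal sketch: it assumes the preprocessed CSV has a whitespace-tokenized `text` column and a binary `label` column, which the surviving cells do not confirm." ] },
{ "cell_type": "code", "execution_count": null, "id": "added-dt-eval-code", "metadata": {}, "outputs": [], "source": [ "# Sketch: precision-recall curve for the decision tree pipeline on the test split.\n", "# Assumptions: columns \"text\" and \"label\", binary labels; pass pos_label=... explicitly\n", "# if the labels are strings.\n", "PrecisionRecallDisplay.from_estimator(pipe_dt, test[\"text\"], test[\"label\"])" ] },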
{ "cell_type": "markdown", "id": "added-bagging-md", "metadata": {}, "source": [ "## Bagging\n", "\n", "Since a random forest is built on bagging, first try plain bagging of randomized decision trees with\n", "[sklearn.ensemble.BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)." ] },
{ "cell_type": "code", "execution_count": null, "id": "added-bagging-def", "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import BaggingClassifier\n", "\n", "pipe_bag = Pipeline([\n", "    (\"vect\", TfidfVectorizer(tokenizer=str.split)),\n", "    (\"clf\", BaggingClassifier(base_estimator=DecisionTreeClassifier(splitter='random'),\n", "                              n_estimators=1000, n_jobs=-1, random_state=0)),\n", "])" ] },
{ "cell_type": "code", "execution_count": null, "id": "added-bagging-fit", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(steps=[('vect',\n", "                 TfidfVectorizer(tokenizer=<method 'split' of 'str' objects>)),\n", "                ('clf',\n", "                 BaggingClassifier(base_estimator=DecisionTreeClassifier(splitter='random'),\n", "                                   n_estimators=1000, n_jobs=-1,\n", "                                   random_state=0))])" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Column names \"text\" and \"label\" are assumed; adjust to the actual data.\n", "pipe_bag.fit(train[\"text\"], train[\"label\"])" ] },
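{ "cell_type": "markdown", "id": "added-oob-md", "metadata": {}, "source": [ "**Out-of-bag estimate (sketch)**\n", "\n", "Bagging also provides a free generalization estimate from the samples left out of each bootstrap. The cell below is a sketch, not part of the original pipeline definitions: it refits the same bagging configuration with `oob_score=True` and assumes the `text`/`label` columns used above. (`base_estimator` matches the output above; scikit-learn 1.2+ renames it to `estimator`.)" ] },
{ "cell_type": "code", "execution_count": null, "id": "added-oob-code", "metadata": {}, "outputs": [], "source": [ "# Sketch: out-of-bag accuracy for the same bagging configuration.\n", "# Assumptions: columns \"text\" and \"label\"; scikit-learn < 1.4 (base_estimator argument).\n", "pipe_bag_oob = Pipeline([\n", "    (\"vect\", TfidfVectorizer(tokenizer=str.split)),\n", "    (\"clf\", BaggingClassifier(base_estimator=DecisionTreeClassifier(splitter='random'),\n", "                              n_estimators=1000, oob_score=True, n_jobs=-1, random_state=0)),\n", "])\n", "pipe_bag_oob.fit(train[\"text\"], train[\"label\"])\n", "print(\"OOB accuracy:\", pipe_bag_oob.named_steps[\"clf\"].oob_score_)" ] },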
{ "cell_type": "markdown", "id": "added-rf-md", "metadata": {}, "source": [ "## Random Forest\n", "\n", "A random forest goes one step further than plain bagging by also sampling the candidate features considered at each split. Use\n", "[sklearn.ensemble.RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)\n", "with the same number of trees." ] },
{ "cell_type": "code", "execution_count": null, "id": "added-rf-def", "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "pipe_rf = Pipeline([\n", "    (\"vect\", TfidfVectorizer(tokenizer=str.split)),\n", "    (\"clf\", RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=0)),\n", "])" ] },
{ "cell_type": "code", "execution_count": null, "id": "added-rf-fit", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(steps=[('vect',\n", "                 TfidfVectorizer(tokenizer=<method 'split' of 'str' objects>)),\n", "                ('clf',\n", "                 RandomForestClassifier(n_estimators=1000, n_jobs=-1,\n", "                                        random_state=0))])" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Column names \"text\" and \"label\" are assumed; adjust to the actual data.\n", "pipe_rf.fit(train[\"text\"], train[\"label\"])" ] },
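{ "cell_type": "markdown", "id": "added-rf-importance-md", "metadata": {}, "source": [ "**Inspecting the forest (sketch)**\n", "\n", "One quick way to look inside the fitted random forest is to map its impurity-based feature importances back to the TF-IDF vocabulary. The cell below is a sketch: it assumes `pipe_rf` was fitted as above and that scikit-learn >= 1.0 is available (for `get_feature_names_out`)." ] },
{ "cell_type": "code", "execution_count": null, "id": "added-rf-importance-code", "metadata": {}, "outputs": [], "source": [ "# Sketch: top tokens by impurity-based feature importance in the random forest.\n", "# Assumes pipe_rf has been fitted above; get_feature_names_out needs scikit-learn >= 1.0.\n", "import numpy as np\n", "\n", "vect = pipe_rf.named_steps[\"vect\"]\n", "clf = pipe_rf.named_steps[\"clf\"]\n", "terms = vect.get_feature_names_out()\n", "order = np.argsort(clf.feature_importances_)[::-1][:20]\n", "for term, score in zip(terms[order], clf.feature_importances_[order]):\n", "    print(term, round(score, 4))" ] },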
{ "cell_type": "markdown", "id": "added-gb-md", "metadata": {}, "source": [ "## Gradient Boosting\n", "\n", "For comparison, train a boosted ensemble of trees with\n", "[sklearn.ensemble.GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).\n", "Unlike bagging, boosting fits the trees sequentially, each one correcting the errors of the current ensemble." ] },
{ "cell_type": "code", "execution_count": null, "id": "added-gb-def", "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import GradientBoostingClassifier\n", "\n", "pipe_gb = Pipeline([\n", "    (\"vect\", TfidfVectorizer(tokenizer=str.split)),\n", "    (\"clf\", GradientBoostingClassifier(n_estimators=1000, random_state=0)),\n", "])" ] },
{ "cell_type": "code", "execution_count": null, "id": "added-gb-fit", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(steps=[('vect',\n", "                 TfidfVectorizer(tokenizer=<method 'split' of 'str' objects>)),\n", "                ('clf',\n", "                 GradientBoostingClassifier(n_estimators=1000,\n", "                                            random_state=0))])" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Column names \"text\" and \"label\" are assumed; adjust to the actual data.\n", "pipe_gb.fit(train[\"text\"], train[\"label\"])" ] },
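{ "cell_type": "markdown", "id": "added-compare-md", "metadata": {}, "source": [ "**Comparing the ensembles (sketch)**\n", "\n", "To wrap up, the curves of all four pipelines can be overlaid on one plot with the `PrecisionRecallDisplay` imported at the top. This is a sketch under the same assumptions as before: binary labels, the `text`/`label` column names, and every pipeline fitted above." ] },
{ "cell_type": "code", "execution_count": null, "id": "added-compare-code", "metadata": {}, "outputs": [], "source": [ "# Sketch: overlay precision-recall curves of the fitted pipelines on the test split.\n", "# Assumptions: binary labels, columns \"text\" and \"label\", all pipelines fitted above.\n", "import matplotlib.pyplot as plt\n", "\n", "fig, ax = plt.subplots()\n", "for name, pipe in [(\"decision tree\", pipe_dt), (\"bagging\", pipe_bag),\n", "                   (\"random forest\", pipe_rf), (\"gradient boosting\", pipe_gb)]:\n", "    PrecisionRecallDisplay.from_estimator(pipe, test[\"text\"], test[\"label\"], name=name, ax=ax)\n", "ax.set_title(\"Precision-recall on the held-out split\")" ] }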
], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }