Logistic regression
For a general overview of logistic regression, see the scikit-learn user guide: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Loading the data and modules
import pandas as pd
from sklearn import model_selection
data = pd.read_csv("input/pn_same_judge_preprocessed.csv")
train, test = model_selection.train_test_split(data, test_size=0.1, random_state=0)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import PrecisionRecallDisplay
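Because the later sections deal with class imbalance, it can help to look at the label distribution right after loading. The following is a minimal sketch on a hypothetical toy DataFrame (the `toy` frame and its contents are assumptions; only the column names `tokens` and `label_num` come from the code above). It also passes `stratify`, which the split above does not use, to show one optional way of keeping the class ratio the same in both splits.

```python
import pandas as pd
from sklearn import model_selection

# Hypothetical stand-in for the real CSV: 4 positive and 16 negative rows.
toy = pd.DataFrame({
    "tokens": [f"word{i} sample" for i in range(20)],
    "label_num": [1] * 4 + [0] * 16,
})

# Share of each class in the full data.
print(toy["label_num"].value_counts(normalize=True))

# stratify keeps the class ratio (here 20% positives) in train and test.
toy_train, toy_test = model_selection.train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy["label_num"]
)
print(toy_train["label_num"].value_counts())
print(toy_test["label_num"].value_counts())
```

Without `stratify`, a small test set drawn from imbalanced data may contain very few positives, which makes the evaluation plots noisier.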
LogisticRegression
We use sklearn.linear_model.LogisticRegression.
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ("vect", TfidfVectorizer(tokenizer=str.split)),
    ("clf", LogisticRegression()),
])
pipe.fit(train["tokens"], train["label_num"])
Pipeline(steps=[('vect', TfidfVectorizer(tokenizer=<method 'split' of 'str' objects>)), ('clf', LogisticRegression())])
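To make the pipeline easier to try without the CSV file, here is a small self-contained sketch on made-up data. The sentences and labels are hypothetical; the only assumption carried over from the code above is that each text is already tokenized and joined with spaces, which is why `str.split` works as the tokenizer. Passing `token_pattern=None` is an optional addition that avoids the warning scikit-learn emits when a custom tokenizer makes the default pattern unused.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: whitespace-separated tokens with binary labels.
toy_tokens = [
    "good great fun",
    "great fun movie",
    "good movie",
    "bad boring slow",
    "boring slow movie",
    "bad movie",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_pipe = Pipeline([
    # str.split tokenizes on whitespace; token_pattern=None marks the
    # default regex as intentionally unused.
    ("vect", TfidfVectorizer(tokenizer=str.split, token_pattern=None)),
    ("clf", LogisticRegression()),
])
toy_pipe.fit(toy_tokens, toy_labels)

new_texts = ["great fun", "boring slow"]
toy_pred = toy_pipe.predict(new_texts)         # hard class labels
toy_proba = toy_pipe.predict_proba(new_texts)  # one column per class
print(toy_pred)
print(toy_proba)
```

Wrapping the vectorizer and the classifier in one `Pipeline` means the TF-IDF vocabulary is learned only from the data passed to `fit`, and the same transformation is applied automatically at prediction time.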
pred = pipe.predict(test["tokens"])
ConfusionMatrixDisplay.from_predictions(y_true=test["label_num"], y_pred=pred)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f6a74c22d60>
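The plot shows the confusion matrix graphically. If the raw counts are needed, `sklearn.metrics.confusion_matrix` returns them as an array. The sketch below uses hypothetical labels and predictions (not the results of the model above) to show how precision and recall for the positive class follow from the four cells.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for a binary problem.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

precision_demo = tp / (tp + fp)  # share of predicted positives that are correct
recall_demo = tp / (tp + fn)     # share of true positives that are found
print(cm)
print(precision_demo, recall_demo)
```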
score = pipe.predict_proba(test["tokens"])[:,1]
PrecisionRecallDisplay.from_predictions(
    y_true=test["label_num"],
    y_pred=score,
    name="LogisticRegression",
)
<sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x7f6a74b668b0>
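The precision-recall curve is built from the positive-class probabilities rather than from hard predictions, which is why the code above takes column 1 of `predict_proba`. As a rough sketch of what the display computes, the example below uses four hypothetical scores and calls `precision_recall_curve` and `average_precision_score` directly.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical labels and positive-class scores.
y_true_pr = np.array([0, 0, 1, 1])
y_score_pr = np.array([0.1, 0.4, 0.35, 0.8])

# One (precision, recall) pair per candidate threshold, plus a final point.
precision_pr, recall_pr, thresholds_pr = precision_recall_curve(y_true_pr, y_score_pr)

# Average precision summarizes the curve as a single number.
ap = average_precision_score(y_true_pr, y_score_pr)
print(precision_pr, recall_pr, thresholds_pr)
print(ap)
```

Average precision is the value that `PrecisionRecallDisplay` reports as AP in the legend, so it is a convenient number for comparing models without reading the plot.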
Handling imbalanced data
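The class_weight parameter lets the model account for imbalanced data. With class_weight="balanced", each class is weighted inversely to its frequency in the training labels.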
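To see what the "balanced" setting amounts to, the weights can be computed by hand. This sketch assumes a hypothetical label vector with 90 negatives and 10 positives and compares scikit-learn's `compute_class_weight` with the formula `n_samples / (n_classes * np.bincount(y))` from the scikit-learn documentation.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
classes_demo = np.unique(y_demo)

# Weights scikit-learn uses for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=classes_demo, y=y_demo)

# The same values computed by hand.
manual = len(y_demo) / (len(classes_demo) * np.bincount(y_demo))
print(weights)
print(manual)
```

In this example the rare class receives a weight nine times larger than the common class, so errors on positive samples contribute more to the loss.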
pipe_weight = Pipeline([
    ("vect", TfidfVectorizer(tokenizer=str.split)),
    ("clf", LogisticRegression(class_weight="balanced")),
])
pipe_weight.fit(train["tokens"], train["label_num"])
Pipeline(steps=[('vect', TfidfVectorizer(tokenizer=<method 'split' of 'str' objects>)), ('clf', LogisticRegression(class_weight='balanced'))])
score_weight = pipe_weight.predict_proba(test["tokens"])[:,1]
We compare this with the model trained without the class_weight option.
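import matplotlib.pyplot as plt
_, ax = plt.subplots()
# Draw both precision-recall curves on the same axes.
# The loop variable is named y_score so it does not overwrite the earlier `pred`.
for name, y_score in [
    ("LogisticRegression", score),
    ("LogisticRegression+weight", score_weight),
]:
    PrecisionRecallDisplay.from_predictions(ax=ax, y_true=test["label_num"], y_pred=y_score, name=name)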
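The same comparison can also be made numerically. The following is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the dataset used above), reporting average precision and recall at the default 0.5 threshold for both settings. The `results` dictionary and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 90% negatives, 10% positives.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

results = {}
for name, clf in [
    ("LogisticRegression", LogisticRegression(max_iter=1000)),
    ("LogisticRegression+weight", LogisticRegression(class_weight="balanced", max_iter=1000)),
]:
    clf.fit(X_train, y_train)
    y_score = clf.predict_proba(X_test)[:, 1]  # positive-class probability
    results[name] = {
        "average_precision": average_precision_score(y_test, y_score),
        "recall_at_0.5": recall_score(y_test, clf.predict(X_test)),
    }

for name, metrics in results.items():
    print(name, metrics)
```

Class weighting often changes recall at a fixed threshold more than it changes the precision-recall curve as a whole, so looking at both kinds of metric can give a fuller picture; how large the difference is depends on the data.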