Probabilities vs. Scores in Classification Problems

What is a score

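For a binary classifier, we call one of the two classes the positive class. A higher score means the model is more confident that the input (observation) belongs to the positive class.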
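To make this concrete, here is a minimal sketch of a score for a two-feature observation. The weights, the bias, and the `score` and `rank_by_score` functions are hypothetical, chosen only for illustration. The point is that a score orders observations by confidence without having to lie between 0 and 1.

```python
# Hypothetical linear scorer: the weights and bias are made up for illustration.
WEIGHTS = (0.8, -0.5)
BIAS = 0.1


def score(observation):
    """Return an unbounded score; larger means more confidence in the positive class."""
    return sum(w * x for w, x in zip(WEIGHTS, observation)) + BIAS


def rank_by_score(observations):
    """Sort observations from most to least confident, according to the score."""
    return sorted(observations, key=score, reverse=True)


observations = [(1.0, 2.0), (3.0, 0.5), (0.0, 4.0)]
scores = [score(obs) for obs in observations]
ranked = rank_by_score(observations)
```

Here the scores fall both below 0 and above 1, so they cannot be read as probabilities, yet they still rank the observations.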

What is a probability

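Probability can be hard to define. For example, if we build a rain predictor and it tells us there is a 20% chance of rain today, how do we judge whether it was right? When we check the prediction, it either rained or it did not. The standard answer is that if we take days that look similar to today, it should rain on 20% of them, but then we end up arguing about how similar days must be before they can be grouped together. In more extreme cases, such as predicting an election result, it is hard to see how we could "repeat" the experiment to give the number any meaning.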

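In The Signal and the Noise, Nate Silver discusses this problem in chapter 4, on weather forecasting. Rather than worrying about what makes days similar to one another, take a different approach. Historically, on the days when the model predicted a 20% chance of rain, what fraction of the time did it actually rain? If the answer is far from 20%, the model's probabilities are not doing a good job of estimating the actual probability.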

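In practice we bin the probabilities, because there may be only a single day with a predicted chance of rain of exactly 0.2000. If we take all predictions between 17.5% and 22.5% and group them into one bucket, we expect the observed frequency of rain in that bucket to fall between 17.5% and 22.5%.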

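This process of comparing predicted probabilities with how often the event actually occurred is called calibration. A calibration curve plots the observed frequency against the predicted probability.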
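The binning step can be sketched in a few lines of plain Python. The function name `calibration_bins` and the forecasts below are made up for illustration. It uses equal-width bins over [0, 1], which is one reasonable choice and not the only one.

```python
def calibration_bins(outcomes, predicted, n_bins=10):
    """Group predictions into equal-width bins over [0, 1].

    Returns one (mean_predicted, observed_rate, count) tuple per
    non-empty bin, ordered by increasing predicted probability.
    """
    # Each bin accumulates [sum of predictions, number of positives, count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for y, p in zip(outcomes, predicted):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index][0] += p
        bins[index][1] += int(y)
        bins[index][2] += 1
    return [(total / n, positives / n, n) for total, positives, n in bins if n > 0]


# Made-up forecasts and outcomes (1 = it rained), for illustration only.
predicted = [0.18, 0.20, 0.22, 0.21, 0.19, 0.81, 0.79, 0.80, 0.82, 0.78]
outcomes = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
curve = calibration_bins(outcomes, predicted, n_bins=4)
```

In this toy data the forecasts near 20% are followed by rain one time in five, and those near 80% four times in five, so both points sit on the diagonal of the calibration curve.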

Probability calibration curves

A rain prediction example, using a Kaggle dataset


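First five rows of the dataset (the output of rain.head()). PRCP is precipitation in inches; TMAX and TMIN are the daily maximum and minimum temperatures in °F.

   DATE        PRCP  TMAX  TMIN  RAIN
0  1948-01-01  0.47    51    42  True
1  1948-01-02  0.59    45    36  True
2  1948-01-03  0.42    45    35  True
3  1948-01-04  0.31    45    34  True
4  1948-01-05  0.17    45    32  True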

import pandas as pd
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# load the data into a dataframe
rain = pd.read_csv('seattleWeather_1948-2017.csv').dropna()
rain.head()

# use the temperatures to predict whether or not there was rain that day
features = rain[['TMIN', 'TMAX']]
target = rain.RAIN.astype(bool)
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

# use a Naive Bayes classifier and make hard predictions
nb_rain = GaussianNB().fit(X_train, y_train)
# note: accuracy is measured on the training split here, not on held-out data
accuracy = nb_rain.score(X_train, y_train)
print(f'The hard predictions were right {100*accuracy:5.2f}% of the time')

# use the predict_proba method to get the probability of the positive case
predictions = nb_rain.predict_proba(X_train)[:, 1]

# call calibration_curve to "bin" similar predicted probabilities together, and calculate what percentage of the time
# it actually rains in that bin
binned_true_p, binned_predict_p = calibration_curve(y_train, predictions, n_bins=10)

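The call to calibration_curve bins similar predicted probabilities together and computes, for each bin, the fraction of days on which it actually rained.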

Calibration curve

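If our classifier output good probabilities, the points on the calibration curve would lie close to the red line (the diagonal, where the predicted probability equals the observed frequency). Instead, the points on the curve tell us that when our classifier claims the chance of rain is about 25%, it actually rained 33% of the time.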
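Reading a point off the curve comes down to one subtraction. The sketch below uses the 25% and 33% figures quoted in the text. The helper names and the 0.02 tolerance are arbitrary choices for illustration.

```python
def calibration_gap(predicted, observed):
    """Observed frequency minus predicted probability for one point on the curve."""
    return observed - predicted


def describe(predicted, observed, tolerance=0.02):
    """Label a point relative to the diagonal; the tolerance is an arbitrary choice."""
    gap = calibration_gap(predicted, observed)
    if abs(gap) <= tolerance:
        return "close to the diagonal"
    return "under-predicts" if gap > 0 else "over-predicts"


# The point quoted in the text: the model says about 25%, and it rains 33% of the time.
gap = calibration_gap(0.25, 0.33)
label = describe(0.25, 0.33)
```

A positive gap means the point lies above the diagonal, so the model under-predicts rain in that bin, here by about 8 percentage points.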

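In this example, the naive Bayes classifier produces numbers that look like probabilities, and they even come from predict_proba, but the calibration curve shows that they are really just scores.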

Recalibration and Platt scaling

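When a model's predict_proba method returns scores rather than probabilities, we can recalibrate the scores so that they are closer to probabilities. One way to do this is Platt scaling. In scikit-learn it is available through CalibratedClassifierCV with method='sigmoid'; the class also offers method='isotonic', which is isotonic regression rather than Platt scaling.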

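Platt scaling is a two-parameter optimization problem with the objective

$$\min_{A,B} \; -\sum_i \left[ t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \right]$$

where

$$t_i = \frac{y_i + 1}{2}$$

$$p_i = \frac{1}{1 + e^{A f(x_i) + B}}$$

Here $f$ is the binary classification model, so $f(x_i)$ is its score for observation $x_i$; $y_i$ is either $-1$ or $+1$; and $A$ and $B$ are the parameters that Platt scaling learns.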
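To show what fitting $A$ and $B$ involves, here is a minimal sketch that minimizes the objective above with plain batch gradient descent. It is not how scikit-learn or Platt's original paper do it (the paper uses regularized targets and a different optimizer). The function names, learning rate, step count, and toy data are all assumptions made for illustration. Differentiating the objective gives the gradients $\partial L / \partial A = \sum_i (t_i - p_i) f(x_i)$ and $\partial L / \partial B = \sum_i (t_i - p_i)$.

```python
import math


def platt_probability(score, a, b):
    """p = 1 / (1 + exp(a * score + b)), computed without overflow."""
    z = a * score + b
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def negative_log_likelihood(scores, labels, a, b, eps=1e-12):
    """The Platt objective; labels are -1 or +1, and p is clamped away from 0 and 1."""
    total = 0.0
    for f, y in zip(scores, labels):
        t = (y + 1) / 2
        p = min(max(platt_probability(f, a, b), eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total


def fit_platt(scores, labels, learning_rate=0.1, n_steps=5000):
    """Fit A and B by batch gradient descent on the averaged objective."""
    targets = [(y + 1) / 2 for y in labels]
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_steps):
        probs = [platt_probability(f, a, b) for f in scores]
        grad_a = sum((t - p) * f for t, p, f in zip(targets, probs, scores)) / n
        grad_b = sum(t - p for t, p in zip(targets, probs)) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Made-up classifier scores with overlapping classes, for illustration only.
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [-1, -1, -1, 1, -1, 1, -1, 1, 1, 1]
a, b = fit_platt(scores, labels)
calibrated = [platt_probability(f, a, b) for f in scores]
```

With this sign convention the fitted $A$ comes out negative whenever higher scores go with the positive class, so the calibrated probability still increases with the score.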

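In the weather example, we can recalibrate our naive Bayes classifier with CalibratedClassifierCV. Note that the call below passes method='isotonic'; passing method='sigmoid' instead would apply Platt scaling as defined above.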

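# wrap the classifier in a 5-fold cross-validated calibrator
# method='isotonic' fits an isotonic regression; use method='sigmoid' for Platt scaling
calibrated_nb_rain = CalibratedClassifierCV(nb_rain, cv=5, method='isotonic')
# Note we need to refit!
calibrated_nb_rain.fit(X_train, y_train)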

# access the probabilities after recalibration
calibrated_probs = calibrated_nb_rain.predict_proba(X_test)[:, 1]
binned_true_p, binned_predict_p = calibration_curve(y_test, calibrated_probs, n_bins=10)

The recalibrated probabilities lie closer to the red line.
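Because the call above uses method='isotonic', the recalibration is a non-decreasing step function rather than the sigmoid in the formula. A minimal sketch of the idea behind it is the pool-adjacent-violators algorithm, applied to outcomes that have already been sorted by score. The function name and the toy outcomes are assumptions for illustration, and this is not scikit-learn's implementation.

```python
def pool_adjacent_violators(values):
    """Return the non-decreasing sequence closest to values in squared error."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Made-up outcomes (1 = it rained), ordered from lowest to highest score.
outcomes_by_score = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
fitted = pool_adjacent_violators(outcomes_by_score)
```

Each run of out-of-order outcomes is replaced by its average, so the fitted values rise from 0 through 0.5 and 2/3 up to 1 as the score increases.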

https://wangyinan.cn/分类问题中概率与分数

Author: yinan

Published: August 20, 2023
