
Introduction to Sklearn

nanyue · 2024-11-24 19:41:21

Quick review of basic concepts

Supervised vs. unsupervised learning

The biggest difference is whether the data has labels. Industrial applications mainly use supervised learning.

Classification tasks vs. regression tasks

If a linear model works, never use a non-linear one (non-linear models overfit easily and are much more expensive to compute).

Model evaluation

  • accuracy: rarely used; it is easily misleading when the classes are imbalanced
  • recall vs. precision: there is a trade-off between the two
  • F1-score: a balanced combination of recall and precision
  • AUC: the area under the ROC curve
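As a minimal sketch (not part of the original post), the snippet below computes these metrics with sklearn.metrics; the labels and scores are invented purely for illustration.

```python
# Minimal sketch of the metrics listed above (toy labels/scores, for illustration only).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]                   # hard class predictions
y_score = [0.1, 0.6, 0.2, 0.9, 0.4, 0.8, 0.7, 0.3]  # predicted probability of class 1

print(accuracy_score(y_true, y_pred))   # fraction of correct predictions
print(precision_score(y_true, y_pred))  # of the predicted positives, how many are truly positive
print(recall_score(y_true, y_pred))     # of the actual positives, how many were found
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))   # area under the ROC curve; needs scores, not hard labels
```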

Feature processing (feature engineering)

  • The core factor that determines how well a machine-learning model performs
  • Closely tied to business/domain experience
  • Requires familiarity with the relevant tools

Overview of Sklearn's design

Official documentation:
https://scikit-learn.org/stable/

  • Classification
  • Regression
  • Clustering
  • Dimensionality reduction
  • Model selection
  • Preprocessing

Machine learning workflow

  • Get data: web crawlers, databases, data files (csv, excel, txt)
  • Process the data: text processing, bringing features onto a consistent scale, dimensionality reduction
  • Build a model: classification, regression, clustering
  • Evaluate the model: hyperparameter tuning, deciding which model is better

Simple, commonly used sklearn APIs
  • fit: train the model
  • transform: convert the data into the model's processed output (the label is kept after the test set)
  • predict: return the model's predictions
  • predict_proba: return the predicted probabilities
  • score: model accuracy (the default accuracy is rarely used in practice; it is usually switched to f1)
  • get_params: get the model's parameters

Preparing the data

Dataset split: Training data (70%), Validation data, Testing data (30%)

In practice, the split is usually not completely random. Data that has already happened (earlier in time) is used as the training set to predict future (later) data. Doing the opposite, i.e. using future data to predict the past, leaks prior information about events that have not yet happened; it is unreasonable and tends to cause overfitting.

There are also other scenarios, for example splitting by geographic region.
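The snippet below is a minimal sketch (not from the original post) contrasting a random split with a time-based split; the DataFrame, column names and cutoff date are invented for illustration.

```python
# Minimal sketch: random split vs. time-based split (toy DataFrame, for illustration only).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=10, freq="D"),
    "f1": range(10),
    "label": [0, 1] * 5,
})

# Random split: fine when samples are independent of time.
X_train, X_test, y_train, y_test = train_test_split(
    df[["f1"]], df["label"], test_size=0.3, random_state=42)

# Time-based split: train on the past, evaluate on the future.
cutoff = pd.Timestamp("2023-01-08")
train, test = df[df["date"] < cutoff], df[df["date"] >= cutoff]
```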

Data processing

Dataset: ML DATASETS

Standardization, or mean removal and variance scaling

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

[Should I normalize/standardize/rescale the data](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html)

StandardScaler

The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set.
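A minimal sketch of this fit-on-train / reuse-on-test pattern (the arrays are made up for illustration):

```python
# Minimal sketch: fit StandardScaler on the training data, reapply it to the test data.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_ from the training set
X_test_scaled = scaler.transform(X_test)        # reapplies the same transformation
print(scaler.mean_, scaler.scale_)
```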

MinMaxScaler

Scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size.
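A minimal sketch of both behaviours on a small made-up array (MinMaxScaler maps each feature to [0, 1]; MaxAbsScaler divides each feature by its maximum absolute value):

```python
# Minimal sketch of MinMaxScaler and MaxAbsScaler on a toy array.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

X = np.array([[1.0, -10.0], [2.0, 0.0], [3.0, 10.0]])
print(MinMaxScaler().fit_transform(X))  # each column rescaled to the [0, 1] range
print(MaxAbsScaler().fit_transform(X))  # each column divided by its max absolute value
```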

Normalization

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.

The Normalizer class also exposes the usual transformer methods such as fit and transform, but fit and transform have no real effect here: normalization transforms each sample independently, so there is no statistic that needs to be learned over all samples. The design exists only so that the object has the same interface as other transformers and can be passed to APIs such as sklearn's Pipeline.
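A minimal sketch (toy array for illustration) showing that Normalizer rescales each row to unit norm and that the functional normalize gives the same result:

```python
# Minimal sketch: per-sample (per-row) normalization to unit L2 norm.
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[3.0, 4.0], [1.0, 1.0]])
print(Normalizer(norm="l2").fit_transform(X))  # fit learns nothing; kept for pipeline compatibility
print(normalize(X, norm="l2"))                 # functional equivalent
```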

Binarization (discretization)

Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that make assumption that the input data is distributed according to a multi-variate Bernoulli distribution. For instance, this is the case for the sklearn.neural_network.BernoulliRBM.

Return indices of half-open bins to which each value of x belongs.

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)

Note: pandas.cut splits by value (equal-width bins, or explicit bin edges); for quantile-based bins use pandas.qcut.
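A minimal sketch (toy data for illustration) of sklearn's Binarizer next to pandas.cut and pandas.qcut:

```python
# Minimal sketch: thresholding with Binarizer, bucketing with pandas.cut / pandas.qcut.
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer

X = np.array([[0.2], [0.7], [1.5], [3.0]])
print(Binarizer(threshold=1.0).fit_transform(X))  # values > 1.0 become 1, the rest 0

x = pd.Series([1, 4, 7, 10, 13])
print(pd.cut(x, bins=3))   # equal-width bins over the value range
print(pd.qcut(x, q=3))     # quantile-based bins with roughly equal counts
```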

Encoding categorical features

We could encode categorical features as integers, but such integer representation can not be used directly with scikit-learn estimators, as these expect continuous input, and would interpret the categories as being ordered, which is often not desired.

One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in OneHotEncoder. This estimator transforms each categorical feature with m possible values into m binary features, with only one active.

OneHotEncoder (generally not used)

class sklearn.preprocessing.OneHotEncoder(n_values='auto', categorical_features='all', dtype=numpy.float64,
                                          sparse=True, handle_unknown='error')

Convert categorical variable into dummy/indicator variables

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)
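A minimal sketch (made-up column, assuming a recent scikit-learn release) comparing the current OneHotEncoder API with pandas.get_dummies; note that the signature quoted above is from an older sklearn version with differently named arguments.

```python
# Minimal sketch: one-hot encoding with OneHotEncoder vs. pandas.get_dummies (toy column).
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["beijing", "shanghai", "beijing", "shenzhen"]})

# Recent sklearn versions use sparse_output (older ones used sparse) and handle_unknown.
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
print(enc.fit_transform(df[["city"]]))
print(enc.get_feature_names_out())

print(pd.get_dummies(df, columns=["city"]))  # the pandas one-liner used more often in practice
```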

Imputation of missing values

For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning. A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known part of the data.

The SimpleImputer class provides basic strategies for imputing missing values, either using the mean, the median or the most frequent value of the column in which the missing values are located. This class also allows for different missing values encodings.

The imputation strategy:

  1. If “mean”, then replace missing values using the mean along the axis.
  2. If “median”, then replace missing values using the median along the axis.
  3. If “most_frequent”, then replace missing using the most frequent value along the axis.
  4. If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.

class sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)
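A minimal sketch (toy arrays for illustration): the imputation statistics are learned per column on the training data and then reused on new data.

```python
# Minimal sketch of SimpleImputer: column statistics from the training data are reused on test data.
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [7.0, 6.0]])
X_test = np.array([[np.nan, 5.0]])

imp = SimpleImputer(strategy="mean")  # or "median", "most_frequent", "constant"
print(imp.fit_transform(X_train))     # column means are [4.0, 4.0]
print(imp.transform(X_test))          # NaN replaced with the training-set column mean
```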

[Some practical tips]

  1. Try not to delete samples just because a few of their features are missing; in practice it is better to use business knowledge to fill in a reasonable estimate and make full use of the samples.
  2. If there is no sensible way to estimate a value, fill in an obviously meaningless placeholder such as -999 or -1.
  3. Other tools that may come in handy: np.nan, np.inf, df.fillna, df.replace

Feature selection

SelectFromModel

This can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets. SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. The features are considered unimportant and removed if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.

class sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None)

  • L1-based feature selection (used less often than tree-based models in practice; a model with L1 regularization is used to select features)
  • Tree-based feature selection (usually the first choice in practice; Random Forest is the typical choice, as in the sketch after the parameter descriptions below)

estimator: object. The base estimator from which the feature-selection transformer is built. If prefit is True, this can be an estimator that has already been fitted; if prefit is False, an estimator instance that has not yet been trained must be passed in.

threshold: string or float, optional, default None. The threshold used for feature selection: features whose corresponding coefficients (or importances) in the model are at or above this value are kept, and the rest are removed. If the parameter is a string, it can be “mean” (the mean of the coefficient values), “median” (the median), or a scaled version such as “0.x*mean” or “0.x*median”. If None, the threshold is 1e-5 when the estimator has an L1 penalty, and “mean” otherwise.

prefit: boolean, default False. Whether the base estimator passed in has already been fitted. If True, a fitted estimator must be supplied; if False, an unfitted estimator instance is enough. Note: in practice, the columns produced by one-hot encoding are generally not dropped unless the coefficients for all of those columns are zero.
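A minimal sketch of tree-based selection with SelectFromModel, using a synthetic dataset from make_classification purely for illustration:

```python
# Minimal sketch: SelectFromModel driven by Random Forest feature importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",        # keep features whose importance is at or above the median importance
)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)        # roughly half of the features survive
print(selector.get_support())  # boolean mask of the selected features
```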

※ The following methods consider each feature independently.

[Note] These are rarely used, because in practice it is hard to decide what threshold / number of features / proportion to set.

Removing features with low variance

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
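A minimal sketch (toy array for illustration) where the constant first column is dropped:

```python
# Minimal sketch of VarianceThreshold: zero-variance features are removed by default.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0],
              [0, 1, 4],
              [0, 1, 1]])
selector = VarianceThreshold()    # default threshold=0.0 drops constant features
print(selector.fit_transform(X))  # the all-zero first column is removed
print(selector.variances_)        # per-feature variances computed during fit
```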

Univariate feature selection

  1. SelectKBest removes all but the k highest scoring features
  2. SelectPercentile removes all but a user-specified highest scoring percentage of features

These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):

  • For regression: f_regression, mutual_info_regression
  • For classification: chi2, f_classif, mutual_info_classif

Dimensionality reduction (takes the joint contribution of all features into account)

[Note] Dimensionality reduction is also used fairly rarely in practice, since it is hard to end up with the business features you actually need; machine-learning models usually rely on regularization terms to penalize collinearity instead.

class sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)

class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)

Principal Component Analysis (PCA)

PCA works by mapping the original dataset into a new space in which the new column vectors of the matrix are mutually orthogonal. From a data-analysis point of view, PCA turns the covariance matrix of the data into column vectors that each “explain” a certain proportion of the variance.

  • Maximum-variance interpretation (the more components you keep, the more variance is explained): http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020209.html
  • Minimum squared-error interpretation: http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020216.html

[Note] PCA is actually rarely used on ordinary tabular datasets; it is used most often for image compression (e.g. when you only need to capture the main face region of a picture).

Truncated SVD (truncated singular value decomposition)

TruncatedSVD is very similar to PCA, but differs in that it works on the sample matrix X directly instead of on its covariance matrix.

Truncated SVD differs from ordinary SVD in that the factorization it produces has only the number of columns we specify as the truncation level. For example, given an n×n matrix, ordinary SVD will generate matrices with n columns, whereas truncated SVD generates only the number of columns we ask for.
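A minimal sketch (random data purely for illustration) running PCA and TruncatedSVD side by side:

```python
# Minimal sketch: PCA (centers the data) vs. TruncatedSVD (works on X directly).
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)
X = rng.rand(100, 5)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)                      # centers X, then projects onto 2 components
print(X_pca.shape, pca.explained_variance_ratio_)

svd = TruncatedSVD(n_components=2)
X_svd = svd.fit_transform(X)                      # no centering; also usable on sparse matrices
print(X_svd.shape, svd.explained_variance_ratio_)
```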

Model evaluation (1): parameter selection

Cross-validation: evaluating estimator performance

Learning the parameters of a prediction model and testing the model on the same data is a methodological mistake: a model that simply repeated the labels of the samples it had just seen would get a perfect score, yet it would fail to predict anything useful on data it has not seen. This situation is called overfitting. To avoid it, the usual practice in (supervised) machine-learning experiments is to hold out part of the available data as a test set X_test, y_test.

When evaluating different settings (“hyperparameters”) of a model, for example the parameter C that must be set manually for an SVM, there is still a risk of overfitting on the test set if the test set is used to choose the best hyperparameters, because the hyperparameters keep being tweaked until the model performs best on the test set. In this way, knowledge about the test set “leaks” into the model, and the evaluation metric no longer reports generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: train on the training set, evaluate on the validation set to select the model and its hyperparameters, and once a model we believe to be the “best” has been learned, run the final evaluation on the test set.

However, by splitting the available data into three sets, we drastically reduce the number of samples that can be used for learning the model, and the result can depend on the particular random choice of the (train, validation) pair of sets.

The solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for the final evaluation, but when doing CV a separate validation set is no longer needed. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). For each of the k “folds”, the following procedure is followed:

* A model is trained using k-1 of the folds as training data
* The resulting model is validated on the remaining fold (i.e. it is used as a validation set to compute a performance measure such as accuracy)

The final performance measure reported by k-fold cross-validation is the average of the metric values computed in each of the k rounds (for the details of the various k-fold cross-validation schemes, see the scikit-learn documentation: https://scikit-learn.org/stable/modules/cross_validation.html ). This approach can be computationally expensive, but it does not waste as much data as holding out a single fixed validation set, which is a major advantage when the number of samples is small.
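A minimal sketch of that k-fold loop written out by hand, using the iris dataset and a LogisticRegression model purely for illustration:

```python
# Minimal sketch: manual 5-fold cross-validation following the two steps above.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                # train on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))   # evaluate on the held-out fold

print(np.mean(scores))  # the reported CV performance is the mean over the k folds
```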

Computing cross-validated metrics

sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)

Parameters:

  • estimator — the model to use
  • X — the input data
  • y — the labels
  • scoring — which metric to evaluate with
  • cv — the number of cross-validation folds (default 5, usually set to 5-10; a KFold or Stratified iterator can also be passed in, but when an integer is passed a Stratified iterator is used by default for classifiers)
  • n_jobs — run n processes in parallel, default 1 (when the machine is otherwise idle, setting -1 is recommended so that all available resources are used)
  • verbose — whether to print the learning process (0, 1, 2 or 3; the larger the number, the more detailed the output; some models, such as the Perceptron here, have no iterative learning process to print)
  • error_score — whether to raise an error when an invalid parameter setting is encountered

Cross validation iterators

cv: int, cross-validation generator or an iterable, optional

In sklearn's cross-validation APIs, the cv parameter determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • integer, to specify the number of folds in a (Stratified)KFold,
  • CV splitter,
  • An iterable yielding (train, test) splits as arrays of indices.

The cv parameter can also be given one of the CV iterators built into sklearn:
  • K-fold
  • Stratified k-fold
  • Label k-fold
  • Leave-One-Out - LOO
  • Leave-P-Out - LPO
  • ...

ShuffleSplit, by contrast, randomly samples the data (regardless of its original order) and assembles subsets of the specified test_size and train_size for cross-validation.


ShuffleSplit randomly samples the whole dataset during each iteration to generate a training set and a test set; the test_size and train_size parameters control how large the test and training sets should be in each cross-validation iteration. Because the sampling is repeated from the full dataset in every iteration (effectively sampling with replacement across iterations), ShuffleSplit may select in one iteration samples that were already selected in a previous iteration. (Note that KFold, even with shuffle=True, still produces validation folds that never overlap; this is the biggest difference between the two.)

ShuffleSplit pays no attention to classes or groups (class proportions) when splitting. In practice we usually use StratifiedKFold (stratified sampling by class label) to split the samples for cross-validation, which ensures that the proportion of each class in the training and test sets matches the original dataset.

Similarly to the KFold vs. ShuffleSplit distinction above, the difference between StratifiedKFold and StratifiedShuffleSplit again lies in how the splits or samples are drawn, just with stratification added on top: StratifiedKFold performs the k-fold split within each class, while StratifiedShuffleSplit randomly samples the specified proportion within each class to form the validation set and uses the rest for training. Every split therefore respects the class distribution.
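A minimal sketch (using the built-in breast-cancer dataset purely for illustration) of cross_val_score with an explicit StratifiedKFold splitter, plus ShuffleSplit for comparison:

```python
# Minimal sketch: cross_val_score with StratifiedKFold vs. ShuffleSplit.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=skf, scoring="f1"))  # class ratios preserved in every fold

ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
print(cross_val_score(model, X, y, cv=ss, scoring="f1"))   # independent random splits, may overlap
```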

Grid Search: Searching for estimator parameters

A search consists of:

  1. an estimator (regressor or classifier such as sklearn.svm.SVC());
  2. a parameter space;
  3. a method for searching or sampling candidates;
  4. a cross-validation scheme; and
  5. a score function.

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)

Parameters:

  • estimator — the model to use
  • param_grid — a dictionary of parameters (keys are the names of the parameters to tune, values are lists of candidate values to try)
  • scoring — which metric to evaluate with (classifiers default to accuracy; it can be changed to 'f1', 'roc_auc', etc.)
  • cv — the number of cross-validation folds (default 5, usually set to 5-10; a KFold or Stratified iterator can also be passed in, but when an integer is passed a Stratified iterator is used by default for classifiers)
  • n_jobs — run n processes in parallel, default 1 (setting -1 is recommended so that the search runs in parallel)
  • verbose — whether to print the learning process (0, 1, 2 or 3; the larger the number, the more detailed the output; some models, such as the Perceptron here, have no iterative learning process to print)
  • iid — whether the samples are assumed to be independent and identically distributed (default True; deprecated in newer sklearn versions, as the signature above shows)
  • refit — whether to refit and return the best estimator on the whole training set; default True, in which case the GridSearchCV instance can be used directly for predict
  • error_score — whether to raise an error when an invalid parameter setting is encountered (default nan)

Model evaluation (2): evaluation metrics

For details, refer to the scoring metrics listed in the sklearn documentation:
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

For example, classification tasks can use scoring strings such as 'accuracy', 'precision', 'recall', 'f1' and 'roc_auc'.
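To tie GridSearchCV and the scoring parameter together, here is a minimal sketch using a RandomForestClassifier on the built-in breast-cancer dataset; the parameter grid is invented purely for illustration:

```python
# Minimal sketch: GridSearchCV with an explicit scoring string.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",  # any scoring string from the page linked above
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
print(search.predict(X[:5]))  # with refit=True (the default) the best model is ready to use
```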
