Dataset:
Training set: 891 x 12: 891 passengers, 11 features, 1 label.
Test set: 418 x 11: 418 passengers, 11 features.
Variables:
survival: survival status, 0 = No, 1 = Yes; this is the label
pclass: ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
sex: sex
Age: age in years
sibsp: number of siblings/spouses aboard the Titanic
parch: number of parents/children aboard the Titanic; children travelling only with a nanny have parch = 0
ticket: ticket number
fare: passenger fare
cabin: cabin number
embarked: port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
PassengerId: passenger index
Name: passenger name
Import the required libraries:
%matplotlib inline
import pandas as pd
import numpy as np
import re
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier,GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
Inspect the dataset:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head(5)
# Inspect dataset details: data types, missing values, etc.
print(train.info())  # similarly, train['Age'].describe() prints summary statistics for 'Age'
# Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
Test set:
print(test.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
None
We can see that Fare, Cabin, Age, and Embarked have missing values.
Before filling them, visualize the data to inspect its distributions.
Data visualization:
# 1. Pie chart of the 'Survived' label; the survival rate is 38.38%.
train['Survived'].value_counts().plot.pie(autopct='%1.2f%%')
# Count plot of 'Embarked' with seaborn; most passengers embarked at port S.
sns.countplot(y='Embarked', data=train, saturation=1)
Handling missing values:
1. If the dataset is large and few values are missing, simply drop the incomplete rows.
2. If a feature is not very important for prediction, fill missing values with the mean or the mode. 'Embarked' is missing only two values, so filling with 'S' is sufficient.
train['Embarked'] = train['Embarked'].fillna('S')
train['Fare'] = train['Fare'].fillna(train['Fare'].median())
3. For some categorical features, assign a value that stands for "missing", such as 'U0', because missingness itself may carry information. For example, a missing Cabin value may mean the passenger had no cabin.
4. Predict missing values with a model. 'Age' is an important feature in this example and has many missing values, so the accuracy of the fill matters. Usually we train on the complete samples to predict the missing values and then fill them in. Another option is to fill 'Age' with random numbers drawn from (mean - std, mean + std).
# Method 1: fill with random numbers
age_avg = train['Age'].mean()
age_std = train['Age'].std()
age_null_count = train['Age'].isnull().sum()
age_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
train.loc[train['Age'].isnull(), 'Age'] = age_random_list
train['Age'] = train['Age'].astype(int)
# Method 2: predict the missing values with a random forest regressor.
from sklearn.ensemble import RandomForestRegressor
# Select features from the training set to predict 'Age'
age_df = train[['Age', 'Survived', 'Fare', 'Parch', 'SibSp', 'Pclass']]
age_df_notnull = age_df.loc[train['Age'].notnull()]
age_df_isnull = age_df.loc[train['Age'].isnull()]
X = age_df_notnull.values[:, 1:]
Y = age_df_notnull.values[:, 0]
# Train a random forest regressor
RFR = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
RFR.fit(X, Y)
predictAges = RFR.predict(age_df_isnull.values[:, 1:])
train.loc[train['Age'].isnull(), ['Age']] = predictAges
Inspect the dataset after filling the missing values:
print(train.info())
Correlation analysis
# Effect of 'Pclass' on 'Survived'
print(train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean())
# Or visualize the result above
train[['Pclass', 'Survived']].groupby(['Pclass']).mean().plot.bar()
# Effect of 'Sex' and 'Pclass' on 'Survived': survival rate by sex ('female' = 0, 'male' = 1) and class.
train[['Sex', 'Pclass', 'Survived']].groupby(['Pclass', 'Sex']).mean().plot.bar()
Conclusion: women survive at a higher rate in every cabin class, but there are still clear differences between classes.
# Relationship between 'Age' and 'Survived', using violin plots.
fig, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.violinplot(x="Pclass", y="Age", hue="Survived", data=train, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0, 110, 10))
sns.violinplot(x="Sex", y="Age", hue="Survived", data=train, split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0, 110, 10))
plt.show()
# Cut 'Age' into 5 equal-width bins, group by bin, and check survival per group.
train['CategoricalAge'] = pd.cut(train['Age'], 5)
print(train[['CategoricalAge', 'Survived']].groupby(['CategoricalAge'], as_index=False).mean())
# Plot
average_age = train[['CategoricalAge', 'Survived']].groupby(['CategoricalAge'], as_index=False).mean()
sns.barplot(x='CategoricalAge', y='Survived', data=average_age)
# Group by custom bins instead
bins = [0, 12, 18, 65, 100]
train['Age_group'] = pd.cut(train['Age'], bins)
by_age = train.groupby('Age_group')['Survived'].mean()
by_age
# Overall age distribution: histogram and box plot.
plt.figure(figsize=(12, 5))
plt.subplot(121)
train['Age'].hist(bins=70)
plt.xlabel('Age')
plt.ylabel('Num')
plt.subplot(122)
train.boxplot(column='Age', showfliers=False)
plt.show()
# Summary statistics of age.
train['Age'].describe()
# 'Survived' density across different ages
facet = sns.FacetGrid(train, hue="Survived", aspect=4)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
# From the "Name" feature we can extract each person's title; return "" if there is no title.
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""
train['Title'] = train['Name'].apply(get_title)
print(pd.crosstab(train['Title'], train['Sex']))
# Output:
Sex       female  male
Title
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 125 0
Ms 1 0
Rev 0 6
Sir 0 1
# Effect of "Title" on the "Survived" rate. Merge rare titles into "Rare" and map "Mlle" etc. onto "Miss"/"Mrs".
train['Title'] = train['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')
# Effect of 'Title' on 'Survived'.
print(train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())
# Relationship between having siblings/a spouse aboard ('SibSp') and survival
sibsp_df = train[train['SibSp'] != 0]
no_sibsp_df = train[train['SibSp'] == 0]
plt.figure(figsize=(10, 5))
plt.subplot(121)
sibsp_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct='%1.1f%%')
plt.xlabel('sibsp')
plt.subplot(122)
no_sibsp_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct='%1.1f%%')
plt.xlabel('no_sibsp')
plt.show()
# Relationship between having parents/children aboard ('Parch') and survival
parch_df = train[train['Parch'] != 0]
no_parch_df = train[train['Parch'] == 0]
plt.figure(figsize=(10, 5))
plt.subplot(121)
parch_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct='%1.1f%%')
plt.xlabel('parch')
plt.subplot(122)
no_parch_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct='%1.1f%%')
plt.xlabel('no_parch')
plt.show()
# Relationship between number of relatives aboard and survival: SibSp & Parch
fig, ax = plt.subplots(1, 2, figsize=(18, 8))
train[['Parch', 'Survived']].groupby(['Parch']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Parch and Survived')
train[['SibSp', 'Survived']].groupby(['SibSp']).mean().plot.bar(ax=ax[1])
ax[1].set_title('SibSp and Survived')
# Effect of family size on survival. Travelling with too few or too many relatives lowers the survival rate.
train['Family_Size'] = train['Parch'] + train['SibSp'] + 1
train[['Family_Size', 'Survived']].groupby(['Family_Size']).mean().plot.bar()
# Relationship between fare mean/std and survival
# Fare correlates with survival: survivors paid a higher average fare than non-survivors.
fare_not_survived = train['Fare'][train['Survived'] == 0]
fare_survived = train['Fare'][train['Survived'] == 1]
average_fare = pd.DataFrame([fare_not_survived.mean(), fare_survived.mean()])
std_fare = pd.DataFrame([fare_not_survived.std(), fare_survived.std()])
average_fare.plot(yerr=std_fare, kind='bar', legend=False)
plt.show()
# Relationship between cabin (Cabin) and survival
# Cabin has too many missing values (only 204 valid entries), so it is hard to relate individual
# cabins to survival; during feature engineering we can simply drop it. Still, we can group all
# missing values into one category and use "has a Cabin record or not" as a feature:
# Fill missing values with "U0"
train.loc[train.Cabin.isnull(), 'Cabin'] = 'U0'
train['Has_Cabin'] = train['Cabin'].apply(lambda x: 0 if x == 'U0' else 1)
train[['Has_Cabin', 'Survived']].groupby(['Has_Cabin']).mean().plot.bar()
# Analyze cabin types by turning the letter part of 'Cabin' into a feature.
# Survival rates differ slightly across cabin letters, but not by much, so the feature can be dropped.
train['CabinLetter'] = train['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
train['CabinLetter'] = pd.factorize(train['CabinLetter'])[0]
train[['CabinLetter', 'Survived']].groupby(['CabinLetter']).mean().plot.bar()
# Relationship between port of embarkation (Embarked) and survival
# The Titanic departed from Southampton, England and called at Cherbourg, France and Queenstown,
# Ireland; passengers who boarded before Queenstown may have disembarked at Cherbourg or
# Queenstown and thus avoided the disaster.
sns.countplot(x='Embarked', hue='Survived', data=train)
plt.title('Embarked and Survived')
# Passengers who boarded at C have the highest survival rate; those who boarded at S the lowest.
train[['Embarked', 'Survived']].groupby(['Embarked']).mean().plot.bar()
The Titanic reportedly carried 2,224 passengers in total, while the training data covers only 891 of them. If those 891 were sampled at random from the 2,224, the sample is large enough (by the central limit theorem) for our analysis to be representative; if the sampling was not random, the conclusions may be unreliable.
Other features possibly related to survival
Beyond the features in the dataset, we can imagine other factors that might affect the model: nationality, height, weight, whether the passenger could swim, occupation, and so on.
Two given features were not analyzed above: Ticket (ticket number) and Cabin (cabin number). They may affect a passenger's location on the ship and hence the order of escape. But Cabin is mostly missing and Ticket has too many categories to analyze directly, so during model ensembling we let the models decide their importance.
Converting non-numeric features
1. Dummy variables
Dummy variables work well when a categorical variable has few distinct values. Take 'Embarked' as an example: it contains only the three values 'S', 'C', and 'Q', so we can convert it to dummies with:
embark_dummies = pd.get_dummies(train['Embarked'])
train = train.join(embark_dummies)
train.drop(['Embarked'], axis=1,inplace=True)
embark_dummies = train[['S', 'C', 'Q']]
embark_dummies.head()
2. Factorizing
When a categorical variable has too many values, dummy variables are unsuitable. Instead, use pandas' factorize to map each value to an integer ID. Take "Cabin" as an example:
# Replace missing 'Cabin' values with "U0"
train.loc[train.Cabin.isnull(), 'Cabin'] = 'U0'
train['CabinLetter'] = train['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
train['CabinLetter'] = pd.factorize(train['CabinLetter'])[0]
3. Scaling
When a feature's range is too large, scaling maps it to a smaller range, typically (-1, 1). Take "Age" as an example:
from sklearn import preprocessing
assert np.size(train['Age']) == 891
scaler = preprocessing.StandardScaler()
train['Age_scaled'] = scaler.fit_transform(train['Age'].values.reshape(-1, 1))
4. Binning
Binning groups "similar" values together (much like clustering); after binning, either factorize or dummy-encode the bins. Take "Fare" as an example:
# factorize
train['Fare_bin'] = pd.qcut(train['Fare'], 5)
train['Fare_bin'].head()
train['Fare_bin_id'] = pd.factorize(train['Fare_bin'])[0]
# dummies
fare_bin_dummies_df = pd.get_dummies(train['Fare_bin']).rename(columns=lambda x: 'Fare_' + str(x))
# pd.concat([X1, X2], axis=1) concatenates column-wise: the row count is unchanged and the columns add up.
train_data = pd.concat([train, fare_bin_dummies_df], axis=1)
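A quick sanity check of the `pd.concat` behavior described in the comment above (toy frames; the column names are purely illustrative):

```python
import pandas as pd

# Two frames sharing the same row index but with different columns.
a = pd.DataFrame({'x': [1, 2, 3]})
b = pd.DataFrame({'y': [4, 5, 6]})

# axis=1 concatenates column-wise: row count unchanged, columns added.
c = pd.concat([a, b], axis=1)
print(c.shape)  # (3, 2)
```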
Feature engineering
During feature engineering, process the training set and the test set together so that they share the same data distribution and data types.
# 1. Merge the training and test sets
train_df_org = pd.read_csv('train.csv')
test_df_org = pd.read_csv('test.csv')
test_df_org['Survived'] = 0
combined_train_test = train_df_org.append(test_df_org)
PassengerId = test_df_org['PassengerId']
# 2. 'Embarked' has few missing values; fill them with the mode
combined_train_test['Embarked'].fillna(combined_train_test['Embarked'].mode().iloc[0], inplace=True)
# Factorize the Embarked feature for the later feature analysis
combined_train_test['Embarked'] = pd.factorize(combined_train_test['Embarked'])[0]
# One-hot encode with pd.get_dummies
emb_dummies_df = pd.get_dummies(combined_train_test['Embarked'], prefix=combined_train_test[['Embarked']].columns[0])
combined_train_test = pd.concat([combined_train_test, emb_dummies_df], axis=1)
# 3. Dummy-encode Sex and Pclass in the same way.
# 4. Fare is missing one value in the test data; fill it with the mean.
combined_train_test['Fare'] = combined_train_test['Fare'].fillna(combined_train_test['Fare'].mean())
# 5. Converting Pclass to dummies would be enough. But to analyze the problem better, we assume that within each class the fare also reflects the cabin's position, which may well relate to the order of escape. So we split each class into a high-fare and a low-fare group.
from sklearn.preprocessing import LabelEncoder
# Build the Pclass Fare Category
def pclass_fare_category(df, pclass1_mean_fare, pclass2_mean_fare, pclass3_mean_fare):
    if df['Pclass'] == 1:
        if df['Fare'] <= pclass1_mean_fare:
            return 'Pclass1_Low'
        else:
            return 'Pclass1_High'
    elif df['Pclass'] == 2:
        if df['Fare'] <= pclass2_mean_fare:
            return 'Pclass2_Low'
        else:
            return 'Pclass2_High'
    elif df['Pclass'] == 3:
        if df['Fare'] <= pclass3_mean_fare:
            return 'Pclass3_Low'
        else:
            return 'Pclass3_High'
Pclass1_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([1]).values[0]
Pclass2_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([2]).values[0]
Pclass3_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([3]).values[0]
# Build the Pclass_Fare Category
combined_train_test['Pclass_Fare_Category'] = combined_train_test.apply(pclass_fare_category, args=(
    Pclass1_mean_fare, Pclass2_mean_fare, Pclass3_mean_fare), axis=1)
pclass_level = LabelEncoder()
# Fit the labels
pclass_level.fit(np.array(
    ['Pclass1_Low', 'Pclass1_High', 'Pclass2_Low', 'Pclass2_High', 'Pclass3_Low', 'Pclass3_High']))
# Transform to numeric values
combined_train_test['Pclass_Fare_Category'] = pclass_level.transform(combined_train_test['Pclass_Fare_Category'])
# Dummy conversion
pclass_dummies_df = pd.get_dummies(combined_train_test['Pclass_Fare_Category']).rename(columns=lambda x: 'Pclass_' + str(x))
combined_train_test = pd.concat([combined_train_test, pclass_dummies_df], axis=1)
# 6. Too few or too many relatives affects Survived, so combine Parch and SibSp into a Family_Size feature while also keeping the two originals.
def family_size_category(family_size):
    if family_size <= 1:
        return 'Single'
    elif family_size <= 4:
        return 'Small_Family'
    else:
        return 'Large_Family'
combined_train_test['Family_Size'] = combined_train_test['Parch'] + combined_train_test['SibSp'] + 1
combined_train_test['Family_Size_Category'] = combined_train_test['Family_Size'].map(family_size_category)
le_family = LabelEncoder()
le_family.fit(np.array(['Single', 'Small_Family', 'Large_Family']))
combined_train_test['Family_Size_Category'] = le_family.transform(combined_train_test['Family_Size_Category'])
family_size_dummies_df = pd.get_dummies(combined_train_test['Family_Size_Category'],
                                        prefix=combined_train_test[['Family_Size_Category']].columns[0])
combined_train_test = pd.concat([combined_train_test, family_size_dummies_df], axis=1)
# 7. Build a model to predict 'Age'
missing_age_df = pd.DataFrame(combined_train_test[['Age', 'Embarked', 'Sex', 'Title', 'Name_length', 'Family_Size', 'Family_Size_Category', 'Fare', 'Fare_bin_id', 'Pclass']])
missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
missing_age_test = missing_age_df[missing_age_df['Age'].isnull()]
from sklearn import ensemble
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
def fill_missing_age(missing_age_train, missing_age_test):
    missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
    missing_age_Y_train = missing_age_train['Age']
    missing_age_X_test = missing_age_test.drop(['Age'], axis=1)
    # GBM model
    gbm_reg = GradientBoostingRegressor(random_state=42)
    gbm_reg_param_grid = {'n_estimators': [2000], 'max_depth': [4], 'learning_rate': [0.01], 'max_features': [3]}
    gbm_reg_grid = model_selection.GridSearchCV(gbm_reg, gbm_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    gbm_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best GB Params:' + str(gbm_reg_grid.best_params_))
    print('Age feature Best GB Score:' + str(gbm_reg_grid.best_score_))
    print('GB Train Error for "Age" Feature Regressor:' + str(gbm_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test.loc[:, 'Age_GB'] = gbm_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_GB'][:4])
    # RF model
    rf_reg = RandomForestRegressor()
    rf_reg_param_grid = {'n_estimators': [200], 'max_depth': [5], 'random_state': [0]}
    rf_reg_grid = model_selection.GridSearchCV(rf_reg, rf_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    rf_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best RF Params:' + str(rf_reg_grid.best_params_))
    print('Age feature Best RF Score:' + str(rf_reg_grid.best_score_))
    print('RF Train Error for "Age" Feature Regressor' + str(rf_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test.loc[:, 'Age_RF'] = rf_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_RF'][:4])
    # Merge the two models
    print('shape1', missing_age_test['Age'].shape, missing_age_test[['Age_GB', 'Age_RF']].mode(axis=1).shape)
    # missing_age_test['Age'] = missing_age_test[['Age_GB', 'Age_LR']].mode(axis=1)
    missing_age_test.loc[:, 'Age'] = np.mean([missing_age_test['Age_GB'], missing_age_test['Age_RF']], axis=0)
    print(missing_age_test['Age'][:4])
    missing_age_test.drop(['Age_GB', 'Age_RF'], axis=1, inplace=True)
    return missing_age_test
combined_train_test.loc[(combined_train_test.Age.isnull()), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)
# 8. Cabin has too many missing values, so drop it.
# 9. Ticket values contain both letters and digits. Different letters may well indicate the cabin class or cabin location, which can also affect Survived, so we split off the letter part of Ticket and put the purely numeric tickets into one class.
combined_train_test['Ticket_Letter'] = combined_train_test['Ticket'].str.split().str[0]
combined_train_test['Ticket_Letter'] = combined_train_test['Ticket_Letter'].apply(lambda x: 'U0' if x.isnumeric() else x)
# To extract the numeric part instead, one could do the following; for now we simply keep all numeric tickets as one class.
# combined_train_test['Ticket_Number'] = combined_train_test['Ticket'].apply(lambda x: pd.to_numeric(x, errors='coerce'))
# combined_train_test['Ticket_Number'].fillna(0, inplace=True)
# Factorize Ticket_Letter
combined_train_test['Ticket_Letter'] = pd.factorize(combined_train_test['Ticket_Letter'])[0]
Pearson correlation heatmap
Correlation = pd.DataFrame(combined_train_test[['Embarked', 'Sex', 'Title', 'Name_length', 'Family_Size', 'Family_Size_Category', 'Fare', 'Fare_bin_id', 'Pclass', 'Pclass_Fare_Category', 'Age', 'Ticket_Letter', 'Cabin']])
# Correlation heatmap on the combined training and test sets
colormap = plt.cm.viridis
plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(Correlation.astype(float).corr(), linewidths=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
# Correlation heatmap on the training set
colormap = plt.cm.RdBu
plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(), linewidths=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
Conclusion: the Pearson correlation plot shows that not many features are strongly correlated with one another, i.e. the training set carries little redundant information. The two most correlated features are FamilySize and Parch. For the purposes of this exercise, both are kept.
Pair plots
# Pair plot on the training set
g = sns.pairplot(train[[u'Survived', u'Pclass', u'Sex', u'Age', u'Parch', u'Fare', u'Embarked',u'FamilySize', u'Title']], hue='Survived', palette = 'seismic',size=1.2,diag_kind = 'kde',diag_kws=dict(shade=True),plot_kws=dict(s=10) )
g.set(xticklabels=[])
Train models and make predictions
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
classifiers = [
KNeighborsClassifier(3),
SVC(probability=True),
DecisionTreeClassifier(),
RandomForestClassifier(),
AdaBoostClassifier(),
GradientBoostingClassifier(),
GaussianNB(),
LinearDiscriminantAnalysis(),
QuadraticDiscriminantAnalysis(),
LogisticRegression()]
log_cols = ["Classifier", "Accuracy"]
log = pd.DataFrame(columns=log_cols)
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)  # 10 splits
X = train.values[0::, 1::]  # every row, from the 2nd column to the last (the features)
y = train.values[0::, 0]    # every row, 1st column (the label)
acc_dict = {}
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for clf in classifiers:
        name = clf.__class__.__name__
        clf.fit(X_train, y_train)
        train_predictions = clf.predict(X_test)
        acc = accuracy_score(y_test, train_predictions)
        if name in acc_dict:
            acc_dict[name] += acc
        else:
            acc_dict[name] = acc
for clf in acc_dict:
    acc_dict[clf] = acc_dict[clf] / 10.0
    log_entry = pd.DataFrame([[clf, acc_dict[clf]]], columns=log_cols)
    log = log.append(log_entry)
plt.xlabel('Accuracy')
plt.title('Classifier Accuracy')
sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")
Model ensembling and testing
1. Feature selection with different models
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
# Return the top_n_features most important features chosen by each model. The final feature set is the union of the five models' selections; a DataFrame of all features and their importance scores is also returned.
def get_top_n_features(titanic_train_data_X, titanic_train_data_Y, top_n_features):
    # RF
    rf_est = RandomForestClassifier(random_state=0)
    rf_param_grid = {'n_estimators': [500], 'min_samples_split': [2, 3], 'max_depth': [20]}
    rf_grid = model_selection.GridSearchCV(rf_est, rf_param_grid, n_jobs=25, cv=10, verbose=1)
    rf_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best RF Params:' + str(rf_grid.best_params_))
    print('Top N Features Best RF Score:' + str(rf_grid.best_score_))
    print('Top N Features RF Train Score:' + str(rf_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_rf = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': rf_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    # The top_n_features rows of feature_imp_sorted_rf.
    features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
    print('Sample 10 Features from RF Classifier')
    print(str(features_top_n_rf[:10]))
    # AdaBoost
    ada_est = AdaBoostClassifier(random_state=0)
    ada_param_grid = {'n_estimators': [500], 'learning_rate': [0.01, 0.1]}
    ada_grid = model_selection.GridSearchCV(ada_est, ada_param_grid, n_jobs=25, cv=10, verbose=1)
    ada_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best Ada Params:' + str(ada_grid.best_params_))
    print('Top N Features Best Ada Score:' + str(ada_grid.best_score_))
    print('Top N Features Ada Train Score:' + str(ada_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_ada = pd.DataFrame({'feature': list(titanic_train_data_X),
                                           'importance': ada_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
    print('Sample 10 Feature from Ada Classifier:')
    print(str(features_top_n_ada[:10]))
    # ExtraTrees
    et_est = ExtraTreesClassifier(random_state=0)
    et_param_grid = {'n_estimators': [500], 'min_samples_split': [3, 4], 'max_depth': [20]}
    et_grid = model_selection.GridSearchCV(et_est, et_param_grid, n_jobs=25, cv=10, verbose=1)
    et_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best ET Params:' + str(et_grid.best_params_))
    print('Top N Features Best ET Score:' + str(et_grid.best_score_))
    print('Top N Features ET Train Score:' + str(et_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_et = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': et_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_et = feature_imp_sorted_et.head(top_n_features)['feature']
    print('Sample 10 Features from ET Classifier:')
    print(str(features_top_n_et[:10]))
    # GradientBoosting
    gb_est = GradientBoostingClassifier(random_state=0)
    gb_param_grid = {'n_estimators': [500], 'learning_rate': [0.01, 0.1], 'max_depth': [20]}
    gb_grid = model_selection.GridSearchCV(gb_est, gb_param_grid, n_jobs=25, cv=10, verbose=1)
    gb_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best GB Params:' + str(gb_grid.best_params_))
    print('Top N Features Best GB Score:' + str(gb_grid.best_score_))
    print('Top N Features GB Train Score:' + str(gb_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_gb = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': gb_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_gb = feature_imp_sorted_gb.head(top_n_features)['feature']
    print('Sample 10 Feature from GB Classifier:')
    print(str(features_top_n_gb[:10]))
    # DecisionTree
    dt_est = DecisionTreeClassifier(random_state=0)
    dt_param_grid = {'min_samples_split': [2, 4], 'max_depth': [20]}
    dt_grid = model_selection.GridSearchCV(dt_est, dt_param_grid, n_jobs=25, cv=10, verbose=1)
    dt_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best DT Params:' + str(dt_grid.best_params_))
    print('Top N Features Best DT Score:' + str(dt_grid.best_score_))
    print('Top N Features DT Train Score:' + str(dt_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_dt = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': dt_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_dt = feature_imp_sorted_dt.head(top_n_features)['feature']
    print('Sample 10 Features from DT Classifier:')
    print(str(features_top_n_dt[:10]))
    # Merge the five models' selections and drop duplicates, i.e. keep every feature any model found important.
    features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_et, features_top_n_gb, features_top_n_dt],
                               ignore_index=True).drop_duplicates()
    features_importance = pd.concat([feature_imp_sorted_rf, feature_imp_sorted_ada, feature_imp_sorted_et,
                                     feature_imp_sorted_gb, feature_imp_sorted_dt], ignore_index=True)
    return features_top_n, features_importance
The code above can be simplified as follows:
def get_top_n_features(clf, params, cv, X_train, y_train, top_n_features):
    grid_search = model_selection.GridSearchCV(clf, params, n_jobs=-1, cv=cv, verbose=1)
    grid_search.fit(X_train, y_train)
    print('Top N Features Best {} Params:{}'.format(clf.__class__.__name__, str(grid_search.best_params_)))
    print('Top N Features Best {} Score:{}'.format(clf.__class__.__name__, str(grid_search.best_score_)))
    print('Top N Features {} Train Score:{}'.format(clf.__class__.__name__, str(grid_search.score(X_train, y_train))))
    # X_train is a DataFrame; list(X_train) gives its column (feature) names.
    feature_imp_sorted = pd.DataFrame({'feature': list(X_train),
                                       'importance': grid_search.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n = feature_imp_sorted.head(top_n_features)['feature']
    print('Sample 10 Features from {} Classifier'.format(clf.__class__.__name__))
    print(str(features_top_n[:10]))
    return features_top_n, feature_imp_sorted
2. Build the training and test sets from the selected features
# Feature engineering can produce a large number of features, and features can be correlated
# with one another. Too many features not only slow down training but may also cause
# overfitting, so when there are too many we use different models to screen them and keep the
# top n features.
feature_to_pick = 30
feature_top_n, feature_importance = get_top_n_features(titanic_train_data_X, titanic_train_data_Y, feature_to_pick)
titanic_train_data_X = pd.DataFrame(titanic_train_data_X[feature_top_n])
titanic_test_data_X = pd.DataFrame(titanic_test_data_X[feature_top_n])
3. Visualize feature importance
rf_feature_imp = feature_importance[:10]
Ada_feature_imp = feature_importance[32:32+10].reset_index(drop=True)
# Compute relative feature importance
rf_feature_importance = 100.0 * (rf_feature_imp['importance'] / rf_feature_imp['importance'].max())
Ada_feature_importance = 100.0 * (Ada_feature_imp['importance'] / Ada_feature_imp['importance'].max())
# Indices of all target features
rf_important_idx = np.where(rf_feature_importance)[0]
Ada_important_idx = np.where(Ada_feature_importance)[0]
# Adapted from http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html
pos = np.arange(rf_important_idx.shape[0]) + .5
plt.figure(1, figsize = (18, 8))
plt.subplot(121)
plt.barh(pos, rf_feature_importance[rf_important_idx][::-1])
plt.yticks(pos, rf_feature_imp['feature'][::-1])
plt.xlabel('Relative Importance')
plt.title('RandomForest Feature Importance')
plt.subplot(122)
plt.barh(pos, Ada_feature_importance[Ada_important_idx][::-1])
plt.yticks(pos, Ada_feature_imp['feature'][::-1])
plt.xlabel('Relative Importance')
plt.title('AdaBoost Feature Importance')
plt.show()
Model ensembling
Common ensembling methods include Bagging, Boosting, Stacking, and Blending.
Bagging
Bagging combines the predictions of multiple base learners by simple (weighted) averaging or voting. Its advantage is that the base learners can be trained in parallel. Random Forest uses the Bagging idea.
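The Bagging idea can be sketched with scikit-learn's `BaggingClassifier` (a toy example on synthetic data, not part of the original pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 50 decision trees, each fit on a bootstrap resample; predictions are
# combined by majority vote, and the trees can be trained in parallel.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bag.fit(X, y)
print(bag.score(X, y))
```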
Boosting
Boosting works like learning from one's mistakes: each base learner is built on top of the previous one and compensates for its errors. AdaBoost and Gradient Boost, which we will use later, follow this idea.
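A minimal sketch of the Boosting idea with `AdaBoostClassifier`, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Toy data standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each new weak learner focuses on the samples the previous ones
# misclassified, so the ensemble is built sequentially, not in parallel.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))
```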
Stacking
Stacking trains a second-level learner to combine the base learners of the previous layer. If Bagging is a linear combination of base classifiers, Stacking is a non-linear combination. Learners can be stacked layer upon layer into a network-like structure.
Compared with the previous two methods, Stacking does tend to improve accuracy, so we use Stacking for the ensembling below.
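For reference, newer scikit-learn releases (0.22+) ship a `StackingClassifier` that implements this two-layer idea directly; the manual out-of-fold version used in this article gives finer control, but a minimal sketch on synthetic data (illustrative parameters) looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# First layer: heterogeneous base learners; second layer: a meta-learner
# trained on their cross-validated predictions (cv=5 avoids leakage).
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
                ('knn', KNeighborsClassifier(3))],
    final_estimator=LogisticRegression(),
    cv=5)
stack.fit(X, y)
print(stack.score(X, y))
```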
Blending
Blending is very similar to Stacking, but it additionally guards against information leakage.
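Since Blending is not implemented later in the article, here is a minimal sketch of the holdout idea on synthetic data (all names and parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Blending reserves a holdout set: the base learners never see it, so the
# meta-learner trains on leak-free predictions.
X_base, X_hold, y_base, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

bases = [RandomForestClassifier(n_estimators=50, random_state=0),
         LogisticRegression(max_iter=1000)]
for b in bases:
    b.fit(X_base, y_base)

# Base-learner probabilities on the holdout set become meta-features.
meta_features = np.column_stack([b.predict_proba(X_hold)[:, 1] for b in bases])
meta = LogisticRegression().fit(meta_features, y_hold)
print(meta.score(meta_features, y_hold))
```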
Stacking in this example: a two-layer structure. The predictions of seven models (RandomForest, AdaBoost, ExtraTrees, GBDT, DecisionTree, KNN, SVM) form the input to an XGBoost model, which is then trained to make the final prediction.
from sklearn.model_selection import KFold
ntrain = titanic_train_data_X.shape[0]
ntest = titanic_test_data_X.shape[0]
SEED = 0 # for reproducibility
NFOLDS = 7 # set folds for out-of-fold prediction
kf = KFold(n_splits=NFOLDS, shuffle=False)  # random_state only applies when shuffle=True; unshuffled splits are deterministic
def get_out_fold(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))
    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]
        clf.fit(x_tr, y_tr)
        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)
    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
rf = RandomForestClassifier(n_estimators=500, warm_start=True, max_features='sqrt',max_depth=6,
min_samples_split=3, min_samples_leaf=2, n_jobs=-1, verbose=0)
ada = AdaBoostClassifier(n_estimators=500, learning_rate=0.1)
et = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, max_depth=8, min_samples_leaf=2, verbose=0)
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.008, min_samples_split=3, min_samples_leaf=2, max_depth=5, verbose=0)
dt = DecisionTreeClassifier(max_depth=8)
knn = KNeighborsClassifier(n_neighbors = 2)
svm = SVC(kernel='linear', C=0.025)
x_train = titanic_train_data_X.values # Creates an array of the train data
x_test = titanic_test_data_X.values # Creates an array of the test data
y_train = titanic_train_data_Y.values
# Create our OOF train and test predictions. These base results will be used as new features
rf_oof_train, rf_oof_test = get_out_fold(rf, x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_out_fold(ada, x_train, y_train, x_test) # AdaBoost
et_oof_train, et_oof_test = get_out_fold(et, x_train, y_train, x_test) # Extra Trees
gb_oof_train, gb_oof_test = get_out_fold(gb, x_train, y_train, x_test) # Gradient Boost
dt_oof_train, dt_oof_test = get_out_fold(dt, x_train, y_train, x_test) # Decision Tree
knn_oof_train, knn_oof_test = get_out_fold(knn, x_train, y_train, x_test) # KNeighbors
svm_oof_train, svm_oof_test = get_out_fold(svm, x_train, y_train, x_test) # Support Vector
print("Training is complete")
x_train = np.concatenate((rf_oof_train, ada_oof_train, et_oof_train, gb_oof_train, dt_oof_train, knn_oof_train, svm_oof_train), axis=1)
x_test = np.concatenate((rf_oof_test, ada_oof_test, et_oof_test, gb_oof_test, dt_oof_test, knn_oof_test, svm_oof_test), axis=1)
from xgboost import XGBClassifier
gbm = XGBClassifier( n_estimators= 2000, max_depth= 4, min_child_weight= 2, gamma=0.9, subsample=0.8,
colsample_bytree=0.8, objective= 'binary:logistic', nthread= -1, scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)
Submit the predictions:
Submission = pd.DataFrame({'PassengerId': PassengerId, 'Survived': predictions})
Submission.to_csv('StackingSubmission.csv',index=False,sep=',')
Finally, we wrote a SklearnHelper class that makes it convenient to call different models in a uniform way.
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params['random_state'] = seed
        self.clf = clf(**params)
    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)
    def predict(self, x):
        return self.clf.predict(x)
    def fit(self, x, y):
        return self.clf.fit(x, y)
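A hypothetical usage of the helper (the class is repeated inside the snippet so it runs standalone; the data and parameters are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class SklearnHelper(object):  # repeated from above for self-containment
    def __init__(self, clf, seed=0, params=None):
        params['random_state'] = seed
        self.clf = clf(**params)
    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)
    def predict(self, x):
        return self.clf.predict(x)

# Pass the estimator class (not an instance) plus its keyword parameters.
rf = SklearnHelper(clf=RandomForestClassifier, seed=0,
                   params={'n_estimators': 100, 'max_depth': 6})

# Toy data: the label is a simple threshold on the first feature.
rng = np.random.RandomState(0)
X = rng.rand(80, 4)
y = (X[:, 0] > 0.5).astype(int)
rf.train(X, y)
preds = rf.predict(X)
```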
This article is largely based on:
https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python/notebook
https://blog.csdn.net/Koala_Tree/article/details/78725881
https://www.kaggle.com/mmueller/stacking-starter