Titanic Survival Prediction

nanyue · 2024-09-11 05:28:37

Dataset:

Training set: 891 × 12 (891 passengers, 11 features, 1 label).

Test set: 418 × 11 (418 passengers, 11 features).

Variables:

survival: survival status, 0 = No, 1 = Yes; this is the label

pclass: ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd

sex: gender

Age: age

sibsp: number of siblings/spouses aboard the Titanic

parch: number of parents/children aboard the Titanic; a child travelling only with a nanny has parch = 0

ticket: ticket number

fare: passenger fare

cabin: cabin number

embarked: port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

PassengerId: passenger index

Name: passenger name

Import the required libraries:

%matplotlib inline
import pandas as pd
import numpy as np
import re
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.svm import SVC
from sklearn.model_selection import KFold, StratifiedShuffleSplit  # sklearn.cross_validation was removed long ago
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier,GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

Inspect the dataset:

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head(5)
# Inspect dataset details: dtypes, missing values, etc.
print(train.info())  # Similarly, train['Age'].describe() prints summary statistics for 'Age'
# Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None

Test set:

print(test.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
None

As shown above, Fare, Cabin, Age and Embarked have missing values.

Before filling them in, visualize the data to inspect its distribution.

Data visualization:

# 1. Pie chart of the 'Survived' label; the survival rate is 38.38%.
train['Survived'].value_counts().plot.pie(autopct='%1.2f%%')
# Bar chart of 'Embarked' with seaborn; most passengers embarked at S.
sns.countplot(y='Embarked', data=train, saturation=1)

Handling missing values:

1. If the dataset is large and only a little data is missing, simply drop the incomplete rows.
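A minimal sketch of option 1; the tiny DataFrame below is a made-up toy example, not the Titanic data:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): the second row has a missing 'a'
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})
cleaned = df.dropna()  # drop every row that contains any NaN
print(len(cleaned))    # 2 rows survive
```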

2. If a feature is not very important for prediction, fill its missing values with the mean or the mode. 'Embarked' is missing only two values, so filling with 'S' is fine.

train['Embarked'] = train['Embarked'].fillna('S')
train['Fare'] = train['Fare'].fillna(train['Fare'].median())

3. For some categorical features, assign a value that stands for "missing", e.g. 'U0', since missingness itself may carry information: a missing Cabin, for instance, may mean the passenger had no cabin.

4. Predict the missing values with a model. 'Age' is an important feature here and has many missing values, so the filled values need to be reasonably accurate. Usually the complete samples are used to train a model that predicts the missing values, which are then filled in. Another approach fills 'Age' with random numbers drawn from (mean - std, mean + std).

# Method 1: fill with random numbers
age_avg = train['Age'].mean()
age_std = train['Age'].std()
age_null_count = train['Age'].isnull().sum()
age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
train.loc[np.isnan(train['Age']), 'Age'] = age_null_random_list
train['Age'] = train['Age'].astype(int)
# Method 2: predict the missing values with random forest regression
from sklearn.ensemble import RandomForestRegressor
# Select the columns used to predict 'Age'
age_df = train[['Age', 'Survived', 'Fare', 'Parch', 'SibSp', 'Pclass']]
age_df_notnull = age_df.loc[train['Age'].notnull()]
age_df_isnull = age_df.loc[train['Age'].isnull()]
X = age_df_notnull.values[:, 1:]
Y = age_df_notnull.values[:, 0]
# Train a random forest regressor
RFR = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
RFR.fit(X, Y)
predictAges = RFR.predict(age_df_isnull.values[:, 1:])
train.loc[train['Age'].isnull(), 'Age'] = predictAges

Check the dataset info again after filling the missing values:

print(train.info())

Data correlation analysis

# Effect of 'Pclass' on 'Survived'
print(train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean())
# Or visualize the result
train[['Pclass', 'Survived']].groupby(['Pclass']).mean().plot.bar()
# Effect of 'Sex' and 'Pclass' on 'Survived': survival rate by gender ('female' = 0, 'male' = 1) and class.
train[['Sex', 'Pclass', 'Survived']].groupby(['Pclass', 'Sex']).mean().plot.bar()

Conclusion: women have a higher survival rate in every cabin class, but there are still clear differences between classes.

# Relationship between 'Age' and 'Survived', shown with violin plots.
fig, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.violinplot("Pclass", "Age", hue="Survived", data=train, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0, 110, 10))
sns.violinplot("Sex", "Age", hue="Survived", data=train, split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0, 110, 10))
plt.show()
# Cut 'Age' into 5 equal-width bins and inspect each group's effect on 'Survived'.
train['CategoricalAge'] = pd.cut(train['Age'], 5)
print(train[['CategoricalAge', 'Survived']].groupby(['CategoricalAge'], as_index=False).mean())
# Plot it
average_age = train[['CategoricalAge', 'Survived']].groupby(['CategoricalAge'], as_index=False).mean()
sns.barplot(x='CategoricalAge', y='Survived', data=average_age)
# Group by custom bins
bins = [0, 12, 18, 65, 100]
train['Age_group'] = pd.cut(train['Age'], bins)
by_age = train.groupby('Age_group')['Survived'].mean()
by_age
# Overall age distribution: histogram and box plot.
plt.figure(figsize=(12, 5))
plt.subplot(121)
train['Age'].hist(bins=70)
plt.xlabel('Age')
plt.ylabel('Num')

plt.subplot(122)
train.boxplot(column='Age', showfliers=False)
plt.show()
# Summary statistics of 'Age'.
train['Age'].describe()
# 'Survived' across different values of 'Age'
facet = sns.FacetGrid(train, hue="Survived", aspect=4)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
# From the "Name" feature we can extract a person's title; return "" when there is none.
def get_title(name):
    title_search = re.search(' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""


train['Title'] = train['Name'].apply(get_title)
print(pd.crosstab(train['Title'], train['Sex']))
# Output:
Sex female male
Title
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 125 0
Ms 1 0
Rev 0 6
Sir 0 1
# Effect of "Title" on survival; group rare titles, and map "Mlle" etc. to "Miss".
train['Title'] = train['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')
# Effect of 'Title' on 'Survived'.
print(train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())
# Relationship between having siblings/spouses ('SibSp') and survival
sibsp_df = train[train['SibSp'] != 0]
no_sibsp_df = train[train['SibSp'] == 0]
plt.figure(figsize=(10, 5))
plt.subplot(121)
sibsp_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct='%1.1f%%')
plt.xlabel('sibsp')

plt.subplot(122)
no_sibsp_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct='%1.1f%%')
plt.xlabel('no_sibsp')
plt.show()
# Relationship between having parents/children ('Parch') and survival
parch_df = train[train['Parch'] != 0]
no_parch_df = train[train['Parch'] == 0]

plt.figure(figsize=(10, 5))
plt.subplot(121)
parch_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct='%1.1f%%')
plt.xlabel('parch')

plt.subplot(122)
no_parch_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct='%1.1f%%')
plt.xlabel('no_parch')
plt.show()
# Relationship between the number of relatives and survival: SibSp & Parch
fig, ax = plt.subplots(1, 2, figsize=(18, 8))
train[['Parch', 'Survived']].groupby(['Parch']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Parch and Survived')
train[['SibSp', 'Survived']].groupby(['SibSp']).mean().plot.bar(ax=ax[1])
ax[1].set_title('SibSp and Survived')
# Effect of family size on survival: travelling with too few or too many relatives lowers the survival rate.
train['Family_Size'] = train['Parch'] + train['SibSp'] + 1
train[['Family_Size', 'Survived']].groupby(['Family_Size']).mean().plot.bar()
# Relationship between fare mean/std and survival.
# Fare correlates with survival: survivors paid a higher average fare than non-survivors.
fare_not_survived = train['Fare'][train['Survived'] == 0]
fare_survived = train['Fare'][train['Survived'] == 1]
average_fare = pd.DataFrame([fare_not_survived.mean(), fare_survived.mean()])
std_fare = pd.DataFrame([fare_not_survived.std(), fare_survived.std()])
average_fare.plot(yerr=std_fare, kind='bar', legend=False)
plt.show()
# Relationship between cabin ('Cabin') and survival.
# Cabin has far too many missing values (only 204 are valid), so it is hard to relate individual
# cabins to survival, and the feature can simply be dropped during feature engineering. Still, we
# can analyse it here by grouping all missing values into one class: use "has a Cabin record or
# not" as a feature and relate it to survival:
# Fill missing values with "U0"
train.loc[train.Cabin.isnull(), 'Cabin'] = 'U0'
train['Has_Cabin'] = train['Cabin'].apply(lambda x: 0 if x == 'U0' else 1)
train[['Has_Cabin', 'Survived']].groupby(['Has_Cabin']).mean().plot.bar()
# Analyse the cabin types: build a feature from the letter part of 'Cabin'.
# Survival rates differ somewhat across cabins, but not by much, so the feature can be dropped.
train['CabinLetter'] = train['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
train['CabinLetter'] = pd.factorize(train['CabinLetter'])[0]
train[['CabinLetter', 'Survived']].groupby(['CabinLetter']).mean().plot.bar()
# Relationship between the embarkation port ('Embarked') and survival.
# The Titanic departed from Southampton (England) and called at Cherbourg (France) and Queenstown
# (Ireland), so people who boarded before Queenstown may have disembarked at Cherbourg or
# Queenstown and never faced the disaster.
sns.countplot('Embarked', hue='Survived', data=train)
plt.title('Embarked and Survived')
# Passengers who boarded at C have the highest survival rate, those who boarded at S the lowest.
train[['Embarked', 'Survived']].groupby(['Embarked']).mean().plot.bar()

Reportedly, the Titanic carried 2,224 passengers in total, while the training data covers only 891 of them. If the dataset was sampled at random from those 2,224 people, the sample is large enough, by the central limit theorem, for the analysis to be representative; if the sampling was not random, the conclusions may be unreliable.

Other features possibly related to survival

For information not included in the dataset, we can imagine other factors that might influence the model, such as a passenger's nationality, height, weight, ability to swim, occupation, and so on.

Two given features were also left unanalysed: Ticket (ticket number) and Cabin (cabin number). They may affect a passenger's location on the ship and hence the order of escape. But Cabin is mostly missing and Ticket has too many categories to find patterns in, so during model fusion we let the models decide how important these factors are.

Converting non-numeric features

1. Dummy Variables

Dummy variables work well when a categorical variable has few distinct values. 'Embarked', for example, takes only the three values 'S', 'C' and 'Q', and can be converted to dummies like this:

embark_dummies = pd.get_dummies(train['Embarked'])
train = train.join(embark_dummies)
train.drop(['Embarked'], axis=1, inplace=True)
embark_dummies = train[['S', 'C', 'Q']]
embark_dummies.head()

2. Factorizing

When a categorical variable has too many values, dummy variables are unsuitable. Instead, use pandas' factorize to map each value to an ID. Using "Cabin" as an example:

# Replace missing 'Cabin' values with "U0"
train.loc[train.Cabin.isnull(), 'Cabin'] = 'U0'
train['CabinLetter'] = train['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
train['CabinLetter'] = pd.factorize(train['CabinLetter'])[0]

3. Scaling

When a feature's range is too wide, scaling maps it into a smaller range, typically (-1, 1). Using "Age" as an example:

from sklearn import preprocessing
assert np.size(train['Age']) == 891
scaler = preprocessing.StandardScaler()
train['Age_scaled'] = scaler.fit_transform(train['Age'].values.reshape(-1, 1))

4. Binning

Binning groups "similar" values together (much like clustering); afterwards the bins are either factorized or converted to dummies. Using "Fare" as an example:

# factorize
train['Fare_bin'] = pd.qcut(train['Fare'], 5)
train['Fare_bin'].head()
train['Fare_bin_id'] = pd.factorize(train['Fare_bin'])[0]

# dummies
fare_bin_dummies_df = pd.get_dummies(train['Fare_bin']).rename(columns=lambda x: 'Fare_' + str(x))
# pd.concat([X1, X2], axis=1) concatenates column-wise: the rows stay the same, the columns add up.
train_data = pd.concat([train, fare_bin_dummies_df], axis=1)

Feature engineering

During feature engineering the training and test sets must be processed together, so that both end up with the same data distribution and data types.

# 1. Concatenate the training and test sets
train_df_org = pd.read_csv('train.csv')
test_df_org = pd.read_csv('test.csv')
test_df_org['Survived'] = 0
combined_train_test = pd.concat([train_df_org, test_df_org])  # DataFrame.append is deprecated
PassengerId = test_df_org['PassengerId']

# 2. 'Embarked' has few missing values; fill them with the mode
combined_train_test['Embarked'].fillna(combined_train_test['Embarked'].mode().iloc[0], inplace=True)
# Factorize 'Embarked' for the later feature analysis
combined_train_test['Embarked'] = pd.factorize(combined_train_test['Embarked'])[0]
# One-hot encode with pd.get_dummies
emb_dummies_df = pd.get_dummies(combined_train_test['Embarked'], prefix=combined_train_test[['Embarked']].columns[0])
combined_train_test = pd.concat([combined_train_test, emb_dummies_df], axis=1)

# 3. Apply the same dummy encoding to Sex.

# 4. Fare is missing one value in the test data; fill it with the mean.
combined_train_test['Fare'] = combined_train_test['Fare'].fillna(combined_train_test['Fare'].mean())

# 5. Pclass could simply be converted to dummies. To dig a little deeper, we assume that within
# each class the fare also reflects a cabin's position, which may well relate to the order of
# escape. So we split each class into a high-fare and a low-fare group.
from sklearn.preprocessing import LabelEncoder

# Build the Pclass/Fare category
def pclass_fare_category(df, pclass1_mean_fare, pclass2_mean_fare, pclass3_mean_fare):
    if df['Pclass'] == 1:
        if df['Fare'] <= pclass1_mean_fare:
            return 'Pclass1_Low'
        else:
            return 'Pclass1_High'
    elif df['Pclass'] == 2:
        if df['Fare'] <= pclass2_mean_fare:
            return 'Pclass2_Low'
        else:
            return 'Pclass2_High'
    elif df['Pclass'] == 3:
        if df['Fare'] <= pclass3_mean_fare:
            return 'Pclass3_Low'
        else:
            return 'Pclass3_High'

Pclass1_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([1]).values[0]
Pclass2_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([2]).values[0]
Pclass3_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([3]).values[0]

# Build the Pclass_Fare_Category feature
combined_train_test['Pclass_Fare_Category'] = combined_train_test.apply(pclass_fare_category, args=(
    Pclass1_mean_fare, Pclass2_mean_fare, Pclass3_mean_fare), axis=1)
pclass_level = LabelEncoder()

# Register the labels
pclass_level.fit(np.array(
    ['Pclass1_Low', 'Pclass1_High', 'Pclass2_Low', 'Pclass2_High', 'Pclass3_Low', 'Pclass3_High']))

# Convert to numeric values
combined_train_test['Pclass_Fare_Category'] = pclass_level.transform(combined_train_test['Pclass_Fare_Category'])

# Dummy conversion
pclass_dummies_df = pd.get_dummies(combined_train_test['Pclass_Fare_Category']).rename(columns=lambda x: 'Pclass_' + str(x))
combined_train_test = pd.concat([combined_train_test, pclass_dummies_df], axis=1)
# 6. Too few or too many relatives lowers 'Survived', so combine Parch and SibSp into a
# Family_Size feature while also keeping the two originals.
def family_size_category(family_size):
    if family_size <= 1:
        return 'Single'
    elif family_size <= 4:
        return 'Small_Family'
    else:
        return 'Large_Family'

combined_train_test['Family_Size'] = combined_train_test['Parch'] + combined_train_test['SibSp'] + 1
combined_train_test['Family_Size_Category'] = combined_train_test['Family_Size'].map(family_size_category)

le_family = LabelEncoder()
le_family.fit(np.array(['Single', 'Small_Family', 'Large_Family']))
combined_train_test['Family_Size_Category'] = le_family.transform(combined_train_test['Family_Size_Category'])

family_size_dummies_df = pd.get_dummies(combined_train_test['Family_Size_Category'],
                                        prefix=combined_train_test[['Family_Size_Category']].columns[0])
combined_train_test = pd.concat([combined_train_test, family_size_dummies_df], axis=1)
# 7. Build a model to predict 'Age'.
# Note: 'Title', 'Name_length' and 'Fare_bin_id' are assumed to have been built on
# combined_train_test beforehand (from 'Name' and 'Fare', as in the earlier sections).
missing_age_df = pd.DataFrame(combined_train_test[['Age', 'Embarked', 'Sex', 'Title', 'Name_length', 'Family_Size', 'Family_Size_Category', 'Fare', 'Fare_bin_id', 'Pclass']])
missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
missing_age_test = missing_age_df[missing_age_df['Age'].isnull()]

from sklearn import ensemble
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor

def fill_missing_age(missing_age_train, missing_age_test):
    missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
    missing_age_Y_train = missing_age_train['Age']
    missing_age_X_test = missing_age_test.drop(['Age'], axis=1)

    # GBM model
    gbm_reg = GradientBoostingRegressor(random_state=42)
    gbm_reg_param_grid = {'n_estimators': [2000], 'max_depth': [4], 'learning_rate': [0.01], 'max_features': [3]}
    gbm_reg_grid = model_selection.GridSearchCV(gbm_reg, gbm_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    gbm_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best GB Params:' + str(gbm_reg_grid.best_params_))
    print('Age feature Best GB Score:' + str(gbm_reg_grid.best_score_))
    print('GB Train Error for "Age" Feature Regressor:' + str(gbm_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test.loc[:, 'Age_GB'] = gbm_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_GB'][:4])

    # RF model
    rf_reg = RandomForestRegressor()
    rf_reg_param_grid = {'n_estimators': [200], 'max_depth': [5], 'random_state': [0]}
    rf_reg_grid = model_selection.GridSearchCV(rf_reg, rf_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    rf_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best RF Params:' + str(rf_reg_grid.best_params_))
    print('Age feature Best RF Score:' + str(rf_reg_grid.best_score_))
    print('RF Train Error for "Age" Feature Regressor' + str(rf_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test.loc[:, 'Age_RF'] = rf_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_RF'][:4])

    # Model fusion: average the two predictions element-wise
    print('shape1', missing_age_test['Age'].shape, missing_age_test[['Age_GB', 'Age_RF']].mode(axis=1).shape)
    # missing_age_test['Age'] = missing_age_test[['Age_GB', 'Age_LR']].mode(axis=1)

    missing_age_test.loc[:, 'Age'] = np.mean([missing_age_test['Age_GB'], missing_age_test['Age_RF']], axis=0)
    print(missing_age_test['Age'][:4])

    missing_age_test.drop(['Age_GB', 'Age_RF'], axis=1, inplace=True)

    return missing_age_test

combined_train_test.loc[combined_train_test['Age'].isnull(), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)['Age']

# 8. Cabin has too many missing values, so drop it.
# 9. Ticket values mix letters and digits. Different letters may well indicate different cabin
# classes or positions on the ship, which could also affect 'Survived', so split off the letter
# part of Ticket and put all purely numeric tickets into one class.

combined_train_test['Ticket_Letter'] = combined_train_test['Ticket'].str.split().str[0]
combined_train_test['Ticket_Letter'] = combined_train_test['Ticket_Letter'].apply(lambda x: 'U0' if x.isnumeric() else x)

# To extract the numeric part you could instead do the following; for now all numeric tickets form one class.
# combined_train_test['Ticket_Number'] = combined_train_test['Ticket'].apply(lambda x: pd.to_numeric(x, errors='coerce'))
# combined_train_test['Ticket_Number'].fillna(0, inplace=True)

# Factorize Ticket_Letter
combined_train_test['Ticket_Letter'] = pd.factorize(combined_train_test['Ticket_Letter'])[0]

Pearson correlation heatmap

Correlation = pd.DataFrame(combined_train_test[['Embarked', 'Sex', 'Title', 'Name_length', 'Family_Size', 'Family_Size_Category', 'Fare', 'Fare_bin_id', 'Pclass', 'Pclass_Fare_Category', 'Age', 'Ticket_Letter', 'Cabin']])

# Correlation heatmap over the combined training and test sets
colormap = plt.cm.viridis
plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(Correlation.astype(float).corr(), linewidths=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

# Correlation heatmap over the training set only
colormap = plt.cm.RdBu
plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(), linewidths=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

Conclusion: the Pearson correlation plot shows that few features are strongly correlated with each other, i.e. the training set contains little redundant data. The two most correlated features are FamilySize and Parch. For the purposes of this exercise, both are kept.

Pairplots

# Pairplot over the training set
g = sns.pairplot(train[[u'Survived', u'Pclass', u'Sex', u'Age', u'Parch', u'Fare', u'Embarked', u'FamilySize', u'Title']], hue='Survived', palette='seismic', size=1.2, diag_kind='kde', diag_kws=dict(shade=True), plot_kws=dict(s=10))
g.set(xticklabels=[])

Train models and make predictions

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

classifiers = [
    KNeighborsClassifier(3),
    SVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
    LogisticRegression()]

log_cols = ["Classifier", "Accuracy"]
log = pd.DataFrame(columns=log_cols)

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)  # 10 stratified random splits

X = train.values[:, 1:]  # all rows; every column from the 2nd to the last (the features)
y = train.values[:, 0]   # all rows; the first column (the 'Survived' label)

acc_dict = {}

for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    for clf in classifiers:
        name = clf.__class__.__name__
        clf.fit(X_train, y_train)
        train_predictions = clf.predict(X_test)
        acc = accuracy_score(y_test, train_predictions)
        if name in acc_dict:
            acc_dict[name] += acc
        else:
            acc_dict[name] = acc

for clf in acc_dict:
    acc_dict[clf] = acc_dict[clf] / 10.0
    log_entry = pd.DataFrame([[clf, acc_dict[clf]]], columns=log_cols)
    log = pd.concat([log, log_entry], ignore_index=True)  # DataFrame.append is deprecated

plt.xlabel('Accuracy')
plt.title('Classifier Accuracy')

sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")

Model fusion and testing

1. Feature selection with different models

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Return the union of the top_n_features most important features selected by each of the 5 models,
# together with a DataFrame of all features and their importance scores.
def get_top_n_features(titanic_train_data_X, titanic_train_data_Y, top_n_features):

    # RF
    rf_est = RandomForestClassifier(random_state=0)
    rf_param_grid = {'n_estimators': [500], 'min_samples_split': [2, 3], 'max_depth': [20]}
    rf_grid = model_selection.GridSearchCV(rf_est, rf_param_grid, n_jobs=25, cv=10, verbose=1)
    rf_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best RF Params:' + str(rf_grid.best_params_))
    print('Top N Features Best RF Score:' + str(rf_grid.best_score_))
    print('Top N Features RF Train Score:' + str(rf_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_rf = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': rf_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    # The first top_n_features rows of feature_imp_sorted_rf.
    features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
    print('Sample 10 Features from RF Classifier')
    print(str(features_top_n_rf[:10]))

    # AdaBoost
    ada_est = AdaBoostClassifier(random_state=0)
    ada_param_grid = {'n_estimators': [500], 'learning_rate': [0.01, 0.1]}
    ada_grid = model_selection.GridSearchCV(ada_est, ada_param_grid, n_jobs=25, cv=10, verbose=1)
    ada_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best Ada Params:' + str(ada_grid.best_params_))
    print('Top N Features Best Ada Score:' + str(ada_grid.best_score_))
    print('Top N Features Ada Train Score:' + str(ada_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_ada = pd.DataFrame({'feature': list(titanic_train_data_X),
                                           'importance': ada_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
    print('Sample 10 Features from Ada Classifier:')
    print(str(features_top_n_ada[:10]))

    # ExtraTrees
    et_est = ExtraTreesClassifier(random_state=0)
    et_param_grid = {'n_estimators': [500], 'min_samples_split': [3, 4], 'max_depth': [20]}
    et_grid = model_selection.GridSearchCV(et_est, et_param_grid, n_jobs=25, cv=10, verbose=1)
    et_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best ET Params:' + str(et_grid.best_params_))
    print('Top N Features Best ET Score:' + str(et_grid.best_score_))
    print('Top N Features ET Train Score:' + str(et_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_et = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': et_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_et = feature_imp_sorted_et.head(top_n_features)['feature']
    print('Sample 10 Features from ET Classifier:')
    print(str(features_top_n_et[:10]))

    # GradientBoosting
    gb_est = GradientBoostingClassifier(random_state=0)
    gb_param_grid = {'n_estimators': [500], 'learning_rate': [0.01, 0.1], 'max_depth': [20]}
    gb_grid = model_selection.GridSearchCV(gb_est, gb_param_grid, n_jobs=25, cv=10, verbose=1)
    gb_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best GB Params:' + str(gb_grid.best_params_))
    print('Top N Features Best GB Score:' + str(gb_grid.best_score_))
    print('Top N Features GB Train Score:' + str(gb_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_gb = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': gb_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_gb = feature_imp_sorted_gb.head(top_n_features)['feature']
    print('Sample 10 Features from GB Classifier:')
    print(str(features_top_n_gb[:10]))

    # DecisionTree
    dt_est = DecisionTreeClassifier(random_state=0)
    dt_param_grid = {'min_samples_split': [2, 4], 'max_depth': [20]}
    dt_grid = model_selection.GridSearchCV(dt_est, dt_param_grid, n_jobs=25, cv=10, verbose=1)
    dt_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    print('Top N Features Best DT Params:' + str(dt_grid.best_params_))
    print('Top N Features Best DT Score:' + str(dt_grid.best_score_))
    print('Top N Features DT Train Score:' + str(dt_grid.score(titanic_train_data_X, titanic_train_data_Y)))
    feature_imp_sorted_dt = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': dt_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_dt = feature_imp_sorted_dt.head(top_n_features)['feature']
    print('Sample 10 Features from DT Classifier:')
    print(str(features_top_n_dt[:10]))

    # Merge the five models' selections and drop duplicates, i.e. keep every feature that any
    # model considered important.
    features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_et, features_top_n_gb, features_top_n_dt],
                               ignore_index=True).drop_duplicates()

    features_importance = pd.concat([feature_imp_sorted_rf, feature_imp_sorted_ada, feature_imp_sorted_et,
                                     feature_imp_sorted_gb, feature_imp_sorted_dt], ignore_index=True)

    return features_top_n, features_importance

The code above can be simplified as follows:

def get_top_n_features(clf, params, cv, X_train, y_train, top_n_features):
    grid_search = model_selection.GridSearchCV(clf, params, n_jobs=-1, cv=cv, verbose=1)
    grid_search.fit(X_train, y_train)
    print('Top N Features Best {} Params:{}'.format(clf.__class__.__name__, str(grid_search.best_params_)))
    print('Top N Features Best {} Score:{}'.format(clf.__class__.__name__, str(grid_search.best_score_)))
    print('Top N Features {} Train Score:{}'.format(clf.__class__.__name__, str(grid_search.score(X_train, y_train))))
    # X_train is a DataFrame; list(X_train) yields its column (feature) names.
    feature_imp_sorted = pd.DataFrame({'feature': list(X_train),
                                       'importance': grid_search.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n = feature_imp_sorted.head(top_n_features)['feature']
    print('Sample 10 Features from {} Classifier'.format(clf.__class__.__name__))
    print(str(features_top_n[:10]))
    return features_top_n, feature_imp_sorted

2. Build the training and test sets from the selected features

# Feature engineering can produce a large number of features, some of which are correlated with
# each other. Too many features not only slows down training but may also cause overfitting, so
# we can let several models screen the features and keep the top n.
feature_to_pick = 30
feature_top_n, feature_importance = get_top_n_features(titanic_train_data_X, titanic_train_data_Y, feature_to_pick)
titanic_train_data_X = pd.DataFrame(titanic_train_data_X[feature_top_n])
titanic_test_data_X = pd.DataFrame(titanic_test_data_X[feature_top_n])

3. Visualize feature importance

rf_feature_imp = feature_importance[:10]
Ada_feature_imp = feature_importance[32:32 + 10].reset_index(drop=True)

# Compute relative feature importance
rf_feature_importance = 100.0 * (rf_feature_imp['importance'] / rf_feature_imp['importance'].max())
Ada_feature_importance = 100.0 * (Ada_feature_imp['importance'] / Ada_feature_imp['importance'].max())

# Get the indices of all target features
rf_important_idx = np.where(rf_feature_importance)[0]
Ada_important_idx = np.where(Ada_feature_importance)[0]

# Adapted from http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html
pos = np.arange(rf_important_idx.shape[0]) + .5

plt.figure(1, figsize=(18, 8))

plt.subplot(121)
plt.barh(pos, rf_feature_importance[rf_important_idx][::-1])
plt.yticks(pos, rf_feature_imp['feature'][::-1])
plt.xlabel('Relative Importance')
plt.title('RandomForest Feature Importance')

plt.subplot(122)
plt.barh(pos, Ada_feature_importance[Ada_important_idx][::-1])
plt.yticks(pos, Ada_feature_imp['feature'][::-1])
plt.xlabel('Relative Importance')
plt.title('AdaBoost Feature Importance')

plt.show()

Model fusion

Common model-fusion methods include Bagging, Boosting, Stacking and Blending.

Bagging

Bagging combines the predictions of several models, i.e. several base learners, by simple weighted averaging or voting. Its advantage is that the base learners can be trained in parallel. Random Forest is built on the bagging idea.
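As a minimal illustration of the bagging idea (the synthetic dataset and parameters below are assumptions for the sketch, not part of this article's pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data; each tree is fit on its own bootstrap sample,
# and the ensemble votes over the independently trained trees.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
score = cross_val_score(bag, X, y, cv=5).mean()
print(round(score, 3))
```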

Boosting

Boosting works a bit like learning from one's mistakes: each base learner is trained on top of the previous one and tries to compensate for the previous learner's errors. AdaBoost and Gradient Boosting, which we will use below, follow this idea.
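A small sketch of the error-correcting behaviour (synthetic data and parameters are assumptions): AdaBoost reweights the samples that earlier stumps got wrong, and `staged_score` exposes the training accuracy after each round.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic toy data; the default base learner is a depth-1 decision stump.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
# Training accuracy after each boosting round: later rounds fix earlier mistakes.
scores = list(ada.staged_score(X, y))
print(scores[0], scores[-1])
```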

Stacking

Stacking trains a new, second-level learner to combine the base learners of the previous layer. If Bagging is a linear combination of base classifiers, Stacking is a non-linear combination. Learners can be stacked layer upon layer into a net-like structure.

Compared with the previous two approaches, the Stacking framework does give some gain in accuracy, so Stacking is what we use for model fusion below.

Blending

Blending is very similar to Stacking, but it avoids the information-leakage problem: the meta-learner is trained only on base-model predictions over a held-out split.
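A minimal blending sketch (models, sizes and the synthetic data below are illustrative assumptions): the base models see only the training split, the meta-learner sees only their predictions on the untouched holdout.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, test_size=0.5, random_state=0)

# Base learners are fit only on the training split ...
base_models = [RandomForestClassifier(random_state=0), GradientBoostingClassifier(random_state=0)]
holdout_preds = np.column_stack(
    [m.fit(X_tr, y_tr).predict_proba(X_hold)[:, 1] for m in base_models])

# ... and the meta-learner is fit only on their holdout predictions, which is
# what prevents the leakage that plain stacking must handle with fold schemes.
meta = LogisticRegression().fit(holdout_preds, y_hold)
print(meta.score(holdout_preds, y_hold))
```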

The Stacking fusion in this example has two layers: the predictions of seven models (RandomForest, AdaBoost, ExtraTrees, GBDT, DecisionTree, KNN and SVM) form the input, on which an XGBoost model is then trained to make the final prediction.

from sklearn.model_selection import KFold

ntrain = titanic_train_data_X.shape[0]
ntest = titanic_test_data_X.shape[0]
SEED = 0  # for reproducibility
NFOLDS = 7  # set folds for out-of-fold prediction
kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)  # random_state requires shuffle=True

def get_out_fold(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.fit(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rf = RandomForestClassifier(n_estimators=500, warm_start=True, max_features='sqrt',max_depth=6,
min_samples_split=3, min_samples_leaf=2, n_jobs=-1, verbose=0)

ada = AdaBoostClassifier(n_estimators=500, learning_rate=0.1)

et = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, max_depth=8, min_samples_leaf=2, verbose=0)

gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.008, min_samples_split=3, min_samples_leaf=2, max_depth=5, verbose=0)

dt = DecisionTreeClassifier(max_depth=8)

knn = KNeighborsClassifier(n_neighbors = 2)

svm = SVC(kernel='linear', C=0.025)

x_train = titanic_train_data_X.values # Creates an array of the train data
x_test = titanic_test_data_X.values # Creates an array of the test data
y_train = titanic_train_data_Y.values

# Create our OOF train and test predictions. These base results will be used as new features
rf_oof_train, rf_oof_test = get_out_fold(rf, x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_out_fold(ada, x_train, y_train, x_test) # AdaBoost
et_oof_train, et_oof_test = get_out_fold(et, x_train, y_train, x_test) # Extra Trees
gb_oof_train, gb_oof_test = get_out_fold(gb, x_train, y_train, x_test) # Gradient Boost
dt_oof_train, dt_oof_test = get_out_fold(dt, x_train, y_train, x_test) # Decision Tree
knn_oof_train, knn_oof_test = get_out_fold(knn, x_train, y_train, x_test) # KNeighbors
svm_oof_train, svm_oof_test = get_out_fold(svm, x_train, y_train, x_test) # Support Vector

print("Training is complete")

x_train = np.concatenate((rf_oof_train, ada_oof_train, et_oof_train, gb_oof_train, dt_oof_train, knn_oof_train, svm_oof_train), axis=1)
x_test = np.concatenate((rf_oof_test, ada_oof_test, et_oof_test, gb_oof_test, dt_oof_test, knn_oof_test, svm_oof_test), axis=1)

from xgboost import XGBClassifier

gbm = XGBClassifier( n_estimators= 2000, max_depth= 4, min_child_weight= 2, gamma=0.9, subsample=0.8,
colsample_bytree=0.8, objective= 'binary:logistic', nthread= -1, scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)

Submit the predictions:

Submission = pd.DataFrame({'PassengerId': PassengerId, 'Survived': predictions})
Submission.to_csv('StackingSubmission.csv', index=False, sep=',')

Finally, a SklearnHelper class was written so that different models can be called conveniently through one interface.

class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params = dict(params or {})  # guard against params=None
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self, x, y):
        return self.clf.fit(x, y)
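A quick usage sketch of the helper class; the class definition is repeated here so the snippet is self-contained, and the toy data and parameters are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params = dict(params or {})
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

# Toy data; any estimator class that accepts random_state can be plugged in.
X, y = make_classification(n_samples=100, random_state=0)
rf = SklearnHelper(clf=RandomForestClassifier, seed=0, params={'n_estimators': 50})
rf.train(X, y)
print(rf.predict(X[:5]))
```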

This article is mainly based on:

https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python/notebook
https://blog.csdn.net/Koala_Tree/article/details/78725881
https://www.kaggle.com/mmueller/stacking-starter
