Data Preprocessing
# Check which packages are installed
data_train = pd.read_csv('./train.csv')
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n2.1 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
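Missing Value Imputation
The EDA showed that a good deal of data is still missing, so we try several imputation strategies, compare the results, and keep the one that performs best.
# Find the numerical features
print(len(numerical_fea))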
41
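The cell that builds `numerical_fea` is cut off above, so the following is only a sketch of one common way to separate numerical from categorical columns in pandas. The toy frame and the name `category_fea` are illustrative assumptions, not the author's code.

```python
import pandas as pd

# Toy stand-in for data_train (hypothetical values)
df = pd.DataFrame({
    "loanAmnt": [35000.0, 18000.0],
    "term": [5, 3],
    "grade": ["E", "D"],
    "isDefault": [1, 0],
})

# Numerical features: every column with a numeric dtype
numerical_fea = list(df.select_dtypes(include="number").columns)
# Categorical features: whatever remains
category_fea = [c for c in df.columns if c not in numerical_fea]
# The label is not a feature, so take it out of the numerical list
numerical_fea.remove("isDefault")

print(len(numerical_fea))
```

Integer-coded categories such as `term` still count as numerical under this rule, so the split is by storage type rather than by meaning.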
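There are three ways to fill missing values:
- Replace missing values with a specified value (0)
- Replace each missing value with the value above it (forward fill)
- Fill vertically with the value below (backward fill), filling at most two consecutive missing values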
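The three strategies listed above map onto standard pandas calls. This is a minimal sketch on a toy column (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy column with a gap of three consecutive missing values
df = pd.DataFrame({"n0": [1.0, np.nan, np.nan, np.nan, 5.0]})

# 1. Replace missing values with a constant
filled_zero = df.fillna(0)

# 2. Forward fill: copy the value above into each gap
filled_ffill = df.ffill()

# 3. Backward fill: copy the value below, at most two consecutive rows
filled_bfill = df.bfill(limit=2)
```

With `limit=2` the three-row gap is only partially filled: the two rows nearest the next valid value receive it, and the remaining row stays missing.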
# Show the number of missing values per column
id 0
loanAmnt 0
term 0
interestRate 0
installment 0
grade 0
subGrade 0
employmentTitle 1
employmentLength 46799
homeOwnership 0
annualIncome 0
verificationStatus 0
issueDate 0
isDefault 0
purpose 0
postCode 1
regionCode 0
dti 239
delinquency_2years 0
ficoRangeLow 0
ficoRangeHigh 0
openAcc 0
pubRec 0
pubRecBankruptcies 405
revolBal 0
revolUtil 531
totalAcc 0
initialListStatus 0
applicationType 0
earliesCreditLine 0
title 1
policyCode 0
n0 40270
n1 40270
n2 40270
n2.1 40270
n4 33239
n5 40270
n6 40270
n7 40270
n8 40271
n9 40270
n10 33239
n11 69752
n12 40270
n13 40270
n14 40270
dtype: int64
fill_type = 0
# Fill numerical features with the mean
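The body of this cell is missing, so here is a hedged sketch of mean imputation. It assumes the statistics are computed on the training set and reused for the test set; the frames `train` and `test` and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the training and test sets
train = pd.DataFrame({"dti": [10.0, np.nan, 20.0], "revolUtil": [np.nan, 40.0, 60.0]})
test = pd.DataFrame({"dti": [np.nan, 5.0], "revolUtil": [30.0, np.nan]})
numerical_fea = ["dti", "revolUtil"]

# Compute the column means on the training set only
train_means = train[numerical_fea].mean()

# Apply the same means to both sets so they share one imputation rule
train[numerical_fea] = train[numerical_fea].fillna(train_means)
test[numerical_fea] = test[numerical_fea].fillna(train_means)
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.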
# Handle the unstructured field 'issueDate'
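One common way to make a date column usable by a model is to convert it to the number of days since a fixed reference date. The sketch below assumes `issueDate` is stored as `YYYY-MM-DD` strings; the reference date and the new column name `issueDateDT` are hypothetical choices made for illustration.

```python
import pandas as pd

# Toy values; the 'YYYY-MM-DD' format is an assumption
df = pd.DataFrame({"issueDate": ["2007-07-01", "2008-06-01"]})

# Parse the strings into timestamps
df["issueDate"] = pd.to_datetime(df["issueDate"], format="%Y-%m-%d")

# Hypothetical reference date: any date on or before the earliest record works
start_date = pd.Timestamp("2007-06-01")

# Days elapsed since the reference date, as a plain integer feature
df["issueDateDT"] = (df["issueDate"] - start_date).dt.days
```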
data_train['employmentLength'].value_counts(dropna=False).sort_index()
data['employmentLength'].value_counts(dropna=False).sort_index()
0.0 15989
1.0 13182
2.0 18207
3.0 16011
4.0 11833
5.0 12543
6.0 9328
7.0 8823
8.0 8976
9.0 7594
10.0 65772
NaN 11742
Name: employmentLength, dtype: int64
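The output above shows `employmentLength` after it has been converted to numbers from 0 to 10. The conversion itself is not shown, so this is a sketch under the assumption that the raw strings look like `'10+ years'`, `'< 1 year'`, and `'3 years'`; the helper name `employment_length_to_int` is hypothetical.

```python
import numpy as np
import pandas as pd

def employment_length_to_int(s):
    """Return the leading integer of strings like '3 years'; pass NaN through."""
    if pd.isnull(s):
        return s
    return int(s.split()[0])

# Toy values; the raw string formats are an assumption
df = pd.DataFrame({"employmentLength": ["10+ years", "< 1 year", "3 years", np.nan]})

# Normalize the two irregular labels so every string starts with a number
df["employmentLength"] = df["employmentLength"].replace(
    {"10+ years": "10 years", "< 1 year": "0 years"}
)
df["employmentLength"] = df["employmentLength"].apply(employment_length_to_int)
```

Missing values are left as NaN here so that the imputation step can treat them separately.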
'earliesCreditLine' is the month in which the borrower's earliest reported credit line was opened. It is stored as a str.
data_train['earliesCreditLine'].sample(5)
631614 May-2002
657973 Mar-1995
518715 Jan-2003
16630 Apr-1993
308502 Jun-2009
Name: earliesCreditLine, dtype: object
# The time span of this feature is large, so keeping only the year is enough
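Given the `Mon-YYYY` strings in the sample above, the year can be taken from the last four characters. A minimal sketch:

```python
import pandas as pd

# Values copied from the sample output above
df = pd.DataFrame({"earliesCreditLine": ["May-2002", "Mar-1995", "Jan-2003"]})

# Keep only the four-digit year at the end of each string
df["earliesCreditLine"] = df["earliesCreditLine"].apply(lambda s: int(s[-4:]))
```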
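Handling Categorical Features
nunique() returns the number of distinct categories in each feature
grade number of categories: 7
subGrade number of categories: 35
employmentTitle number of categories: 79282
homeOwnership number of categories: 6
verificationStatus number of categories: 3
purpose number of categories: 14
postCode number of categories: 889
regionCode number of categories: 51
applicationType number of categories: 2
initialListStatus number of categories: 2
title number of categories: 12058
policyCode number of categories: 1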
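A listing like the one above can be produced with a short loop over the categorical columns. This sketch uses a toy frame, and the list name `cate_features` is an assumption.

```python
import pandas as pd

# Toy frame with three of the categorical columns (made-up values)
df = pd.DataFrame({
    "grade": ["A", "B", "A", "C"],
    "applicationType": [0, 0, 1, 0],
    "policyCode": [1.0, 1.0, 1.0, 1.0],
})
cate_features = ["grade", "applicationType", "policyCode"]

# Count the distinct values in each categorical column
counts = {f: df[f].nunique() for f in cate_features}
for f, n in counts.items():
    print(f, "number of categories:", n)
```

A feature with a single category, as `policyCode` has in the listing above, is constant and carries no information for a model.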
semantic_dict = {
# Map grade to numeric values
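Because `grade` is ordinal, it can be mapped to integers that preserve the order. The listing above reports seven grades; the sketch assumes they are the letters A through G, which the source does not state.

```python
import pandas as pd

# Toy values; the letters A to G are an assumption about the seven grades
df = pd.DataFrame({"grade": ["E", "A", "G", "C"]})

# Ordinal mapping: a better grade gets a smaller number
grade_map = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}
df["grade"] = df["grade"].map(grade_map)
```

Any value missing from the dictionary becomes NaN under `map`, which makes unexpected categories easy to spot.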
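Outlier Handling
- First, if an outlier does not reflect any regular pattern and is a purely accidental occurrence, or if you simply do not want to study such occurrences, you can delete it.
- Second, if the outliers exist and represent a real phenomenon, they cannot be deleted casually. In fraud scenarios, fraudulent records are often anomalous relative to normal records, so we should keep these points, refit the model, and study their patterns. Use a supervised model where one is available; otherwise consider an anomaly detection algorithm.
- Note that rows in the test data must not be deleted.
Let's analyze the outliers in the numerical features. In the output below, 正常值 means normal value and 异常值 means outlier.
# Find outliers with the 3-sigma rule; at this point data has shape [200000, 148], and the added columns have become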
正常值 800000
Name: id_outliers, dtype: int64
id_outliers
正常值 159610
Name: isDefault, dtype: int64
**********
正常值 800000
Name: loanAmnt_outliers, dtype: int64
loanAmnt_outliers
正常值 159610
Name: isDefault, dtype: int64
**********
正常值 800000
Name: term_outliers, dtype: int64
term_outliers
正常值 159610
Name: isDefault, dtype: int64
**********
正常值 794259
异常值 5741
Name: interestRate_outliers, dtype: int64
interestRate_outliers
异常值 2916
正常值 156694
Name: isDefault, dtype: int64
**********
正常值 792046
异常值 7954
Name: installment_outliers, dtype: int64
installment_outliers
异常值 2152
正常值 157458
Name: isDefault, dtype: int64
**********
正常值 800000
Name: employmentTitle_outliers, dtype: int64
employmentTitle_outliers
正常值 159610
Name: isDefault, dtype: int64
**********
正常值 799701
异常值 299
Name: homeOwnership_outliers, dtype: int64
homeOwnership_outliers
异常值 62
正常值 159548
Name: isDefault, dtype: int64
**********
正常值 793973
异常值 6027
Name: annualIncome_outliers, dtype: int64
annualIncome_outliers
异常值 756
正常值 158854
Name: isDefault, dtype: int64
**********
正常值 800000
Name: verificationStatus_outliers, dtype: int64
verificationStatus_outliers
正常值 159610
Name: isDefault, dtype: int64
**********
正常值 783003
异常值 16997
Name: purpose_outliers, dtype: int64
purpose_outliers
异常值 3635
正常值 155975
Name: isDefault, dtype: int64
**********
正常值 798931
异常值 1069
Name: postCode_outliers, dtype: int64
postCode_outliers
异常值 221
正常值 159389
Name: isDefault, dtype: int64
**********
正常值 799994
异常值 6
Name: regionCode_outliers, dtype: int64
regionCode_outliers
异常值 1
正常值 159609
Name: isDefault, dtype: int64
**********
正常值 798440
异常值 1560
Name: dti_outliers, dtype: int64
dti_outliers
异常值 466
正常值 159144
Name: isDefault, dtype: int64
**********
正常值 778245
异常值 21755
Name: delinquency_2years_outliers, dtype: int64
delinquency_2years_outliers
异常值 5089
正常值 154521
Name: isDefault, dtype: int64
**********
正常值 788261
异常值 11739
Name: ficoRangeLow_outliers, dtype: int64
ficoRangeLow_outliers
异常值 778
正常值 158832
Name: isDefault, dtype: int64
**********
正常值 788261
异常值 11739
Name: ficoRangeHigh_outliers, dtype: int64
ficoRangeHigh_outliers
异常值 778
正常值 158832
Name: isDefault, dtype: int64
**********
正常值 790889
异常值 9111
Name: openAcc_outliers, dtype: int64
openAcc_outliers
异常值 2195
正常值 157415
Name: isDefault, dtype: int64
**********
正常值 792471
异常值 7529
Name: pubRec_outliers, dtype: int64
pubRec_outliers
异常值 1701
正常值 157909
Name: isDefault, dtype: int64
**********
正常值 794120
异常值 5880
Name: pubRecBankruptcies_outliers, dtype: int64
pubRecBankruptcies_outliers
异常值 1423
正常值 158187
Name: isDefault, dtype: int64
**********
正常值 790001
异常值 9999
Name: revolBal_outliers, dtype: int64
revolBal_outliers
异常值 1359
正常值 158251
Name: isDefault, dtype: int64
**********
正常值 799948
异常值 52
Name: revolUtil_outliers, dtype: int64
revolUtil_outliers
异常值 23
正常值 159587
Name: isDefault, dtype: int64
**********
正常值 791663
异常值 8337
Name: totalAcc_outliers, dtype: int64
totalAcc_outliers
异常值 1668
正常值 157942
Name: isDefault, dtype: int64
**********
正常值 800000
Name: initialListStatus_outliers, dtype: int64
initialListStatus_outliers
正常值 159610
Name: isDefault, dtype: int64
**********
正常值 784586
异常值 15414
Name: applicationType_outliers, dtype: int64
applicationType_outliers
异常值 3875
正常值 155735
Name: isDefault, dtype: int64
**********
正常值 775134
异常值 24866
Name: title_outliers, dtype: int64
title_outliers
异常值 3900
正常值 155710
Name: isDefault, dtype: int64
**********
正常值 800000
Name: policyCode_outliers, dtype: int64
policyCode_outliers
正常值 159610
Name: isDefault, dtype: int64
**********
正常值 782773
异常值 17227
Name: n0_outliers, dtype: int64
n0_outliers
异常值 3485
正常值 156125
Name: isDefault, dtype: int64
**********
正常值 790500
异常值 9500
Name: n1_outliers, dtype: int64
n1_outliers
异常值 2491
正常值 157119
Name: isDefault, dtype: int64
**********
正常值 789067
异常值 10933
Name: n2_outliers, dtype: int64
n2_outliers
异常值 3205
正常值 156405
Name: isDefault, dtype: int64
**********
正常值 789067
异常值 10933
Name: n2.1_outliers, dtype: int64
n2.1_outliers
异常值 3205
正常值 156405
Name: isDefault, dtype: int64
**********
正常值 788660
异常值 11340
Name: n4_outliers, dtype: int64
n4_outliers
异常值 2476
正常值 157134
Name: isDefault, dtype: int64
**********
正常值 790355
异常值 9645
Name: n5_outliers, dtype: int64
n5_outliers
异常值 1858
正常值 157752
Name: isDefault, dtype: int64
**********
正常值 786006
异常值 13994
Name: n6_outliers, dtype: int64
n6_outliers
异常值 3182
正常值 156428
Name: isDefault, dtype: int64
**********
正常值 788430
异常值 11570
Name: n7_outliers, dtype: int64
n7_outliers
异常值 2746
正常值 156864
Name: isDefault, dtype: int64
**********
正常值 789625
异常值 10375
Name: n8_outliers, dtype: int64
n8_outliers
异常值 2131
正常值 157479
Name: isDefault, dtype: int64
**********
正常值 786384
异常值 13616
Name: n9_outliers, dtype: int64
n9_outliers
异常值 3953
正常值 155657
Name: isDefault, dtype: int64
**********
正常值 788979
异常值 11021
Name: n10_outliers, dtype: int64
n10_outliers
异常值 2639
正常值 156971
Name: isDefault, dtype: int64
**********
正常值 799434
异常值 566
Name: n11_outliers, dtype: int64
n11_outliers
异常值 112
正常值 159498
Name: isDefault, dtype: int64
**********
正常值 797585
异常值 2415
Name: n12_outliers, dtype: int64
n12_outliers
异常值 545
正常值 159065
Name: isDefault, dtype: int64
**********
正常值 788907
异常值 11093
Name: n13_outliers, dtype: int64
n13_outliers
异常值 2482
正常值 157128
Name: isDefault, dtype: int64
**********
正常值 788884
异常值 11116
Name: n14_outliers, dtype: int64
n14_outliers
异常值 3364
正常值 156246
Name: isDefault, dtype: int64
**********
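The cell that produced the output above is truncated. As a sketch of the 3-sigma rule it names, a value is flagged when it lies more than three standard deviations from the column mean. The helper name `find_outliers_by_3sigma` and the English labels are assumptions; the output above uses the labels 正常值 and 异常值.

```python
import pandas as pd

def find_outliers_by_3sigma(data, fea):
    """Add a '<fea>_outliers' column labelling each row 'normal' or 'outlier'."""
    mean, std = data[fea].mean(), data[fea].std()
    lower, upper = mean - 3 * std, mean + 3 * std
    # Comparisons with NaN are False, so missing values are labelled 'normal'
    is_outlier = (data[fea] < lower) | (data[fea] > upper)
    data[fea + "_outliers"] = is_outlier.map({True: "outlier", False: "normal"})
    return data

# Toy data: twenty ordinary incomes and one extreme value
df = pd.DataFrame({
    "annualIncome": [10.0] * 20 + [1000.0],
    "isDefault": [0] * 19 + [1, 1],
})
df = find_outliers_by_3sigma(df, "annualIncome")

# How many rows fall in each group, and how many defaults each group holds
counts = df["annualIncome_outliers"].value_counts()
defaults = df.groupby("annualIncome_outliers")["isDefault"].sum()
print(counts)
print(defaults)
```

The rule is only meaningful for roughly continuous features. For an integer-coded category such as `applicationType`, which has two categories, the flagged rows are most likely just the rarer category rather than genuine anomalies.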
# Drop the rows that contain outliers
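A hedged sketch of the deletion step, applied to the training set only. It assumes the flag columns end in `_outliers` as in the output above and uses English labels in place of the Chinese ones.

```python
import pandas as pd

# Toy training frame with two flag columns (made-up values)
train = pd.DataFrame({
    "interestRate": [13.5, 30.9, 11.2, 12.0],
    "interestRate_outliers": ["normal", "outlier", "normal", "normal"],
    "dti_outliers": ["normal", "normal", "normal", "outlier"],
})

# Keep a row only if every flag column says 'normal'
outlier_cols = [c for c in train.columns if c.endswith("_outliers")]
keep = (train[outlier_cols] == "normal").all(axis=1)
train = train[keep].reset_index(drop=True)
```

The test set is left untouched, since every test row still needs a prediction.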
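Data Binning
Discretize continuous values (for example 999, 99, 8) by grouping them into bins.
# Map values to evenly spaced bins by division; the bin index is loanAmnt // 1000, so each bin is 1000 wide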
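The fixed-width scheme described in the comment can be sketched with `np.floor_divide`. The quantile variant with `pd.qcut` is an alternative added here for comparison and is not taken from the source; the column names `loanAmnt_bin1` and `loanAmnt_bin2` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"loanAmnt": [8.0, 99.0, 999.0, 3500.0, 12000.0, 35000.0]})

# Fixed-width bins: each bin covers a range of 1000
df["loanAmnt_bin1"] = np.floor_divide(df["loanAmnt"], 1000)

# Alternative (not from the source): quantile bins holding roughly equal row counts
df["loanAmnt_bin2"] = pd.qcut(df["loanAmnt"], 3, labels=False)
```

Fixed-width bins keep the original scale but can be very unevenly populated, while quantile bins balance the counts at the cost of unequal widths.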
Feature Encoding
Label-encoded features can be fed directly into tree models
# Label-encode: subGrade, postCode, title
100%|██████████| 4/4 [00:03<00:00, 1.13it/s]
Label Encoding complete
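The encoding cell is truncated, so this is a sketch of label encoding with scikit-learn's `LabelEncoder`. Fitting on the union of training and test values is an assumption made so that categories seen only in the test set still receive a code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames (made-up values)
train = pd.DataFrame({"subGrade": ["A1", "B2", "A1"]})
test = pd.DataFrame({"subGrade": ["B2", "C3"]})

for col in ["subGrade"]:
    le = LabelEncoder()
    # Fit on every value that occurs in either set
    le.fit(pd.concat([train[col], test[col]]).astype(str))
    train[col] = le.transform(train[col].astype(str))
    test[col] = le.transform(test[col].astype(str))

print("Label Encoding complete")
```

The integer codes follow the sorted order of the labels. Tree models can split on them directly, but linear models would read them as magnitudes.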
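Additional feature engineering required for models such as logistic regression
- Normalize the features and remove highly correlated ones
- Normalization helps training converge faster and more reliably, and keeps features with large scales from dominating those with small scales
- Removing correlated features makes the model easier to interpret and speeds up prediction.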
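The two steps above can be sketched as min-max scaling followed by dropping one feature from each highly correlated pair. The 0.95 threshold and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame: the two FICO columns move together, dti does not
df = pd.DataFrame({
    "ficoRangeLow": [660.0, 700.0, 740.0, 680.0],
    "ficoRangeHigh": [664.0, 704.0, 744.0, 684.0],
    "dti": [30.0, 10.0, 20.0, 15.0],
})

# Min-max normalization: rescale every column to the range [0, 1]
scaled = (df - df.min()) / (df.max() - df.min())

# Absolute Pearson correlation, upper triangle only so each pair appears once
corr = scaled.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of any pair whose correlation exceeds the threshold
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
reduced = scaled.drop(columns=to_drop)
```

In the outlier output earlier, `ficoRangeLow` and `ficoRangeHigh` produce identical counts, which suggests they are a natural candidate pair for this kind of pruning.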
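Feature Selection
Feature selection techniques prune useless features to reduce the complexity of the final model. The end goal is a parsimonious model
that computes faster with little or no loss in predictive accuracy. Feature selection is not about reducing training time
(in fact, some techniques increase total training time); it is about reducing model scoring time.
- Filter
  - Variance threshold
  - Correlation coefficient (Pearson correlation)
  - Chi-squared test
  - Mutual information
- Wrapper (RFE)
  - Recursive feature elimination
- Embedded
  - Penalty-based feature selection
  - Tree-based feature selection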
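Each of the three families above has a counterpart in scikit-learn's `feature_selection` module. The sketch below runs one representative of each on synthetic data; the estimators, the choice of five features, and the data are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    RFE, SelectFromModel, SelectKBest, VarianceThreshold, mutual_info_classif,
)
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# Filter: drop constant columns, then keep the 5 with the highest mutual information
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)
X_filter = SelectKBest(mutual_info_classif, k=5).fit_transform(X_var, y)

# Wrapper: recursive feature elimination around a logistic regression
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapper = rfe.fit_transform(X, y)

# Embedded: keep the 5 features a tree ensemble ranks as most important
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    max_features=5,
    threshold=-np.inf,  # select purely by rank, not by an importance cutoff
)
X_embedded = sfm.fit_transform(X, y)
```

Filter methods score each feature without training a model, wrapper methods retrain a model repeatedly, and embedded methods read importance from a single fitted model, which is why their costs differ so much.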
"Fill each missing value vertically with the value above it"
x_train = data_train
# Of course, you can also just look at the plot
<matplotlib.axes._subplots.AxesSubplot at 0x16522246898>
features = [f for f in data_train.columns if f not in ['id','issueDate','isDefault'] and '_outliers' not in f]
def cv_model(clf, train_x, train_y, test_x, clf_name):
def lgb_model(x_train, y_train, x_test):
************************************ 1 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] Unknown parameter: silent
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.748799 valid_1's auc: 0.730081
[400] training's auc: 0.764154 valid_1's auc: 0.730891
[600] training's auc: 0.777375 valid_1's auc: 0.730927
Early stopping, best iteration is:
[439] training's auc: 0.766861 valid_1's auc: 0.731
[0.7310002011064074]
************************************ 2 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] Unknown parameter: silent
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.748535 valid_1's auc: 0.731631
[400] training's auc: 0.764256 valid_1's auc: 0.732332
Early stopping, best iteration is:
[345] training's auc: 0.76031 valid_1's auc: 0.732483
[0.7310002011064074, 0.7324829219213177]
************************************ 3 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] Unknown parameter: silent
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.747855 valid_1's auc: 0.73267
[400] training's auc: 0.763207 valid_1's auc: 0.733776
[600] training's auc: 0.776409 valid_1's auc: 0.734096
[800] training's auc: 0.788911 valid_1's auc: 0.733663
Early stopping, best iteration is:
[628] training's auc: 0.778126 valid_1's auc: 0.734146
[0.7310002011064074, 0.7324829219213177, 0.7341455481432986]
************************************ 4 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] Unknown parameter: silent
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.749233 valid_1's auc: 0.727753
[400] training's auc: 0.764767 valid_1's auc: 0.728777
[600] training's auc: 0.777702 valid_1's auc: 0.728611
Early stopping, best iteration is:
[420] training's auc: 0.766087 valid_1's auc: 0.728853
[0.7310002011064074, 0.7324829219213177, 0.7341455481432986, 0.7288532795103251]
************************************ 5 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
[LightGBM] [Warning] Unknown parameter: silent
Training until validation scores don't improve for 200 rounds
[200] training's auc: 0.748281 valid_1's auc: 0.733124
[400] training's auc: 0.763463 valid_1's auc: 0.733781
[600] training's auc: 0.776921 valid_1's auc: 0.733684
Early stopping, best iteration is:
[536] training's auc: 0.772805 valid_1's auc: 0.733891
[0.7310002011064074, 0.7324829219213177, 0.7341455481432986, 0.7288532795103251, 0.7338908945947943]
lgb_scotrainre_list: [0.7310002011064074, 0.7324829219213177, 0.7341455481432986, 0.7288532795103251, 0.7338908945947943]
lgb_score_mean: 0.7320745690552286
lgb_score_std: 0.0019639611876869616
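The bodies of `cv_model` and `lgb_model` are not shown, but the log above follows the usual K-fold pattern: train on four folds, score AUC on the fifth, and report the mean and standard deviation. The sketch below reproduces that loop with a scikit-learn `LogisticRegression` swapped in for LightGBM so that it runs without extra dependencies. The function name `cv_auc`, the seed, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def cv_auc(make_model, X, y, X_test, n_splits=5, seed=2020):
    """Return averaged test predictions and the per-fold validation AUCs."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    test_pred = np.zeros(X_test.shape[0])
    scores = []
    for trn_idx, val_idx in kf.split(X, y):
        model = make_model()  # fresh model for every fold
        model.fit(X[trn_idx], y[trn_idx])
        val_pred = model.predict_proba(X[val_idx])[:, 1]
        scores.append(roc_auc_score(y[val_idx], val_pred))
        # Average the test predictions over the folds
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits
    return test_pred, scores

# Synthetic stand-in: 400 labelled rows and 100 unlabelled test rows
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

test_pred, scores = cv_auc(lambda: LogisticRegression(max_iter=1000), X_train, y_train, X_test)
print("score_mean:", np.mean(scores))
print("score_std:", np.std(scores))
```

Only the validation folds can be scored this way. The averaged test predictions have no labels to compare against, which appears to be the cause of the `KeyError: 'isDefault'` in the final cell.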
testA_result = pd.read_csv('./testA.csv')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\anaconda3\envs\tf13\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2888 try:
-> 2889 return self._engine.get_loc(casted_key)
2890 except KeyError as err:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'isDefault'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-69-262e8275a2db> in <module>
1 testA_result = pd.read_csv('./testA.csv')
----> 2 roc_auc_score(testA_result['isDefault'].values, lgb_test)
3
~\anaconda3\envs\tf13\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2897 if self.columns.nlevels > 1:
2898 return self._getitem_multilevel(key)
-> 2899 indexer = self.columns.get_loc(key)
2900 if is_integer(indexer):
2901 indexer = [indexer]
~\anaconda3\envs\tf13\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2889 return self._engine.get_loc(casted_key)
2890 except KeyError as err:
-> 2891 raise KeyError(key) from err
2892
2893 if tolerance is not None:
KeyError: 'isDefault'