For continuous independent variables: first, create 10 or 20 bins (categories / groups) for the variable, then combine bins with similar WOE values, and finally replace each bin with its WOE value. Use the WOE values rather than the raw input values in your model.
For categorical independent variables: combine categories with similar WOE, then replace each resulting category with its WOE value. In other words, use WOE values rather than the raw categories in your model. The transformed variable becomes a continuous variable of WOE values and can be treated like any other continuous variable.
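A minimal sketch of this transformation, assuming a binary target where 1 = event (bad) and 0 = non-event (good), and the WOE convention $WOE_i = \ln\big((\#G_i/\#G)\,/\,(\#B_i/\#B)\big)$ implied by the $\#G_i$ / $\#B_i$ notation below (the function name and defaults are illustrative, not a fixed API):

```python
import numpy as np
import pandas as pd

def woe_transform(x, y, n_bins=10):
    """Bin a continuous variable x and replace each bin with its WOE.

    x : pd.Series, continuous predictor
    y : pd.Series, binary target (1 = event/bad, 0 = non-event/good)
    """
    # Equal-frequency binning; duplicate edges are dropped if x has many ties
    bins = pd.qcut(x, q=n_bins, duplicates='drop')
    counts = pd.crosstab(bins, y)
    good = counts[0] + 0.5   # +0.5 guards against empty cells (see rule 2 below)
    bad = counts[1] + 0.5
    woe = np.log((good / good.sum()) / (bad / bad.sum()))
    # Map each observation to the WOE of its bin
    return bins.map(woe)
```

The returned series can then be fed to the model in place of the raw variable.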
Guidelines for WOE-based binning:
Each category (bin) should have at least 5% of the observations.
Each category (bin) should preferably contain both non-events and events. If a bin has a zero count, you can add 0.5 to the number of events and non-events in that group: $\#G_{i} + 0.5$, $\#B_{i} + 0.5$.
The WOE should be distinct for each category. Similar groups should be aggregated.
For logistic regression, the WOE should preferably be monotonic, i.e. either increasing or decreasing across the ordered bins. (This is because logistic regression assumes a linear relationship between the logit and the independent variable.) A quick check is sketched after this list.
Missing values are binned separately (if they have not already been handled in preprocessing).
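A quick sketch of the monotonicity check from rule 4, assuming `woe` is a per-bin WOE series ordered by bin (e.g., the `woe` series computed inside `woe_transform` above) and continuing from the same imports:

```python
def woe_is_monotonic(woe):
    """True if the per-bin WOE values are entirely non-decreasing or non-increasing."""
    diffs = np.diff(woe.to_numpy())
    return bool(np.all(diffs >= 0) or np.all(diffs <= 0))
```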
Why combine categories with similar WOE? Because categories with similar WOE have almost the same proportion of events and non-events; in other words, those categories behave the same way.
Information value (IV)
WOE only considers the relative risk of each bin, without considering the proportion of accounts that fall in each bin (i.e., each bin's share of the total, or the absolute counts per bin). The information value can be used instead to assess the relative contribution of each bin.
Information value is one of the most useful techniques for selecting important variables in a predictive model. It ranks variables on the basis of their importance; a minimal computation sketch follows the cut-off list below.
Feature selection based on IV
If the IV statistic is:
Less than 0.02, then the predictor is not useful for modeling (separating the Goods from the Bads)
0.02 to 0.1, then the predictor has only a weak relationship to the Goods/Bads odds ratio
0.1 to 0.3, then the predictor has a medium strength relationship to the Goods/Bads odds ratio
0.3 to 0.5, then the predictor has a strong relationship to the Goods/Bads odds ratio.
Greater than 0.5, then the relationship is suspicious (too good to be true; double-check the variable)
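A minimal sketch of the IV computation, under the same assumptions as the WOE sketch above, using $IV = \sum_i (\%G_i - \%B_i) \cdot WOE_i$:

```python
def information_value(x, y, n_bins=10):
    """IV = sum over bins of (%good_i - %bad_i) * WOE_i."""
    bins = pd.qcut(x, q=n_bins, duplicates='drop')
    counts = pd.crosstab(bins, y)
    good = counts[0] + 0.5
    bad = counts[1] + 0.5
    pct_good = good / good.sum()
    pct_bad = bad / bad.sum()
    woe = np.log(pct_good / pct_bad)
    return float(((pct_good - pct_bad) * woe).sum())
```

Variables can then be ranked by IV and filtered against the cut-offs above.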
Information value is not an optimal feature (variable) selection method when you are building a classification model other than binary logistic regression (e.g., random forest or SVM), because the conditional log odds (which we predict in a logistic regression model) are closely tied to the calculation of weight of evidence. In other words, IV is designed mainly for binary logistic regression. Put another way: random forests can detect non-linear relationships very well, so selecting variables via information value and then using them in a random forest model might not produce the most accurate and robust predictive model.
```python
# Create a DataFrame from a dict of Series
d3 = {'one': pd.Series([1., 2., 3., 4.]),
      'two': pd.Series([4., 3., 2., 1.])}
df3 = pd.DataFrame(d3, index=[0, 2, 4], columns=['two', 'three'])
print(df3)
'''
   two  three
0  4.0    NaN
2  2.0    NaN
4  NaN    NaN
'''
```
Reading and writing files

| Method | Description |
|---|---|
| `pd.read_csv('foo.csv')` | Read a CSV file into a DataFrame |
| `df.to_csv('foo.csv')` | Write a DataFrame to a CSV file |
Demo: merging multiple CSV files
```python
import pandas as pd

def load_data_from_csv(obj):
    # Accept either a list of CSV paths or a single path
    if isinstance(obj, list) and len(obj) > 0:
        df_list = []
        for p in obj:
            df_list.append(pd.read_csv(p, header=0))
        all_data = pd.concat(df_list)
    else:
        all_data = pd.read_csv(obj, header=0)
    return all_data
```
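Usage might look like this (the file names are hypothetical):

```python
df_all = load_data_from_csv(['part1.csv', 'part2.csv'])  # hypothetical paths
df_one = load_data_from_csv('single.csv')
```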
`scipy.interpolate.interp1d` returns a function whose call method uses interpolation to find the values at new points:
```python
import numpy as np
from scipy import interpolate
import matplotlib.pyplot as plt
%matplotlib inline

x = np.arange(0, 10)
y = np.exp(-x / 3.0)
f = interpolate.interp1d(x, y)

xnew = np.arange(0, 9, 0.1)
ynew = f(xnew)  # use the interpolation function returned by `interp1d`
print(ynew)
plt.plot(x, y, 'o', xnew, ynew, '-')
plt.show()
```