2019-07-25

Data Analysis

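Credit Scorecard

Building a Scorecard

General Workflow

Data preprocessing
- Missing-value handling (imputation, or dropping the field)
- Outlier handling (dropping records)
- Exploratory data analysis, EDA (summary statistics for each field)
Feature selection
- Binning
- Feature selection
- Correlation analysis
Model training
- Logistic Regression
- Screening features by coefficient sign: inspect the coefficient of every feature in the model; with WOE-encoded inputs the coefficients are expected to share one sign, so a negative coefficient suggests that some features are strongly linearly correlated with each other
- Performance evaluation (AUC)
Score calculation

Binning

Equal-width binning
Equal-frequency binning
WOE-based binning
- WOE and IV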
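
As a small illustration of the first two binning strategies, the sketch below uses `pd.cut` for equal-width bins and `pd.qcut` for equal-frequency bins. The `ages` values and the choice of four bins are made up for the example and are not from the notes.

```python
import pandas as pd

# Hypothetical feature values, used only for illustration.
ages = pd.Series([22, 25, 31, 38, 45, 52, 60, 70])

# Equal-width binning: the value range is split into 4 intervals of equal length.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: the 4 intervals hold (roughly) the same number of rows.
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

Equal-width bins can end up with very uneven counts when the data are skewed, while equal-frequency bins keep the counts balanced at the cost of uneven interval lengths.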

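Feature Selection

IV-based feature selection
- WOE and IV
RF-based feature selection
Take the importance of each feature from a random forest

Score Calculation

First, a look at the form of a scorecard:

Feature	Range	Points
Base score		xxx
Feature 1	Bin 1	aaa
	Bin 2	bbb
	…
	Bin p	ppp
Feature 2	Bin 1	mmm
	Bin 2	nnn
	…
	Bin q	qqq
…	…	…

Because the final score is the sum of the base score and the points of the matching bin of each feature, the score is linear in the feature values.
From logistic regression we also have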

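$\ln(\frac{p}{1-p}) = \mathbf{\theta}^{T} \mathbf{x} = w_{0} + w_{1}x_{1} + ... + w_{n}x_{n}$

where $p$ is the predicted probability that a user is good, $\frac{p}{1-p}$ is the odds, and $x_{i}$ is the $i$-th feature.
Using this linear relationship, define the credit score $Score$ as:

$\begin{align*} Score &= A + B \times \ln(\frac{p}{1-p})\\ &= A + B \times (w_{0} + w_{1}x_{1} + ... + w_{n}x_{n})\\ &=(A + Bw_{0}) + Bw_{1}x_{1} + Bw_{2}x_{2} + ... + Bw_{n}x_{n} \end{align*}$

Here $x_{1}, …, x_{n}$ are the WOE encodings of the features. To make the scorecard calculation easier to see, expand each feature into its bins, which turns the expression above into: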

$\begin{align*} Score = (A + Bw_{0}) &+ Bw_{1}WOE_{1,1}\sigma_{1,1} + Bw_{1}WOE_{1,2}\sigma_{1,2} + ... + Bw_{1}WOE_{1,p}\sigma_{1,p}\\ &+ Bw_{2}WOE_{2,1}\sigma_{2,1} + Bw_{2}WOE_{2,2}\sigma_{2,2} + ... + Bw_{2}WOE_{2,q}\sigma_{2,q}\\ &+ ... \end{align*}$

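where $WOE_{1,2}$ is the WOE value of the second bin of feature 1, and $\sigma_{1,2}$ is an indicator variable taking the value 0 or 1; exactly one of $\sigma_{1,1}, \sigma_{1,2}, …, \sigma_{1,p}$ equals 1.
The scorecard can now be filled in. Note that the only input variables of the scorecard are the indicator variables $\sigma$, so the WOE values are folded into the points.

Feature	Range	Points
Base score		$A + Bw_{0}$
Feature 1	Bin 1	$Bw_{1}WOE_{1,1}$
	Bin 2	$Bw_{1}WOE_{1,2}$
	…
	Bin p	$Bw_{1}WOE_{1,p}$
Feature 2	Bin 1	$Bw_{2}WOE_{2,1}$
	Bin 2	$Bw_{2}WOE_{2,2}$
	…
	Bin q	$Bw_{2}WOE_{2,q}$
…	…	…

In the card above, $w$ comes from the logistic regression model and the WOE values come from the feature binning; the remaining $A, B$ can be treated as hyperparameters that are fixed in advance.
Since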

$Score = A + B \times \ln(odds)$

$A$ and $B$ can be fixed by two calibration conditions: when $odds = \theta_{0}$, $Score = P_{0}$; when the odds double to $odds = 2\theta_{0}$, $Score = P_{0} + PDO$ (PDO: points to double the odds).

$\begin{align*} &P_{0} = A + B \times \ln(\theta_{0})\\ &P_{0} + PDO = A + B \times \ln(2\theta_{0}) \end{align*}$

Solving gives:

$\begin{align*} &B = \frac{PDO}{\ln2}\\ &A = P_{0} - B\ln(\theta_{0}) \end{align*}$

A common industry convention: $Score = 600$ when $odds = \theta_{0} = 50$, and $Score = P_{0} + 20$ when the odds double (i.e. $PDO = 20$).
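
The calibration can be checked numerically. The sketch below computes $A$ and $B$ for the convention $P_{0} = 600$, $\theta_{0} = 50$, $PDO = 20$, and then fills in the points of one feature. The weights `w0`, `w1` and the WOE values in `woe_bins` are invented for illustration; in practice they would come from the fitted logistic regression and from the binning step.

```python
import math

def scorecard_constants(p0, theta0, pdo):
    """Return (A, B) such that Score = A + B * ln(odds)."""
    b = pdo / math.log(2)
    a = p0 - b * math.log(theta0)
    return a, b

A, B = scorecard_constants(p0=600, theta0=50, pdo=20)

def score_from_odds(odds):
    """Score implied by the calibration for a given good/bad odds."""
    return A + B * math.log(odds)

# Hypothetical model output: intercept w0, one feature weight w1,
# and the WOE value of each bin of that feature.
w0, w1 = 0.5, 0.8
woe_bins = {"bin1": -0.6, "bin2": 0.1, "bin3": 0.9}

base_points = A + B * w0                                   # "Base score" row
bin_points = {name: B * w1 * woe for name, woe in woe_bins.items()}

print(round(score_from_odds(50), 6), round(score_from_odds(100), 6))
print(round(base_points, 2), {k: round(v, 2) for k, v in bin_points.items()})
```

A record's score is then `base_points` plus the points of the single bin it falls into for each feature.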

2019-07-24

Data Analysis

Weight of Evidence (WOE)

weight of evidence (WOE)

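WOE encoding is comparable to dummy variables: it encodes a categorical feature as a numeric variable.
For example, in a credit scoring case the label is a binary value (Good, Bad). For a categorical feature (or a numeric feature after binning), the WOE of each category can then be defined as: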

$WOE_{i} = \ln{\frac{\frac{\#G_{i}}{\#G}}{\frac{\#B_{i}}{\#B}}} = \ln{\frac{\frac{\#G_{i}}{\#B_{i}}}{\frac{\#G}{\#B}}}$

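where $\#G_{i}$ is the number of Goods in the category, $\#G$ is the total number of Goods, $\#B_{i}$ is the number of Bads in the category, and $\#B$ is the total number of Bads.
The value inside $\ln$ therefore describes whether the Good/Bad ratio of this category resembles the Good/Bad ratio of the whole population.
If the distribution in this category is very unusual, the absolute value of the WOE is large, meaning the category discriminates strongly; if it is close to the overall distribution, the WOE is approximately 0 and the category discriminates weakly.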
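
A minimal sketch of the WOE formula, assuming the Good/Bad counts per category are already available. The counts and category names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Good/Bad counts for the categories of one feature.
counts = pd.DataFrame(
    {"good": [80, 60, 20], "bad": [10, 20, 30]},
    index=["low_risk", "mid_risk", "high_risk"],
)

good_dist = counts["good"] / counts["good"].sum()   # #G_i / #G
bad_dist = counts["bad"] / counts["bad"].sum()      # #B_i / #B

# WOE_i = ln( (#G_i / #G) / (#B_i / #B) )
counts["woe"] = np.log(good_dist / bad_dist)
print(counts)
```

Here `mid_risk` has a Good/Bad ratio close to the overall ratio, so its WOE is near 0, while the other two categories get WOE values of opposite sign and larger magnitude.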

WOE-based binning:

For continuous independent variables: First, create 10 or 20 bins (categories / groups) for a continuous independent variable and then combine categories with similar WOE values and replace categories with WOE values. Use WOE values rather than input values in your model.
For categorical independent variables: Combine categories with similar WOE and then create new categories of an independent variable with continuous WOE values. In other words, use WOE values rather than raw categories in your model. The transformed variable will be a continuous variable with WOE values, and can be treated like any other continuous variable.

WOE-based binning, guidelines:

Each category (bin) should have at least 5% of the observations.
Each category (bin) should be non-zero for both non-events and events (preferably satisfied). If not, you can add 0.5 to the number of events and non-events in a group: $\#G_{i} + 0.5$, $\#B_{i} + 0.5$
The WOE should be distinct for each category. Similar groups should be aggregated.
For logistic regression, the WOE should be monotonic, i.e. either growing or decreasing with the groupings (preferably satisfied). This is because logistic regression assumes a linear relationship between the logit function and the independent variable.
Missing values are binned separately, if they were not already handled during preprocessing.

Why combine categories with similar WOE?
It is because categories with similar WOE have almost the same proportion of events and non-events. In other words, the two categories behave the same way.

information value (IV)

WOE only considers the relative risk of each bin, without considering the proportion of accounts in each bin (the share of the population in each bin, or equivalently the absolute count in each bin).
The information value can be utilised instead to assess the relative contribution of each bin.

$IV = \sum_{i}(\frac{\#G_{i}}{\#G} - \frac{\#B_{i}}{\#B}) \times WOE_{i}$

Information value is one of the most useful techniques to select important variables in a predictive model. It helps to rank variables on the basis of their importance.
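
The IV formula can be sketched in a few lines. The helper name `information_value` and the counts are assumptions made for this example; the function expects per-bin Good and Bad counts that are all non-zero.

```python
import numpy as np

def information_value(good_counts, bad_counts):
    """IV = sum_i (G_i/G - B_i/B) * WOE_i for per-bin Good/Bad counts."""
    good = np.asarray(good_counts, dtype=float)
    bad = np.asarray(bad_counts, dtype=float)
    good_dist = good / good.sum()
    bad_dist = bad / bad.sum()
    woe = np.log(good_dist / bad_dist)
    return float(np.sum((good_dist - bad_dist) * woe))

# Hypothetical counts: one feature that separates Good from Bad,
# and one whose bins all mirror the overall Good/Bad ratio.
iv_informative = information_value([80, 60, 20], [10, 20, 30])
iv_uninformative = information_value([40, 40, 80], [15, 15, 30])
print(iv_informative, iv_uninformative)
```

Each term of the sum is non-negative, because the difference of the two distributions and the WOE always have the same sign, so IV is 0 only when every bin matches the overall distribution.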

IV-based feature selection

If the IV statistic is:

Less than 0.02, then the predictor is not useful for modeling (separating the Goods from the Bads)
0.02 to 0.1, then the predictor has only a weak relationship to the Goods/Bads odds ratio
0.1 to 0.3, then the predictor has a medium strength relationship to the Goods/Bads odds ratio
0.3 to 0.5, then the predictor has a strong relationship to the Goods/Bads odds ratio.
Greater than 0.5, suspicious relationship (check once)

Information value is not an optimal feature (variable) selection method when you are building a classification model other than binary logistic regression (e.g. random forest or SVM), because the conditional log odds (which a logistic regression model predicts) are closely tied to the calculation of weight of evidence. In other words, it is designed mainly for binary logistic regression. Also consider that a random forest can detect non-linear relationships very well, so selecting variables via information value and then using them in a random forest might not produce the most accurate and robust predictive model.

2019-07-17

Program Development

Python Pandas

Series

Series is a one-dimensional labeled array capable of holding any data type.

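Method	Description
`pd.Series(data, index)`	Create a Series from a `list`, a 1-D NumPy array, a `dict`, etc.
	The length of `index` must match the length of `data`. When creating from a `dict` the lengths may differ: `index` selects among the `keys`, and any label that is not one of the `keys` gets the value `NaN`.
`Series.to_numpy()`	Convert directly to an `ndarray`

Indexing, slicing, arithmetic and similar operations work as they do for ndarray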

import numpy as np
import pandas as pd

# Create from a list-like (here a range)
s1 = pd.Series(range(5))
'''
0    0
1    1
2    2
3    3
4    4
dtype: int64
'''
# Create from a 1-D ndarray
s2 = pd.Series(np.arange(5, dtype=np.uint64), index=['a', 'b', 'c', 'd', 'e'])
'''
a    0
b    1
c    2
d    3
e    4
dtype: uint64
'''
# Create from a dict; the label 'xx' is not a key, so its value is NaN
s3 = pd.Series({'a':'0', 'b':'1', 'c':'2', 'd':'3', 'e':'4'}, index=['a', 'b', 'c', 'xx'])
'''
a       0
b       1
c       2
xx    NaN
dtype: object
'''

DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

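Creating

Method	Description
`pd.DataFrame(data, index, columns)`	Create a DataFrame from a 2-D NumPy array, a `dict`, etc.
	The lengths of `index` and `columns` must match the `shape` of `data`. When creating from a `dict` they may differ: `columns` selects among the `keys` (the `values` of the `dict` are sequences, which supply the `index`), and any label that is not one of the `keys` gets the value `NaN`.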

# Create from a 2-D ndarray
d1 = np.arange(12).reshape(3, 4)
df1 = pd.DataFrame(d1, index=[997, 998, 999], columns=['A', 'B', 'C', 'D'])
print(df1)
'''
     A  B   C   D
997  0  1   2   3
998  4  5   6   7
999  8  9  10  11
'''

# Create from a dict of lists; only the column 'one' is kept
d2 = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
df2 = pd.DataFrame(d2, columns=['one'])
print(df2)
'''
   one
0  1.0
1  2.0
2  3.0
3  4.0
'''

# Create from a dict of Series; missing labels ('three', index 4) become NaN
d3 = {'one': pd.Series([1., 2., 3., 4.]), 'two': pd.Series([4., 3., 2., 1.])}
df3 = pd.DataFrame(d3, index=[0, 2, 4], columns=['two', 'three'])
print(df3)
'''
   two three
0  4.0   NaN
2  2.0   NaN
4  NaN   NaN
'''

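Reading and Writing

Method	Description
`pd.read_csv('foo.csv')`	Read a CSV file into a DataFrame
`df.to_csv('foo.csv')`	Write a DataFrame to a CSV file

Demo: merging multiple files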

def load_data_from_csv(obj):
    if isinstance(obj, list) and len(obj) > 0:
        df_list = []
        for p in obj:
            df_list.append(pd.read_csv(p, header=0))
        all_data = pd.concat(df_list)
    else:
        all_data = pd.read_csv(obj, header=0)

    return all_data

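Combining

Method	Description
`pd.concat(df_list)`	Adds rows: concatenates several DataFrames vertically
`pd.merge(left, right, how='inner', on=None)`	Adds columns: joins two DataFrames; `how` sets the join type (left, right, inner, outer) and `on` names the key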
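
To make the difference between the two concrete, here is a small sketch with two made-up frames, `left` and `right`, that share a `key` column.

```python
import pandas as pd

# Hypothetical frames sharing the column "key".
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# concat stacks rows; ignore_index renumbers the result 0..n-1.
stacked = pd.concat([left, left], ignore_index=True)

# merge joins on the key: inner keeps keys present in both frames,
# outer keeps every key and fills the gaps with NaN.
inner = pd.merge(left, right, how="inner", on="key")
outer = pd.merge(left, right, how="outer", on="key")

print(stacked.shape, inner.shape, outer.shape)
```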

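Inspecting

Method	Description
`df1.head(1)`	Show the first rows (here 1; the default is 5)
`df1.tail(1)`	Show the last rows (here 1; the default is 5)
`df1.index`	Show the index
`df1.columns`	Show the columns
	Select columns by name with a boolean list: `df1[df1.columns[bool_list]]`
`df1.dtypes`	Show the dtype of each column
`df1.info()`	Show the non-null counts and related information
`df1.describe()`	Show basic summary statistics

# Numeric columns: summary statistics plus the fraction of missing values
df1.describe().T.assign(missing_pct=df1.apply(lambda x: (len(x) - x.count()) / len(x)))
# Non-numeric columns
df1.select_dtypes(include=['object']).describe().T.assign(missing_pct=df1.apply(lambda x: (len(x) - x.count()) / len(x)))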

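Selecting and Modifying

Method	Description
Selecting columns
`df1['A']`	Select a single column by name
`df1[['A', 'B']]`	Select several columns by a list of names
Selecting rows
`df1[0:1]`	Select consecutive rows directly with a slice
`df1.loc[997]`	Select a single row by index label (labels need not be numeric)
`df1.loc[997:999]`	Select consecutive rows by index label (unlike ordinary slicing, the end label `999` is included)
`df1.loc[[997, 999]]`	Select non-consecutive rows by index label
`df1.iloc[0]`	Select a single row by position
`df1.iloc[0:2]`	Select consecutive rows by position
`df1.iloc[[0, 2]]`	Select non-consecutive rows by position
Selecting rows by condition
`df1[bool_list]`	Select rows with a boolean Series whose length equals the number of rows
`df1[df1['col1'] == 10]`
`df1[df1['col1'] != 10]`
`df1[df1['col1'] > 10]`
`df1[df1['col1'].isnull()]`
`df1[df1['col1'].isin([10, 20])]`
`df1[(df1['col1'] == 10) & (df1['col2'].isin([10, 20]))]`	Each condition needs its own parentheses, because `&` binds more tightly than `==`
Selecting columns by condition
`df[df.columns[bool_list]]`	Select column names with a boolean Series whose length equals the number of columns, then select the columns
`df1[df1.columns[df1.isnull().all(axis=0)]]`
`df.iloc[:, bool_list]`	Select columns directly with a boolean sequence whose length equals the number of columns
`df1.iloc[:, df1.isnull().all(axis=0).values]`
`df1.iloc[:, [df1[col].isnull().all() for col in df1.columns]]`
Deleting columns
`del df1['A']`
Deleting rows
`df1.drop(index_list, inplace=True)`
Adding rows
`df1.append(s, ignore_index=True)`	Append a row and return a new DataFrame; with `ignore_index=True` the result is renumbered rather than taking the index of `s`. `DataFrame.append` was removed in pandas 2.0; use `pd.concat` there
Adding columns
`df1['K'] = 'k'`	The scalar `'k'` is broadcast to every row
`df1['K'] = list`
`df1.assign(new_colname=list)`	Returns a new DataFrame; `df1` itself is unchanged
	`df1.assign(new_colname=df1['temp_c'] * 9 / 5 + 32)`
`df1.assign(new_colname=func)`	`func` is called with `df1` as its argument
	`df1.assign(new_colname=lambda x: x.temp_c * 9 / 5 + 32)`

`df1.sort_values(by='K')`	Sort by value
`df1.apply(func)`	`func` is applied to each column of `df1` (the default `axis=0`), giving one result per column
	`df1.apply(lambda x: x.max() - x.min())`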
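
Two points from the table are easy to get wrong: combining boolean conditions, and the inclusive end label of `.loc` slices. The sketch below uses a made-up frame `df` with columns `col1` and `col2`.

```python
import pandas as pd

# Hypothetical data with the default integer index 0, 1, 2.
df = pd.DataFrame({"col1": [10, 10, 30], "col2": [10, 99, 20]})

# Parenthesise each condition: & has higher precedence than ==.
mask = (df["col1"] == 10) & (df["col2"].isin([10, 20]))
selected = df[mask]

# Label-based slices include the end label; position-based slices do not.
by_label = df.loc[0:1]      # rows labelled 0 and 1
by_position = df.iloc[0:1]  # only the row at position 0

print(selected)
print(len(by_label), len(by_position))
```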

Working with NaN:

Drop NaN: df1.dropna()
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0. Pitfall: `axis` here names what gets dropped, which may differ from how `axis` is read elsewhere!
  - 0, or ‘index’ : Drop rows which contain missing values.
  - 1, or ‘columns’ : Drop columns which contain missing value.
- how : {‘any’, ‘all’}, default ‘any’
  - ‘any’ : If any NA values are present, drop that row or column.
  - ‘all’ : If all values are NA, drop that row or column.
- thresh : int, optional
  - A middle ground between ‘any’ and ‘all’: a row or column is kept only if it has at least ‘thresh’ non-NA values
- subset : array-like, optional
  - Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
  - Specifies which fields to examine
- inplace : bool, default False
  - If True, do operation inplace and return None.
Replace NaN: df1.fillna()
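
The effect of each `dropna` parameter is easiest to see on a tiny frame. The data below are made up so that the three rows hold 3, 2 and 1 non-NaN values respectively.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 lacks "a", row 2 only has "c".
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, 9.0],
})

any_rows = df.dropna()                 # drop rows containing any NaN
all_rows = df.dropna(how="all")        # drop rows that are entirely NaN
thresh_rows = df.dropna(thresh=2)      # keep rows with at least 2 non-NaN values
subset_rows = df.dropna(subset=["b"])  # only look at column "b"
full_cols = df.dropna(axis=1)          # drop columns containing any NaN
filled = df.fillna(0)                  # replace NaN with 0

print(list(any_rows.index), list(thresh_rows.index), list(full_cols.columns))
```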

Aggregation

groupby

For example: df1.groupby('A').sum(), df1.groupby('A').size()
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
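
The split, apply and combine steps can be seen on a made-up frame with a grouping column `A` and a numeric column `B`.

```python
import pandas as pd

# Hypothetical data: two groups, "x" and "y".
df = pd.DataFrame({"A": ["x", "y", "x", "y", "x"], "B": [1, 2, 3, 4, 5]})

sums = df.groupby("A")["B"].sum()   # one sum per group
sizes = df.groupby("A").size()      # number of rows per group

print(sums)
print(sizes)
```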

pivot_table

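For example: pd.pivot_table(df1, index=['sex', 'smoker'], columns='day', values='some_metric', aggfunc=sum, margins=True)
- index: the keys to aggregate by; each combination of their values becomes one row of the result
- columns: the keys spread across the columns; each combination of their values becomes one column of the result (if omitted, `values` is aggregated directly)
- values: which values to aggregate
- aggfunc: which aggregation function to use
- margins: whether to add totals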
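
A minimal sketch of these parameters on made-up data. The frame `tips` and its columns are hypothetical, and the aggregation is passed as the string `"sum"`.

```python
import pandas as pd

# Hypothetical records: tip amounts by sex and day.
tips = pd.DataFrame({
    "sex": ["F", "F", "M", "M"],
    "day": ["Sat", "Sun", "Sat", "Sat"],
    "tip": [1.0, 2.0, 3.0, 4.0],
})

table = pd.pivot_table(
    tips,
    index="sex",      # one row per value of "sex"
    columns="day",    # one column per value of "day"
    values="tip",     # the quantity being aggregated
    aggfunc="sum",
    margins=True,     # adds the "All" row and column with totals
)
print(table)
```

Combinations that never occur in the data (here M on Sun) show up as NaN in the result.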

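Time Index

For a time series the index is time, and it can be converted to the time index type provided by pandas.

Generate a time index: pd.date_range('2019-01-01', periods=72, freq='H') (recent pandas releases spell the hourly alias 'h')
Convert an existing field to pandas datetime: pd.to_datetime(col, format='%d.%m.%Y')
- Set it as the index: df1.set_index(col, inplace=True); `inplace` modifies the DataFrame in place instead of creating a new one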
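
The steps above can be sketched as follows. The frame `df` and its `day` column are made up, and the hourly frequency is given as `pd.Timedelta(hours=1)` rather than a string alias, since the spelling of the alias differs between pandas versions.

```python
import pandas as pd

# 72 hourly timestamps starting at midnight on 2019-01-01.
idx = pd.date_range("2019-01-01", periods=72, freq=pd.Timedelta(hours=1))

# Hypothetical frame whose "day" column holds dates as dd.mm.yyyy strings.
df = pd.DataFrame({"day": ["01.01.2019", "02.01.2019", "03.01.2019"],
                   "value": [1, 2, 3]})
df["day"] = pd.to_datetime(df["day"], format="%d.%m.%Y")
df.set_index("day", inplace=True)   # modifies df in place

print(idx[0], idx[-1])
print(df.loc[pd.Timestamp("2019-01-02"), "value"])
```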

2019-07-17

Program Development

Python Scipy

Common Methods

FFT

import numpy as np
from scipy.fftpack import fft, ifft

a = np.arange(16)

fa = fft(a)
print(fa)
'''
[120. +0.j          -8.+40.21871594j  -8.+19.3137085j   -8.+11.9728461j
  -8. +8.j          -8. +5.3454291j   -8. +3.3137085j   -8. +1.59129894j
  -8. +0.j          -8. -1.59129894j  -8. -3.3137085j   -8. -5.3454291j
  -8. -8.j          -8.-11.9728461j   -8.-19.3137085j   -8.-40.21871594j]
'''

aa = ifft(fa)
print(aa.real)
'''
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15.]
'''

Linear Interpolation

`scipy.interpolate.interp1d(x, y, kind='linear')`

returns a function whose call method uses interpolation to find the value of new points

import numpy as np
from scipy import interpolate
import matplotlib.pyplot as plt
# %matplotlib inline  # IPython/Jupyter magic; uncomment only inside a notebook

x = np.arange(0, 10)
y = np.exp(-x/3.0)
f = interpolate.interp1d(x, y)

xnew = np.arange(0, 9, 0.1)
ynew = f(xnew)   # use interpolation function returned by `interp1d`
print(ynew)
plt.plot(x, y, 'o', xnew, ynew, '-')
plt.show()

2019-07-15

Program Development

Python Numpy

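The ndarray Object

An N-dimensional array of elements of the same type

Creating

Method	Description
`np.array(object)`	Create an ndarray from object (typically a list)
`np.asarray(object)`	Create an ndarray from object (typically a list)
	Given a list the two behave the same; given an ndarray, `array` makes a new copy while `asarray` returns the original ndarray itself (when no dtype conversion is needed)
`np.empty(shape, dtype)`	Create an uninitialized array of the given shape (a tuple) and dtype
	Note that empty means uninitialized: the elements hold arbitrary values, not zeros
`np.zeros(shape, dtype)`	Create an array of the given shape and dtype filled with 0
`np.ones(shape, dtype)`	Create an array of the given shape and dtype filled with 1
`np.arange(start, stop, step, dtype)`	Create a fixed-step array of the given dtype from `start`, `stop`, `step`
`np.linspace(start, stop, num=50, endpoint=True, retstep=False)`	Create an arithmetic sequence from `start`, `stop`, `num`; `endpoint` says whether `stop` is included and `retstep` whether the step size is also returned
`np.logspace(start, stop, num=50, endpoint=True, base=10.0)`	Create a geometric sequence from `start`, `stop`, `num`; the sequence starts at `base ** start` and ends at `base ** stop`
`np.random.rand(d0, d1, ..., dn)`	Random values in `[0, 1)`; the `d` arguments give the shape
`ndarray.copy()`	Copy of the array with its own data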
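
The difference between `np.array` and `np.asarray`, and the endpoints of `linspace` and `logspace`, can be checked directly. The variable names below are chosen for this example only.

```python
import numpy as np

src = np.zeros(3)
copied = np.array(src)      # new array with its own data
aliased = np.asarray(src)   # the same object as src (no dtype change needed)

src[0] = 1.0                # visible through aliased, not through copied

lin = np.linspace(0.0, 1.0, num=5)     # 5 evenly spaced values, stop included
log = np.logspace(0, 2, num=3)         # 10**0, 10**1, 10**2

print(copied[0], aliased[0], aliased is src)
print(lin, log)
```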

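Attributes

Attribute	Description
`ndarray.shape`	Size of the array along each dimension (element counts, not the number of dimensions)
`ndarray.size`	Total number of elements (the product of the values in shape)
`ndarray.dtype`	Element type of the array
`ndarray.ndim`	Number of axes, i.e. the number of dimensions (rank). Axes are numbered from `0` to `ndim-1`

The Concept of an Axis

Many operations accept an axis argument. axis=i means the operation runs along the i-th axis, i.e. in the direction in which the i-th index varies.
For a 2-D array, for example: axis=0 operates along axis 0, which means holding the other axes fixed and traversing the first index, so each step processes one column of data (a column-wise operation); axis=1 operates along axis 1, i.e. row-wise.

A more concrete example: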

arr = np.arange(16).reshape(2, 4, 2)
'''
array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15]]])
'''
arr.sum(axis=0)
'''
array([[ 8, 10],
       [12, 14],
       [16, 18],
       [20, 22]])
'''

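In the example, arr has shape (2, 4, 2), and the positions within the shape tuple are (0, 1, 2).
axis=0 operates on the first position of the shape. From (2, 4, 2), 2 elements are combined each time and the remaining (4, 2) is the shape of the result.
In the computation, arr[0][a][b]+arr[1][a][b] gives the value of the result s[a][b].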

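Slicing

Note: slicing an array returns a view, not a copy. This differs from a Python list, whose slices are copies. Boolean and fancy indexing, by contrast, return copies.

Method	Description
`arr[idx]`	1-D, index
`arr[start:stop:step]`	1-D, slice
`arr[start:]`	1-D, from start to the end
`arr[np.array([(idx1, idx2), (idx3, idx4)])]`	Indexing a 1-D array with a 2-D index array gives a 2-D result
`arr[1]`	2-D, second row
`arr[1, 1]`	2-D, second row, second column
`arr[:, 1]`	2-D, all rows, second column
`arr[-1, :]`	2-D, last row, all columns
`arr[..., 1]`	2-D, all rows, second column
	For more dimensions, e.g. a 3-D array, `arr[..., 1] == arr[:, :, 1]`
`arr[..., 1:]`	2-D, all rows, second through last column
`arr[0:1, 1:2]`	2-D, first row only and second column only (the stop index is exclusive), giving shape (1, 1)
`arr[[0,1,2], [0,1,0]]`	2-D, integer-array indexing with separate row and column index lists, i.e. the elements `arr[0, 0]`, `arr[1, 1]`, `arr[2, 0]`
`arr[arr > 5]`	Boolean indexing: `arr > 5` yields a boolean array with the same shape as arr, which is then used as the index
`arr[[4, 2, 1, 7]]`	Fancy indexing, where the index is an array. For a 1-D target the result is the elements at those positions; for a 2-D target it is the rows with those indices.

`np.where(condition)`	Same as `np.asarray(condition).nonzero()`, with False treated as 0. Since `nonzero` returns indices, select elements with `arr[np.where(condition)]`
`np.where(condition, x, y)`	Where condition holds, output the element of x at that position, otherwise the element of y. This is like choosing element-wise over `zip(condition, x, y)` (`zip` groups the items that share an index and stops at the shortest input; `zip(*...)` unpacks them again), except that `np.where` broadcasts its arguments

A `:` after a number denotes a range; a `:` on its own selects everything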
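
The view-versus-copy distinction and the meaning of integer-array indexing can be verified with a small array. The names below are chosen for this example only.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# Basic slicing returns a view: writing through it changes arr.
view = arr[0:1, 1:2]          # shape (1, 1), the element at row 0, column 1
view[0, 0] = 100

# Integer-array indexing picks elements (0, 0), (1, 1), (2, 0) and returns a copy.
picked = arr[[0, 1, 2], [0, 1, 0]]
picked[0] = -1                # arr[0, 0] is unaffected

# Boolean indexing and np.where.
big = arr[arr > 8]                        # 1-D array of matching elements
clipped = np.where(arr > 8, arr, 0)       # keep matches, zero elsewhere

# A Python list slice, by contrast, is a copy.
lst = [0, 1, 2]
sl = lst[0:2]
sl[0] = 9

print(arr[0, 1], arr[0, 0], picked, big, lst)
```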

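Reshaping

Method	Description
`ndarray.reshape()`	Change the shape; returns a view when possible, otherwise a copy. A -1 in the shape means that size is inferred from the other dimensions
`np.resize(arr, shape)`	Change the shape; returns a new array (a larger result is filled with repeated copies of arr)
`ndarray.flat`	Iterator over the array elements (can be indexed directly, `arr.flat[3]`)
`ndarray.ravel()`	Flattened array, as a view when possible
`ndarray.flatten()`	Flattened array, always a copy (the default order "C" flattens row by row; order "F" flattens column by column)
`ndarray.T`	Transposed array
`np.transpose(arr, axes)`	Transpose; axes gives the order of the dimensions after transposing. For a 3-D array the identity order is [0, 1, 2]; with [1, 0, 2] the first and second indices of every element are swapped
`np.expand_dims(arr, axis)`	Add a dimension; `axis` is the position of the new axis. For example, an array of shape (2, 2) expanded with `axis=0` has the new shape (1, 2, 2)
`np.squeeze(arr, axis)`	Remove dimensions of size 1 from the shape, e.g. shape (1, 2, 2) -> shape (2, 2). `axis` can name the dimension to remove, but that dimension must have size 1, otherwise an error is raised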
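
A short sketch of how these functions change the shape, using throwaway arrays defined only for this example.

```python
import numpy as np

a = np.arange(6)
r = a.reshape(2, -1)               # -1 is inferred as 3, giving shape (2, 3)

e = np.expand_dims(r, axis=0)      # shape (1, 2, 3)
s = np.squeeze(e, axis=0)          # back to shape (2, 3)

t = np.transpose(np.zeros((2, 3, 4)), (1, 0, 2))   # first two axes swapped

row_major = r.flatten()            # order "C": row by row
col_major = r.flatten("F")         # order "F": column by column

print(r.shape, e.shape, s.shape, t.shape)
print(row_major, col_major)
print(np.shares_memory(a, r), np.shares_memory(r, row_major))
```

Here `reshape` can return a view because `a` is contiguous, so `a` and `r` share memory, while `flatten` always allocates new data.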

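Joining

Method	Description
`np.concatenate((a1, a2, ...), axis)`	`axis=0` adds rows; `axis=1` adds columns; `axis=None` flattens the inputs before joining
`np.vstack(tup)`	Adds rows; the argument is a tuple such as `(a1, a2, ...)`; vertical direction
`np.hstack(tup)`	Adds columns; horizontal direction
`np.r_[ar1, ar2]`	Adds rows (row)
`np.c_[ar1, ar2]`	Adds columns (column)
`np.dstack(tup)`	Adds a dimension and stacks along it (depth)
`np.stack(arrays, axis)`	Adds a dimension and stacks along the new one. `axis` says where the new dimension goes: `axis=0` makes it the first dimension; `axis=-1` makes it the last (like dstack for 2-D inputs)
	Alternatively, expand each array with `np.expand_dims` at `axis=0`, join along the rows, then `transpose`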

demo

a = np.zeros((2, 3), dtype=np.uint64)
'''
[[0 0 0]
 [0 0 0]]
'''
b = np.ones((2, 3), dtype=np.uint64)
'''
[[1 1 1]
 [1 1 1]]
'''
# The shapes of a and b may differ only along the axis being joined; all other dimensions must match

c1 = np.concatenate((a, b))
c11 = np.vstack((a, b))  # same result as np.r_[a, b]
'''
[[0 0 0]
 [0 0 0]
 [1 1 1]
 [1 1 1]]
'''
c2 = np.concatenate((a, b), axis=1)
c22 = np.hstack((a, b))  # same result as np.c_[a, b]
'''
[[0 0 0 1 1 1]
 [0 0 0 1 1 1]]
'''

# np.stack creates a new dimension at the position given by axis; concatenate, vstack and hstack above do not add one
cc1 = np.stack((a, b))
'''
[[[0 0 0]
  [0 0 0]]

 [[1 1 1]
  [1 1 1]]]
'''
print(cc1.shape)
# (2, 2, 3)

# Stack along the depth direction
cc2 = np.stack((a, b), axis=2)  # same result as np.dstack((a, b))
'''
[[[0 1]
  [0 1]
  [0 1]]

 [[0 1]
  [0 1]
  [0 1]]]
'''
print(cc2.shape)
# (2, 3, 2)

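Adding and Removing Rows and Columns

Method	Description
`np.append(arr, values, axis=None)`	Append at the end; `axis` selects rows or columns and the shapes must line up. With `axis=None` both inputs are flattened first
`np.insert(arr, obj, values, axis)`	Insert `values` before index `obj` along the axis given by `axis`; the shapes must line up
`np.delete(arr, obj, axis)`	Delete the data (a row or column) at index `obj` along the axis given by `axis`
`np.unique(arr, return_index=False, return_inverse=False, return_counts=False)`	Remove duplicate elements; the result is sorted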
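
The sketch below runs each of these functions on a small made-up matrix `m`. All of them return new arrays and leave `m` unchanged.

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])

appended = np.append(m, [[5, 6]], axis=0)     # new last row
flattened = np.append(m, [5, 6])              # axis=None: result is 1-D
inserted = np.insert(m, 1, [9, 9], axis=0)    # new row before row 1
deleted = np.delete(m, 0, axis=1)             # first column removed

values, counts = np.unique([3, 1, 3, 2, 1, 3], return_counts=True)

print(appended.tolist(), flattened.tolist())
print(inserted.tolist(), deleted.tolist())
print(values.tolist(), counts.tolist())
```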

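Computation Functions

Method	Description
`np.around(a, decimals)`	Round to `decimals` decimal places; a negative value rounds to tens, hundreds, … Values exactly halfway are rounded to the nearest even value, so this is not schoolbook round-half-up
`np.floor(a)`	Floor (round down)
`np.ceil(a)`	Ceiling (round up)

`np.reciprocal(a)`	Reciprocal
`np.power(a, b)`	Power
`np.mod(a, b)`	Remainder

`np.amin(a)` `np.nanmin(a)`	Minimum; the `nan*` variants ignore any NaNs
`np.amax(a)` `np.nanmax(a)`
`np.sum(a)` `np.nansum(a)`
`np.percentile(a, q)` `np.nanpercentile(a, q)`	q is between 0 and 100
`np.quantile(a, q)` `np.nanquantile(a, q)`	q is between 0 and 1
`np.median(a)` `np.nanmedian(a)`
`np.mean(a)` `np.nanmean(a)`
`np.average(a, weights=arr)`	Weighted average with the given weights
`np.std(a)` `np.nanstd(a)`
`np.var(a)` `np.nanvar(a)`

`np.dot(a, b)`	Matrix multiplication for 2-D arrays, giving a matrix (for two 1-D arrays it gives the inner product)
`np.inner(a, b)`	Inner product of two vectors (multiply element-wise, then sum), giving a single number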
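
A few of these functions have behaviour that is worth checking by hand: the half-to-even rounding of `np.around`, the NaN-ignoring variants, and the difference between `np.dot` and `np.inner`. The inputs below are made up for the example.

```python
import numpy as np

halves = np.around([0.5, 1.5, 2.5])      # halves go to the nearest even value
hundreds = np.around(1234.567, -2)       # negative decimals: round to hundreds

data = np.array([1.0, np.nan, 3.0])
plain_mean = np.mean(data)               # NaN propagates
nan_mean = np.nanmean(data)              # NaN ignored

remainder = np.mod(7, 3)

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
matrix_product = np.dot(A, B)            # 2-D inputs: matrix multiplication
scalar = np.inner([1, 2, 3], [4, 5, 6])  # 1*4 + 2*5 + 3*6

print(halves, hundreds, plain_mean, nan_mean, remainder)
print(matrix_product.tolist(), scalar)
```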

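Sorting and Index Lookup

Method	Description
`np.sort(a)`	Sorted copy of a
`np.argsort(a)`	Indices into a that would sort it in ascending order
`np.argmax(a)`	Index of the largest element of a
`np.argmin(a)`	Index of the smallest element of a
`np.nonzero(a)`	Indices of the non-zero elements of a
`np.where(condition)`	Same as `np.asarray(condition).nonzero()`; gives the indices where the condition holds (a appears inside the condition)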
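
These lookups all return indices rather than values, which the sketch below illustrates on a made-up 1-D array `a`.

```python
import numpy as np

a = np.array([30, 10, 0, 20])

order = np.argsort(a)          # positions of the elements in ascending order
sorted_a = a[order]            # same values as np.sort(a)

largest_at = np.argmax(a)
smallest_at = np.argmin(a)

nonzero_idx = np.nonzero(a)    # tuple with one index array per dimension
above_15 = np.where(a > 15)    # same tuple format as nonzero

print(order, sorted_a, largest_at, smallest_at)
print(nonzero_idx[0], above_15[0], a[above_15])
```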

ChenyuShuxin

晨雨舒心
