Essential Pandas Skills: Working with Time Series Data

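Time series data is data collected at different points in time. It is recorded in chronological order and describes how a phenomenon changes over time.

Time series analysis is widely used in econometric models: by finding the patterns a phenomenon has followed in historical data, it forecasts the future.

Time series data is the foundation of time series analysis, so knowing how to handle it well is essential. The Pandas library in Python offers powerful tools for working with time series data, and this article introduces several of the most commonly used ones.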

[Tools]

Python 3
Tushare

01. Converting Time Formats

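Sometimes the raw data we obtain is not indexed by a time type, so we first need to convert the time format to prepare for later operations and analysis.

Two methods are introduced here. The first is to set the parse_dates and index_col parameters when importing a file with pandas.read_csv, which converts the date column directly and sets it as the index. For a detailed explanation of the parameters, see the documentation【1】.

In the example below, before these parameters are set, the index of the dataset is the integers 0-208, and the 'date' column is not a date type either (its dtype is object).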

In [8]: data = pd.read_csv('unemployment.csv')
In [9]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 2 columns):
date 209 non-null object
UNRATE 209 non-null float64
dtypes: float64(1), object(1)
memory usage: 3.3+ KB

Setting parse_dates=['date'] converts the column to a date type, and setting index_col='date' uses that column as the index. The result is as follows.

In [11]: data = pd.read_csv('unemployment.csv', parse_dates=['date'], index_col='date')
In [12]: data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 209 entries, 2000-01-01 to 2017-05-01
Data columns (total 1 columns):
UNRATE 209 non-null float64
dtypes: float64(1)
memory usage: 13.3 KB

The index now runs over the dates '2000-01-01' to '2017-05-01', and it is a DatetimeIndex, that is, a datetime type.
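The effect of the two parameters can also be checked without the original file. The following is a minimal sketch, assuming a small in-memory CSV as a stand-in for unemployment.csv; the three UNRATE values are made-up placeholders, not the real data.

```python
import io
import pandas as pd

# Hypothetical stand-in for unemployment.csv; the UNRATE values are placeholders.
csv_text = """date,UNRATE
2000-01-01,4.0
2000-02-01,4.1
2000-03-01,4.0
"""

# Without the parameters: default integer index, 'date' is read as plain text.
raw = pd.read_csv(io.StringIO(csv_text))

# With the parameters: 'date' is parsed and becomes a DatetimeIndex.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], index_col='date')

print(raw.index)               # a RangeIndex
print(type(parsed.index))      # a DatetimeIndex
print(parsed.loc['2000-02'])   # partial-string indexing selects the rows for February 2000
```

Once the index is a DatetimeIndex, rows can be selected with partial date strings such as '2000-02', which is one practical reason to do the conversion at load time.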

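The second method applies when the data has already been imported: use pd.to_datetime()【2】 to convert the column to a date type, then use df.set_index()【3】 to set it as the index.

Taking the daily bar data from tushare.pro as an example, we convert the 'trade_date' column to a date type and set it as the index.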

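import tushare as ts
import pandas as pd
pd.set_option('expand_frame_repr', False) # do not wrap the output when there are many columns
pro = ts.pro_api() # assumes a Tushare Pro token has already been configured
df = pro.daily(ts_code='000001.SZ', start_date='20180701', end_date='20180718')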
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 11 columns):
ts_code 13 non-null object
trade_date 13 non-null object
open 13 non-null float64
high 13 non-null float64
low 13 non-null float64
close 13 non-null float64
pre_close 13 non-null float64
change 13 non-null float64
pct_chg 13 non-null float64
vol 13 non-null float64
amount 13 non-null float64
dtypes: float64(9), object(2)
memory usage: 1.2+ KB
None
df['trade_date'] = pd.to_datetime(df['trade_date'])
df.set_index('trade_date', inplace=True)
df.sort_index(ascending=True, inplace=True) # sort by date in ascending order
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 13 entries, 2018-07-02 to 2018-07-18
Data columns (total 10 columns):
ts_code 13 non-null object
open 13 non-null float64
high 13 non-null float64
low 13 non-null float64
close 13 non-null float64
pre_close 13 non-null float64
change 13 non-null float64
pct_chg 13 non-null float64
vol 13 non-null float64
amount 13 non-null float64
dtypes: float64(9), object(1)
memory usage: 1.1+ KB

Printing the first 5 rows gives the following.

df.head()
Out[15]:
              ts_code  open  high   low  close  pre_close  change  pct_chg         vol       amount
trade_date
2018-07-02  000001.SZ  9.05  9.05  8.55   8.61       9.09   -0.48    -5.28  1315520.13  1158545.868
2018-07-03  000001.SZ  8.69  8.70  8.45   8.67       8.61    0.06     0.70  1274838.57  1096657.033
2018-07-04  000001.SZ  8.63  8.75  8.61   8.61       8.67   -0.06    -0.69   711153.37   617278.559
2018-07-05  000001.SZ  8.62  8.73  8.55   8.60       8.61   -0.01    -0.12   835768.77   722169.579
2018-07-06  000001.SZ  8.61  8.78  8.45   8.66       8.60    0.06     0.70   988282.69   852071.526
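The second method can also be tried without a Tushare account. The sketch below assumes a hand-built frame (the name df_demo is hypothetical) holding three trade_date/close pairs copied from the output above. It deliberately lists them newest first, which is an assumption made only so that the sort has something to do, and it passes an explicit format string to pd.to_datetime so the 'YYYYMMDD' strings are parsed unambiguously.

```python
import pandas as pd

# Three rows copied from the output above, listed newest first on purpose.
df_demo = pd.DataFrame({
    'trade_date': ['20180704', '20180703', '20180702'],
    'close': [8.61, 8.67, 8.61],
})

# Parse the 'YYYYMMDD' strings with an explicit format.
df_demo['trade_date'] = pd.to_datetime(df_demo['trade_date'], format='%Y%m%d')

# Use the dates as the index and sort them in ascending order.
df_demo = df_demo.set_index('trade_date').sort_index()

print(df_demo)
```

Sorting with sort_index() after set_index() keeps the intent explicit: the rows are ordered by the index itself, not by a column that no longer exists.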

02. Converting Time Frequencies

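Once the time format has been converted, we can move on to date operations. The following shows how to resample time series data.

Resampling is the process of converting a time series from one frequency to another. Aggregating higher-frequency data to a lower frequency is called downsampling, for example converting a stock's daily bars to weekly bars. Converting lower-frequency data to a higher frequency is called upsampling, for example converting a stock's weekly bars to daily bars.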
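To make the two directions concrete, here is a minimal sketch on a synthetic series (the values 1 to 14 over 14 calendar days are illustrative only, not market data). It downsamples with resample('W').sum() and then upsamples the weekly result back to daily frequency with a forward fill; the choice of sum and forward fill is an assumption for this example, and other aggregations or fill methods are equally valid.

```python
import pandas as pd

# Synthetic daily series: 14 consecutive days starting on Monday 2018-07-02.
idx = pd.date_range('2018-07-02', periods=14, freq='D')
daily = pd.Series(range(1, 15), index=idx)

# Downsampling: daily -> weekly, each week (ending on Sunday) is summed.
weekly = daily.resample('W').sum()
print(weekly)

# Upsampling: weekly -> daily, gaps are filled with the last known weekly value.
daily_again = weekly.resample('D').ffill()
print(daily_again)
```

Note that upsampling does not recover the original daily values: the information lost during aggregation has to be filled in by some rule, here by repeating the previous weekly figure.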

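Downsampling: take converting daily bars to weekly bars as an example. We continue with the tushare.pro daily data from above and select a few specific columns.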

df = df[['ts_code', 'open', 'high', 'low', 'close', 'vol']] # vol is the trading volume, in lots
              ts_code  open  high   low  close         vol
trade_date
2018-07-02  000001.SZ  9.05  9.05  8.55   8.61  1315520.13
2018-07-03  000001.SZ  8.69  8.70  8.45   8.67  1274838.57
2018-07-04  000001.SZ  8.63  8.75  8.61   8.61   711153.37
2018-07-05  000001.SZ  8.62  8.73  8.55   8.60   835768.77
2018-07-06  000001.SZ  8.61  8.78  8.45   8.66   988282.69
2018-07-09  000001.SZ  8.69  9.03  8.68   9.03  1409954.60
2018-07-10  000001.SZ  9.02  9.02  8.89   8.98   896862.02
2018-07-11  000001.SZ  8.76  8.83  8.68   8.78   851296.70
2018-07-12  000001.SZ  8.60  8.97  8.58   8.88  1140492.31
2018-07-13  000001.SZ  8.92  8.94  8.82   8.88   603378.21
2018-07-16  000001.SZ  8.85  8.90  8.69   8.73   689845.58
2018-07-17  000001.SZ  8.74  8.75  8.66   8.72   375356.33
2018-07-18  000001.SZ  8.75  8.85  8.69   8.70   525152.77

To make this easier to follow, the calendar for this period is attached below; '2018-07-02' happens to be a Monday.

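The idea behind the conversion is to aggregate by calendar week, for example '2018-07-02' to '2018-07-08'. Within each week, the first daily open becomes the weekly open, the maximum daily high becomes the weekly high, the minimum daily low becomes the weekly low, the last daily close becomes the weekly close, and the sum of the daily volume becomes the weekly volume (in lots), as shown by the yellow boxes in the figure below.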
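The aggregation rules just described map naturally onto resample('W') combined with a per-column agg dictionary. The following is a sketch under that assumption, not necessarily the exact code the article goes on to use. It rebuilds the 13 daily rows shown above by hand (the names daily and weekly are illustrative) so that it runs without a Tushare token.

```python
import pandas as pd

# The 13 trading days and their values, copied from the daily table above.
dates = pd.to_datetime([
    '2018-07-02', '2018-07-03', '2018-07-04', '2018-07-05', '2018-07-06',
    '2018-07-09', '2018-07-10', '2018-07-11', '2018-07-12', '2018-07-13',
    '2018-07-16', '2018-07-17', '2018-07-18',
])
daily = pd.DataFrame({
    'open':  [9.05, 8.69, 8.63, 8.62, 8.61, 8.69, 9.02, 8.76, 8.60, 8.92, 8.85, 8.74, 8.75],
    'high':  [9.05, 8.70, 8.75, 8.73, 8.78, 9.03, 9.02, 8.83, 8.97, 8.94, 8.90, 8.75, 8.85],
    'low':   [8.55, 8.45, 8.61, 8.55, 8.45, 8.68, 8.89, 8.68, 8.58, 8.82, 8.69, 8.66, 8.69],
    'close': [8.61, 8.67, 8.61, 8.60, 8.66, 9.03, 8.98, 8.78, 8.88, 8.88, 8.73, 8.72, 8.70],
    'vol':   [1315520.13, 1274838.57, 711153.37, 835768.77, 988282.69,
              1409954.60, 896862.02, 851296.70, 1140492.31, 603378.21,
              689845.58, 375356.33, 525152.77],
}, index=dates)

# 'W' groups by calendar week ending on Sunday and labels each week with that Sunday.
weekly = daily.resample('W').agg({
    'open': 'first',   # first daily open of the week
    'high': 'max',     # highest daily high
    'low': 'min',      # lowest daily low
    'close': 'last',   # last daily close of the week
    'vol': 'sum',      # total volume, in lots
})
print(weekly)
```

With the default 'W' rule, the three resulting rows are labeled with the Sundays 2018-07-08, 2018-07-15 and 2018-07-22, so the last label falls after the final trading day in the sample (2018-07-18).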