我有许多 (4000+) 个 CSV 股票数据(日期、开盘价、最高价、最低价、收盘价),我将它们导入到各个 Pandas 数据框中以执行分析。我是 python 的新手,想为每只股票计算 12 个月的滚动 beta,我找到了一个计算滚动 beta (Python pandas calculate rolling stock beta using rolling apply to groupby object in vectorized fashion) 的帖子,但是在我下面的代码中使用时需要超过 2.5 小时!考虑到我可以在 3 分钟内在 SQL 表中运行完全相同的计算,这太慢了。
如何提高下面代码的性能以匹配 SQL 的性能?我知道 Pandas/python 有这种能力。我当前的方法循环遍历每一行,我知道这会降低性能,但我不知道有任何聚合方法可以在数据帧上执行滚动窗口 beta 计算。
注意:将 CSV 加载到单个数据框中并计算每日返回的前 2 个步骤仅需约 20 秒。我所有的 CSV 数据帧都存储在名为“FilesLoaded”的字典中,名称如“XAO”。
非常感谢您的帮助! 谢谢:)
import pandas as pd, numpy as np
import datetime
import ntpath
pd.set_option('precision',10) #Set the Decimal Point precision to DISPLAY
start_time=datetime.datetime.now()
MarketIndex = 'XAO'
period = 250
MinBetaPeriod = period
# ***********************************************************************************************
# CALC RETURNS
# ***********************************************************************************************
for File in FilesLoaded:
FilesLoaded[File]['Return'] = FilesLoaded[File]['Close'].pct_change()
# ***********************************************************************************************
# CALC BETA
# ***********************************************************************************************
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
#Build Custom "Rolling_Apply" function
def rolling_apply(df, period, func, min_periods=None):
if min_periods is None:
min_periods = period
result = pd.Series(np.nan, index=df.index)
for i in range(1, len(df)+1):
sub_df = df.iloc[max(i-period, 0):i,:]
if len(sub_df) >= min_periods:
idx = sub_df.index[-1]
result[idx] = func(sub_df)
return result
#Create empty BETA dataframe with same index as RETURNS dataframe
df_join = pd.DataFrame(index=FilesLoaded[MarketIndex].index)
df_join['market'] = FilesLoaded[MarketIndex]['Return']
df_join['stock'] = np.nan
for File in FilesLoaded:
df_join['stock'].update(FilesLoaded[File]['Return'])
df_join = df_join.replace(np.inf, np.nan) #get rid of infinite values "inf" (SQL won't take "Inf")
df_join = df_join.replace(-np.inf, np.nan)#get rid of infinite values "inf" (SQL won't take "Inf")
df_join = df_join.fillna(0) #get rid of the NaNs in the return data
FilesLoaded[File]['Beta'] = rolling_apply(df_join[['market','stock']], period, calc_beta, min_periods = MinBetaPeriod)
# ***********************************************************************************************
# CLEAN-UP
# ***********************************************************************************************
print('Run-time: {0}'.format(datetime.datetime.now() - start_time))
最佳答案
生成随机股票数据
4,000 只股票的 20 年月度数据
dates = pd.date_range('1995-12-31', periods=480, freq='M', name='Date')
stoks = pd.Index(['s{:04d}'.format(i) for i in range(4000)])
df = pd.DataFrame(np.random.rand(480, 4000), dates, stoks)
df.iloc[:5, :5]
滚动功能
返回准备应用自定义函数的 groupby 对象
参见 Source
def roll(df, w):
# stack df.values w-times shifted once at each stack
roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
# roll_array is now a 3-D array and can be read into
# a pandas panel object
panel = pd.Panel(roll_array,
items=df.index[w-1:],
major_axis=df.columns,
minor_axis=pd.Index(range(w), name='roll'))
# convert to dataframe and pivot + groupby
# is now ready for any action normally performed
# on a groupby object
return panel.to_frame().unstack().T.groupby(level=0)
Beta 函数
使用OLS回归的封闭形式解
假设第 0 列是市场
参见 Source
def beta(df):
# first column is the market
X = df.values[:, [0]]
# prepend a column of ones for the intercept
X = np.concatenate([np.ones_like(X), X], axis=1)
# matrix algebra
b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values[:, 1:])
return pd.Series(b[1], df.columns[1:], name='Beta')
示范
rdf = roll(df, 12)
betas = rdf.apply(beta)
时间
验证
与OP比较计算
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
print(calc_beta(df.iloc[:12, :2]))
-0.311757542437
print(beta(df.iloc[:12, :2]))
s0001 -0.311758
Name: Beta, dtype: float64
注意第一个单元格
与上面经过验证的计算值相同
betas = rdf.apply(beta)
betas.iloc[:5, :5]
回复评论
具有模拟多个数据帧的完整工作示例
num_sec_dfs = 4000
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(480, 4), dates, cols) for i in range(num_sec_dfs)}
market = pd.Series(np.random.rand(480), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(1)
betas = roll(df.pct_change().dropna(), 12).apply(beta)
for c, col in betas.iteritems():
dfs[c]['Beta'] = col
dfs['s0001'].head(20)
关于python - 许多数据帧上的高效 Python Pandas Stock Beta 计算,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/39501277/