python - Efficient way to iterate over a large DataFrame

I have a csv file containing several thousand records of company stock data. It contains the following integer fields:

low_price, high_price, volume_traded
10, 20, 45667
15, 22, 256565
41, 47, 45645
30, 39, 547343

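My requirement is to create a new csv file from this data by accumulating the volume traded at every price level (from low to high). The final result will have only two columns, as follows: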

price, total_volume_traded
10, 45667
11, 45667
12, 45667
....
....
15, 302232
etc

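In other words, the final csv contains one record for every price level (not just the high/low prices, but also the prices in between), together with the total volume_traded at that price level.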

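I have this working, but it is very slow and inefficient. I am sure there must be a better way to accomplish this.

Basically, what I did is use nested loops:

First, iterate over every row.
On each row, create a nested loop to iterate over the price range from low_price to high_price.
Check whether the price already exists in the new dataframe. If it does, add the current volume_traded to it. If it does not, append the price and volume (i.e. create a new row).

Below is some of the relevant code. I would appreciate it if someone could suggest a better approach in terms of efficiency/speed: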

import pandas as pd

df_existing = ...  # DataFrame created from the existing csv
df_new = ...       # DataFrame for the new Price/Volume values

def accumulate_volume(df_new, price, volume):
    # If the price level already exists, add the volume to the existing row
    mask = df_new['Price'] == price
    if mask.any():
        df_new.loc[mask, 'Volume'] += volume
        return df_new
    else:
        # First occurrence of the price level: add a new row
        # (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
        tmp = {'Price': int(price), 'Volume': volume}
        return df_new.append(tmp, ignore_index=True)

for index, row in df_existing.iterrows():
    for price in range(row['low_price'], row['high_price'] + 1):
        df_new = accumulate_volume(df_new, price, row['volume_traded'])

# Once the above finishes, df_new is written to the new csv file

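My guess as to why this is so slow is, at least in part, that "append" creates a new object every time it is called, and it is called a lot. In total, the nested loop in the code above runs 1595653 times.

I would be very grateful for any help.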

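Best answer

Let's set aside for a moment the potential problems with the methodology (think about what the result would look like if 100k shares traded at a price of 50-51 and another 100k shares traded at 50-59).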
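
To make that caveat concrete, here is a small sketch using the hypothetical numbers from the sentence above (two trades of 100,000 shares each, one spanning 50-51 and one spanning 50-59). The `rows` list and variable names are illustrative only. Because every price level in a range is credited with the row's full volume, the per-level totals add up to far more than the number of shares actually traded:

```python
from collections import Counter

# Hypothetical rows for illustration: (low, high, volume)
rows = [(50, 51, 100_000), (50, 59, 100_000)]

totals = Counter()
for low, high, volume in rows:
    for price in range(low, high + 1):
        totals[price] += volume  # the full volume is credited to every level

actual_volume = sum(volume for _, _, volume in rows)
credited_volume = sum(totals.values())

print(totals[50], totals[51], totals[52])  # 200000 200000 100000
print(actual_volume, credited_volume)      # 200000 1200000
```

Only 200,000 shares changed hands, yet 1,200,000 are credited across the price levels, and the wider range contributes five times as much as the narrow one.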

Here is a set of annotated steps that accomplishes your goal:

import pandas as pd

# Initialize DataFrame.
df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

# Initialize a price dictionary spanning the range of potential prices.
d = {price: 0 for price in range(min(df.low), max(df.high) + 1)}

# Create helper function to add volume to given price bucket.
def add_volume(price_dict, price, volume):
    price_dict[price] += volume

# Use nested loops to call the function and populate the dictionary.
for low, high, volume in zip(df.low, df.high, df.volume):
    for price in range(low, high + 1):
        add_volume(d, price, volume)

# Convert the dictionary to a DataFrame and output to csv.
idx = pd.Index(list(d.keys()), name='price')
df = pd.DataFrame(list(d.values()), index=idx, columns=['total_volume_traded'])
df.to_csv('output.csv')

>>> df
       total_volume_traded
price                     
10                   45667
11                   45667
12                   45667
13                   45667
14                   45667
15                  302232
16                  302232
17                  302232
18                  302232
19                  302232
20                  302232
21                  256565
22                  256565
23                       0
24                       0
25                       0
26                       0
27                       0
28                       0
29                       0
30                  547343
31                  547343
32                  547343
33                  547343
34                  547343
35                  547343
36                  547343
37                  547343
38                  547343
39                  547343
40                       0
41                   45645
42                   45645
43                   45645
44                   45645
45                   45645
46                   45645
47                   45645
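
As a possible alternative that is not part of the original answer, the same totals can be computed without visiting every price inside every range. The sketch below assumes integer prices, as in the question, and uses a difference array: add the volume at the index where a range starts, subtract it just past the index where the range ends, then take a running sum. The names `lo`, `hi`, `diff`, and `result` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

lo = int(df['low'].min())
hi = int(df['high'].max())

# Difference array: +volume where a range starts, -volume just past its end.
# One extra slot holds the "end + 1" marker of ranges that end at hi.
diff = np.zeros(hi - lo + 2, dtype=np.int64)
# ufunc.at accumulates correctly even when several rows share an index.
np.add.at(diff, df['low'].to_numpy() - lo, df['volume'].to_numpy())
np.subtract.at(diff, df['high'].to_numpy() - lo + 1, df['volume'].to_numpy())

# The running sum turns the start/end markers into per-price totals.
totals = np.cumsum(diff)[:-1]

result = pd.DataFrame({'total_volume_traded': totals},
                      index=pd.Index(np.arange(lo, hi + 1), name='price'))
```

With this approach the work grows with the number of rows plus the number of distinct price levels, rather than with the sum of all range widths, which may matter when ranges are wide.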

The original question, "python - Efficient way to iterate over a large DataFrame", can be found on Stack Overflow: https://stackoverflow.com/questions/29217682/
