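My question is only about performance, not semantics.
Does adding a new column to df cause the data already in the DataFrame to be physically copied to a new memory location (for example, to ensure that the DataFrame occupies contiguous memory)?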
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(100)})
b = pd.Series(range(100))
df['b'] = b # is this operation expensive?
# equivalently df.loc[:, 'b'] = b
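I know (from experiments; I could not find this in the documentation) that df['b'] = b semantically creates a copy of b, which obviously requires copying the underlying data. But I have no idea whether the data in the other columns can stay where it was or sometimes needs to be moved.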
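The copy semantics mentioned above can be checked with a small experiment. This is only a sketch with a tiny frame; it shows the observable behavior (changing b afterwards does not change the column in df) and says nothing about when pandas physically copies the buffer, which may differ between versions.

```python
import pandas as pd

df = pd.DataFrame({'a': range(5)})
b = pd.Series(range(5))
df['b'] = b

# Mutate the original Series after the assignment.
b.iloc[0] = -1

# The column stored in the DataFrame keeps its old value,
# i.e. the assignment behaved like a copy of b.
first_in_df = int(df['b'].iloc[0])
first_in_b = int(b.iloc[0])
print(first_in_df, first_in_b)
```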
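Edit:
I know that adding a large number of columns is expensive. I am only asking about adding a single column.
I also know that adding a row requires copying data in some cases (or always? I am not sure); one obvious reason is that the items of a single column must sit in contiguous memory.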
Best answer
Based on my experiments, I think loc is slower, and aligning a new Series with a different index is the slowest:
But I have no idea if the data in the other columns can stay where it was, or need to be moved sometimes.
I think the data is not moved and the new column is simply appended at the end (there may be some exceptions here, but I am not aware of any).
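One way to probe this yourself is to compare the memory behind an existing column before and after the assignment. The sketch below is an experiment, not a guarantee: the result may depend on the pandas version, and as an assumption on my part, pandas may still copy same-dtype columns later if it consolidates its internal blocks. to_numpy needs pandas 0.24 or newer; on older versions .values plays the same role.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(100)})

# View of column 'a' before the new column is added.
a_before = df['a'].to_numpy()

df['b'] = pd.Series(range(100))

# View of column 'a' after the new column is added.
a_after = df['a'].to_numpy()

# True means the two arrays overlap in memory, so 'a' was not relocated.
a_stayed_in_place = bool(np.shares_memory(a_before, a_after))
print('column a kept its buffer:', a_stayed_in_place)
```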
# using pandas 0.18.1, python 3.5
import pandas as pd
# len(df) is 10 million
df = pd.DataFrame({'a': range(10000000)})
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print(df)
In [36]: %timeit df['b'] = b
10 loops, best of 3: 23.5 ms per loop
In [37]: %timeit df.loc[:, 'c'] = b
The slowest run took 5.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 40 ms per loop
In [38]: %timeit df['d'] = c
10 loops, best of 3: 22.3 ms per loop
In [39]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 39.5 ms per loop
But if the index is changed:
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(10000000)})
df.index = df.index + 15
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print(df)
In [41]: %timeit df['b'] = b
1 loop, best of 3: 656 ms per loop
In [42]: %timeit df.loc[:, 'c'] = b
1 loop, best of 3: 735 ms per loop
In [43]: %timeit df['d'] = c
10 loops, best of 3: 22.4 ms per loop
In [44]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 56.6 ms per loop
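A plausible reason for the jump from about 23 ms to about 656 ms is index alignment: when the Series has a different index from the DataFrame, the assignment has to match labels instead of copying values position by position. The small sketch below shows the effect of alignment on the result; it does not measure timing. The column name b_pos is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': range(20)})
df.index = df.index + 15          # labels 15..34
b = pd.Series(range(20))          # labels 0..19

# Assigning a Series aligns on index labels, not on position.
df['b'] = b
n_missing = int(df['b'].isna().sum())   # labels 20..34 have no match in b

# Assigning a plain NumPy array skips alignment and is purely positional.
df['b_pos'] = b.to_numpy()
print(n_missing, df['b_pos'].tolist()[:3])
```

If positional assignment is what you want, passing the underlying array avoids the alignment step altogether.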
Adding a single new row is fast; I think the cost depends on the length of the Series:
In [68]: %timeit df.loc[10000015, :] = pd.Series([1,2,3,2,4], index=df.columns)
1000 loops, best of 3: 274 µs per loop
But adding many rows one at a time is expensive, and I think this can be avoided.
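One common way to avoid the repeated cost is to collect the new rows first and attach them in a single step. This is a minimal sketch with made-up data (new_rows is hypothetical), using pd.concat rather than row-by-row enlargement with df.loc.

```python
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5)})

# Hypothetical new rows, collected as plain dicts first.
new_rows = [{'a': 100 + i, 'b': 200 + i} for i in range(3)]

# Build one DataFrame from all new rows and concatenate once,
# instead of enlarging df row by row with df.loc[label] = ...
extra = pd.DataFrame(new_rows)
result = pd.concat([df, extra], ignore_index=True)
print(len(result))
```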
Original question on Stack Overflow, "python - Does adding a column to a DataFrame involve copying data?": https://stackoverflow.com/questions/37914575/