pandas - What is "roundtripping" in Pandas?

The documentation for pandas.read_excel mentions something called "roundtripping" in the description of the index_col parameter.

Missing values will be forward filled to allow roundtripping with to_excel for merged_cells=True.

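I had never heard this term before, and when I search for a definition I can only find one in a financial context. I have seen it referenced in the context of merging DataFrames in Pandas, but I have not found a definition there either.

For context, here is the full description of the index_col parameter: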

index_col : int, list of int, default None

Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.

Missing values will be forward filled to allow roundtripping with to_excel for merged_cells=True. To avoid forward filling the missing values use set_index after reading the data instead of index_col.

Best answer

For a general idea of what roundtripping means, see the answers to this post on SE. Applied to your example, "allow roundtripping" means:

facilitate an easy back-and-forth between the data in an Excel file and the same data in a df. I.e. while maintaining the intended structure throughout.

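Roundtripping example

The usefulness of this idea is perhaps best seen if we start with a somewhat complex df in which both the index and the columns are named MultiIndexes (for the constructor, see pd.MultiIndex.from_product):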

import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.rand(4,4), 
                  columns=pd.MultiIndex.from_product([['A','B'],[1,2]],
                                                     names=['col_0','col_1']), 
                  index=pd.MultiIndex.from_product([[0,1],[1,2]], 
                                                   names=['idx_0','idx_1']))

print(df)

col_0               A                   B          
col_1               1         2         1         2
idx_0 idx_1                                        
0     1      0.952749  0.447125  0.846409  0.699479
      2      0.297437  0.813798  0.396506  0.881103
1     1      0.581273  0.881735  0.692532  0.725254
      2      0.501324  0.956084  0.643990  0.423855

If we now write this data to an Excel file using df.to_excel with the default value of merge_cells (i.e. True), we end up with the following:

df.to_excel('file.xlsx')

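Result: [screenshot of file.xlsx opened in Excel]

Aesthetics aside, the structure here is very clear, and it is indeed the same structure as in our df. Note the merged cells in particular.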

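Now suppose we want to retrieve this data from the Excel file again later, and we use pd.read_excel with its default parameters. Problematically, we end up with a mess:

# use a separate name so the original df is kept for the equality check further down
df_raw = pd.read_excel('file.xlsx')
print(df_raw)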

  Unnamed: 0  col_0         A  Unnamed: 3         B  Unnamed: 5
0        NaN  col_1  1.000000    2.000000  1.000000    2.000000
1      idx_0  idx_1       NaN         NaN       NaN         NaN
2          0      1  0.952749    0.447125  0.846409    0.699479
3        NaN      2  0.297437    0.813798  0.396506    0.881103
4          1      1  0.581273    0.881735  0.692532    0.725254
5        NaN      2  0.501324    0.956084  0.643990    0.423855
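
The NaN entries in the first column above sit where the merged idx_0 cells were: only the top cell of a merged range holds a value. As a rough illustration of the forward filling that the documentation mentions, here is a minimal sketch on a small hand-built frame. The frame raw and its value column are made up for illustration; they are not read from file.xlsx, and this is not pandas' internal implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for index columns read from a sheet with merged cells:
# idx_0 is only present on the first row of each merged block.
raw = pd.DataFrame({
    "idx_0": [0, np.nan, 1, np.nan],
    "idx_1": [1, 2, 1, 2],
    "value": [0.1, 0.2, 0.3, 0.4],
})

# Forward fill the gaps so every row carries its outer label again.
filled = raw.copy()
filled["idx_0"] = filled["idx_0"].ffill().astype(int)

# Rebuild the two-level row index from the repaired columns.
restored = filled.set_index(["idx_0", "idx_1"])
print(restored)
```

This only repairs the row labels; the two header rows in the messy output would still need separate handling.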

Getting this data back into shape by hand would be time-consuming. To avoid that trouble, we can rely on the index_col and header parameters of pd.read_excel:

df2 = pd.read_excel('file.xlsx', index_col=[0,1], header=[0,1])

print(df2)

col_0               A                   B          
col_1               1         2         1         2
idx_0 idx_1                                        
0     1      0.952749  0.447125  0.846409  0.699479
      2      0.297437  0.813798  0.396506  0.881103
1     1      0.581273  0.881735  0.692532  0.725254
      2      0.501324  0.956084  0.643990  0.423855

# check for equality
df.equals(df2)
# True

As you can see, we have made a "round trip" here, and index_col and header made it smooth sailing!

Two final notes:
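
The documentation quoted in the question also names set_index as the way to avoid the forward filling. A minimal sketch of that difference, again on a hand-built frame (raw and its columns are hypothetical and stand in for a sheet read without index_col):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a sheet read WITHOUT index_col:
# the gaps left by merged cells are still NaN in the idx_0 column.
raw = pd.DataFrame({
    "idx_0": [0, np.nan, 1, np.nan],
    "idx_1": [1, 2, 1, 2],
    "value": [0.1, 0.2, 0.3, 0.4],
})

# set_index does not forward fill, so the gaps stay in the index as NaN.
unfilled = raw.set_index(["idx_0", "idx_1"])
n_missing = int(unfilled.index.get_level_values("idx_0").isna().sum())
print(n_missing)
```

This is the behavior you would want when an empty cell in the index column really means "missing" rather than "same as the cell above".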

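(Minor) The docs for pd.read_excel contain a typo in the index_col section: it should read merge_cells=True, not merged_cells=True.
The header section lacks a similar note (or a reference to the note under index_col). This is a bit confusing. As we saw above, the two behave in exactly the same way (at least for present purposes).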

Regarding pandas - What is "roundtripping" in Pandas?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/74270027/
