python - Creating a column from a set replicates the set n times

I ran into this unexpected behavior while using Pandas. I'm not sure how to explain it, and I couldn't find a related question on SO.

When creating a DataFrame from a dict of lists, each element of the iterable is placed, as expected, in a new row of the column named by the given key:

pd.DataFrame({'a':[1,2,3]})

   a
0  1
1  2
2  3

However, trying the same thing with a set produces:

pd.DataFrame({'a':{1,2,3}})

       a
0  {1, 2, 3}
1  {1, 2, 3}
2  {1, 2, 3}

So the set appears to be replicated as many times as the number of elements it contains, i.e. 3.

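I know that using a set here doesn't make much sense, since sets are by definition unordered collections. Still, I couldn't find any reference to or explanation of this behavior. Is it specified somewhere in the documentation? Is there an obvious reason behind it that I'm missing?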

pd.__version__
# '1.0.0'

Best answer

The problem lies in extract_index, and to a lesser extent in sanitize_array. To give a full walkthrough:

import pandas as pd
from pandas.core.internals.construction import init_dict

#pd.DataFrame({'a':{1,2,3}})
data = {'a': {1,2,3}}
index = None
columns = None
dtype = None

Constructing from a dict goes through this block:

elif isinstance(data, dict):
    mgr = init_dict(data, index, columns, dtype=dtype)

You can see that the index is wrong:

BlockManager
Items: Index(['a'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 3, dtype: object

This happens because init_dict passes arrays=[{1, 2, 3}] to extract_index, and pandas considers a set to be list_like. As a result, it takes the length of the set as the length of your Index.

from pandas.core.dtypes.common import is_list_like

is_list_like({1,2,3})
#True
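
As a rough illustration of that check, the sketch below runs the public pandas.api.types.is_list_like on a few kinds of values. The helper implied_index_length is hypothetical and only mimics the idea described above (a list-like value contributes its length); it is not the actual extract_index code.

```python
from pandas.api.types import is_list_like

# A few candidate column values; only some of them count as list-like.
candidates = {
    "list": [1, 2, 3],
    "set": {1, 2, 3},
    "tuple": (1, 2, 3),
    "str": "abc",
    "int": 3,
}

results = {name: is_list_like(value) for name, value in candidates.items()}
for name, flag in results.items():
    print(f"{name:>5}: is_list_like={flag}")


def implied_index_length(values):
    """Hypothetical helper: the length a list-like value would suggest
    for the index, or None for anything treated as a scalar."""
    return len(values) if is_list_like(values) else None


print(implied_index_length({1, 2, 3}))
```

A set passes the check just like a list does, so its length can end up driving the index length even though its elements are never spread across rows.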

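The other issue comes from a difference in ndim: the np.array that ends up holding a set is not the same as the one holding a list. This is buried fairly deep in the pandas source.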

np.array({1,2,3}).ndim
#0

np.array([1,2,3]).ndim
#1
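
A quick way to see why the set ends up behaving like a single value is to look inside the 0-d array. This is a minimal sketch using only NumPy; the loop at the end fills a length-3 object array by hand to mimic the effect, and is an assumption for illustration rather than the code path pandas takes.

```python
import numpy as np

values = {1, 2, 3}

# NumPy cannot iterate a set into a 1-d array, so it wraps the whole
# object in a 0-dimensional array of dtype object.
as_array = np.array(values)
print(as_array.ndim, as_array.shape, as_array.dtype)

# .item() hands back the single wrapped Python object: the set itself.
wrapped = as_array.item()
print(wrapped is values)

# Mimic by hand what repeating one object over 3 index positions gives.
filled = np.empty(len(values), dtype=object)
for i in range(len(filled)):
    filled[i] = wrapped
print(filled)
```

Every slot of the filled array refers to the same set object, which matches the repeated rows seen in the question.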

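So the set is treated as a "scalar" and is broadcast across the whole RangeIndex determined above, becoming array([{1, 2, 3}, {1, 2, 3}, {1, 2, 3}], dtype=object), whereas the list stays as array([1, 2, 3]).
Since the trouble starts when the index is extracted, the simple fix is to specify the index yourself, so that none of this index inference happens.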

pd.DataFrame({'a': {1,2,3}}, index=[0])
#           a
#0  {1, 2, 3}
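
If the goal is simply to avoid the ambiguity, two alternatives that never hand a bare set to the constructor may be easier to reason about. This is a sketch, not part of the original answer, and the variable names are made up for illustration.

```python
import pandas as pd

values = {1, 2, 3}

# One row per element: convert the set to an ordered list first.
one_per_row = pd.DataFrame({"a": sorted(values)})
print(one_per_row)

# The whole set in a single cell: wrap it in a one-element list,
# so the column has length 1 and the set is just a cell value.
single_cell = pd.DataFrame({"a": [values]})
print(single_cell)
```

Sorting makes the row order explicit, which a set alone cannot provide; wrapping in a list makes the intended column length explicit without passing index.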

Regarding python - Creating a column from a set replicates the set n times, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60603112/
