python - Collapse rows of a Python data frame that share one column value

Tags: python pandas dataframe pivot

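I feel there must be a very direct way to do this, but I can't find it.

So, I have this data (note that several values in the description column share the same prefix before the colon):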

import pandas as pd

data = {"description": ["AAAA:A", "AAAA:B", "AAAA:C", "AAAA:D", "BBBB:A", "BBBB:B"],
        "sequence": ["AAAAAAAAAAA", "AAAAAAABBBBBB", "AAAAAAAACCCCCCC", "AAAAAAAADDDDDDD",
                     "BBBBBBAAAAA", "BBBBBBBBBBBBB"]}

df = pd.DataFrame(data)
print(df)

#  description         sequence
#0      AAAA:A      AAAAAAAAAAA
#1      AAAA:B    AAAAAAABBBBBB
#2      AAAA:C  AAAAAAAACCCCCCC
#3      AAAA:D  AAAAAAAADDDDDDD
#4      BBBB:A      BBBBBBAAAAA
#5      BBBB:B    BBBBBBBBBBBBB

My end goal is to put all sequences that share the same 4-letter description prefix on a single row, like this:

#  description   sequence_A     sequence_B       sequence_C       sequence_D
#0        AAAA  AAAAAAAAAAA  AAAAAAABBBBBB  AAAAAAAACCCCCCC  AAAAAAAADDDDDDD
#1        BBBB  BBBBBBAAAAA  BBBBBBBBBBBBB              NaN              NaN

So far, I have gotten to this point:

df = df.apply(lambda row: pd.Series({"description": row["description"].split(":")[0],
                                     "sequence_{}".format(row["description"].split(":")[1]): row["sequence"]}),
              axis=1)
print(df)

#  description   sequence_A     sequence_B       sequence_C       sequence_D
#0        AAAA  AAAAAAAAAAA            NaN              NaN              NaN
#1        AAAA          NaN  AAAAAAABBBBBB              NaN              NaN
#2        AAAA          NaN            NaN  AAAAAAAACCCCCCC              NaN
#3        AAAA          NaN            NaN              NaN  AAAAAAAADDDDDDD
#4        BBBB  BBBBBBAAAAA            NaN              NaN              NaN
#5        BBBB          NaN  BBBBBBBBBBBBB              NaN              NaN

I guess I need df.groupby("description") followed by one more step, but I'm missing that last bit.
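One way to finish the groupby idea, as a minimal sketch: GroupBy.first() takes the first non-null value of each column within a group, which collapses the sparse rows into one row per description. The names `rows`, `sparse` and `collapsed` are illustrative and not from the question, and the sparse frame is built from a list of dicts rather than with apply, which is an assumption made to keep the example short.

```python
import pandas as pd

data = {"description": ["AAAA:A", "AAAA:B", "AAAA:C", "AAAA:D", "BBBB:A", "BBBB:B"],
        "sequence": ["AAAAAAAAAAA", "AAAAAAABBBBBB", "AAAAAAAACCCCCCC", "AAAAAAAADDDDDDD",
                     "BBBBBBAAAAA", "BBBBBBBBBBBBB"]}
df = pd.DataFrame(data)

# Build the sparse intermediate frame: one "sequence_<suffix>" cell per row,
# every other sequence column is missing for that row.
rows = [
    {"description": desc.split(":")[0],
     "sequence_{}".format(desc.split(":")[1]): seq}
    for desc, seq in zip(df["description"], df["sequence"])
]
sparse = pd.DataFrame(rows)

# first() keeps the first non-null value per column within each group,
# so each description ends up on a single row.
collapsed = sparse.groupby("description", as_index=False).first()
print(collapsed)
```

This works here because each (description, suffix) pair occurs only once; with duplicates, first() would silently keep only the first value.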

Best answer

Split, then pivot:

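# Split "AAAA:A" into a group key (New1) and a suffix (New2)
df[['New1', 'New2']] = df.description.str.split(':', expand=True)
s = df[['New1', 'New2', 'sequence']]

# Keyword arguments are used because pandas 2.0+ no longer accepts positional ones in pivot
s.pivot(index='New1', columns='New2', values='sequence').add_prefix('sequence_')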

Out[863]: 
New2   sequence_A     sequence_B       sequence_C       sequence_D
New1                                                              
AAAA  AAAAAAAAAAA  AAAAAAABBBBBB  AAAAAAAACCCCCCC  AAAAAAAADDDDDDD
BBBB  BBBBBBAAAAA  BBBBBBBBBBBBB             None             None
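For reference, a self-contained sketch of the same split-then-pivot idea, extended to match the layout asked for in the question (description as an ordinary column, no axis names). The names `parts`, `tidy`, `suffix` and `wide` are illustrative and not from the answer.

```python
import pandas as pd

data = {"description": ["AAAA:A", "AAAA:B", "AAAA:C", "AAAA:D", "BBBB:A", "BBBB:B"],
        "sequence": ["AAAAAAAAAAA", "AAAAAAABBBBBB", "AAAAAAAACCCCCCC", "AAAAAAAADDDDDDD",
                     "BBBBBBAAAAA", "BBBBBBBBBBBBB"]}
df = pd.DataFrame(data)

# Split "AAAA:A" into the 4-letter key and the one-letter suffix.
parts = df["description"].str.split(":", expand=True)
tidy = pd.DataFrame({"description": parts[0],
                     "suffix": parts[1],
                     "sequence": df["sequence"]})

# One row per description, one column per suffix.
wide = tidy.pivot(index="description", columns="suffix", values="sequence")
wide = wide.add_prefix("sequence_")

# Drop the leftover "suffix" label on the column axis and turn the index
# back into a regular column.
wide.columns.name = None
wide = wide.reset_index()
print(wide)
```

pivot raises an error if a (description, suffix) pair appears more than once; in that case pivot_table with an explicit aggfunc is the usual alternative.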

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49391846/
