python - DataFrame column stores a list of dicts as a string: parse it and build a new DataFrame

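I have a huge DataFrame (2 million rows) in which one column holds the string representation of a list of dicts (the school history of each person). What I want to do is parse this data into a new DataFrame, because the relationship is one person to many schools.

My first attempt was to iterate over the DataFrame with itertuples(). Far too slow!

The first few rows look like this: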

list_of_dicts = {
    0: '[]',
    1: "[{'name': 'USA Health', 'subject': 'Residency, Internal Medicine, 2006 - 2009'}, {'name': 'Ross University School of Medicine', 'subject': 'Class of 2005'}]",
    2: "[{'name': 'Physicians Medical Center Carraway', 'subject': 'Residency, Surgery, 1957 - 1960'}, {'name': 'Physicians Medical Center Carraway', 'subject': 'Internship, Transitional Year, 1954 - 1955'}, {'name': 'University of Alabama School of Medicine', 'subject': 'Class of 1954'}]"
}

import pandas as pd

df_dict = pd.DataFrame.from_dict(list_of_dicts, orient='index', columns=['school_history'])

My idea was to write a function and then apply it to the DataFrame:

def parse_item(row):
    eval_dict = eval(row)[0]
    school_df = pd.DataFrame.from_dict(eval_dict, orient='index').T
    return school_df

df['column'].apply(lambda x: parse_item(x))

However, I can't figure out how to produce a DataFrame with more rows than the original one (since one person can have several schools).

From these 3 rows, the idea is to end up with a DataFrame of 5 rows (all of them coming from the 2 non-empty rows):

Best answer

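Iterate over the column and convert each string to a Python list with ast.literal_eval(). The result is a nested list, which can be flattened inside the same comprehension.

Note that converting the column to a list first (via tolist()) gives some performance gain.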

from ast import literal_eval
result = pd.DataFrame([x 
                       for row in df_dict['school_history'].tolist() 
                       for x in literal_eval(row)])
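
For reference, here is a self-contained sketch of the same comprehension run against the three sample rows from the question. The data and the name df_dict come from the question; the loop variable school is just an illustrative name.

```python
from ast import literal_eval

import pandas as pd

# Sample rows copied from the question.
list_of_dicts = {
    0: "[]",
    1: "[{'name': 'USA Health', 'subject': 'Residency, Internal Medicine, 2006 - 2009'}, "
       "{'name': 'Ross University School of Medicine', 'subject': 'Class of 2005'}]",
    2: "[{'name': 'Physicians Medical Center Carraway', 'subject': 'Residency, Surgery, 1957 - 1960'}, "
       "{'name': 'Physicians Medical Center Carraway', 'subject': 'Internship, Transitional Year, 1954 - 1955'}, "
       "{'name': 'University of Alabama School of Medicine', 'subject': 'Class of 1954'}]",
}
df_dict = pd.DataFrame.from_dict(list_of_dicts, orient="index", columns=["school_history"])

# Parse each string into a list of dicts and flatten everything in one pass.
# Rows holding '[]' contribute nothing to the output.
result = pd.DataFrame(
    [school
     for row in df_dict["school_history"].tolist()
     for school in literal_eval(row)]
)

print(result.shape)          # (5, 2)
print(list(result.columns))  # ['name', 'subject']
```

The output gets a fresh 0..4 index, so the link back to the original row is lost at this point.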

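To keep the original index, do not iterate over the column as a list. Iterate instead over the zip object returned by the items() method. It yields (index, value) tuples, so the index can be attached to the final output values.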

# literal_eval was imported above with `from ast import literal_eval`
ind, data = zip(*[(i, x)
                  for i, row in df_dict['school_history'].items()
                  for x in literal_eval(row)])
result = pd.DataFrame(data, index=ind)
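
As an alternative that also keeps the original index, the parsed lists can be expanded with Series.explode(). This is only a sketch and not part of the original answer. The three-row sample below is shortened, hypothetical data, and the names schools and exploded are illustrative.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical, shortened sample data in the same shape as the question's column.
df_dict = pd.DataFrame(
    {"school_history": [
        "[]",
        "[{'name': 'A', 'subject': 'Class of 2005'}, {'name': 'B', 'subject': 'Residency'}]",
        "[{'name': 'C', 'subject': 'Class of 1954'}]",
    ]}
)

# Parse each string, then give every dict its own row; the index label is repeated.
schools = df_dict["school_history"].map(literal_eval).explode()

# Rows that held an empty list become NaN after explode(); drop them.
schools = schools.dropna()

# Build the final frame from the dicts, reusing the repeated index.
exploded = pd.DataFrame(schools.tolist(), index=schools.index)

print(exploded.index.tolist())  # [1, 1, 2]
```

One difference worth noting: the zip(*...) unpacking raises a ValueError when every row is an empty list, because there is nothing to unpack, while this variant returns an empty DataFrame in that case. Whether it is faster on 2 million rows has not been measured here.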

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/73793168/
