python - How can I split each cell of a pandas DataFrame onto new rows, based on a character?

Tags: python python-3.x pandas

I have a large pandas DataFrame like this (this is the data):

In:

df = pd.read_csv('/Users/user/Desktop/example.csv', sep = '|')
df

Out:

    ColA   ColB
0   Lemons  NaN
1   Oranges https://www.example.com#fruitN : title: Click ...
2   Tomatos NaN

df['ColB'][1]

Out:

'https://www.example.com#fruitN : title: Click here to show   https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=200465 : title: Click to view ORANGES OK AND TOMATOES FRESH (ORANGES OK; TOMATOES FRESH) : ID  #200465 : 12 Pz : TRUE : COMPANY_5    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=203874 : title: Click to view ORANGES OK AND TOMATOES FRESH (ORANGES OK; TOMATOES FRESH) : ID  #203874 : 12 Pz : TRUE : COMPANY_1    https://www.example.com#fruitName : title: Click here to show   https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=076477 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #076477 : 12 Pz : TRUE : Company_7    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=077575 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #077575 : 12 Pz : TRUE : COMPANY_2    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=6538773 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #6538773 : 12 Pz : Discontinued : COMPANY_3    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=090548 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #090548 : 12 Pz : TRUE : COMPANY_4    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091226 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #091226 : 12 Pz : TRUE : COMPANY_5    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091624 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #091624 : 12 Pz : TRUE : COMPANY_6    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091650 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #091650 : 12 Pz : TRUE : COMPANY_1    '

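Every cell in ColB contains newline characters (\n). How can I expand each cell into rows, one row per newline-separated entry, without losing the ColA value each entry belongs to? Something like this: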

ColA    | ColB
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Lemons  | NaN
Oranges | https://www.example.com#fruitN : title: Click here to show
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=200465 : title: Click to view ORANGES OK AND TOMATOES FRESH (ORANGES OK; TOMATOES FRESH) : ID  #200465 : 12 Pz : TRUE : COMPANY_5    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=203874 : title: Click to view ORANGES OK AND TOMATOES FRESH (ORANGES OK; TOMATOES FRESH) : ID  #203874 : 12 Pz : TRUE : COMPANY_1    
Oranges | https://www.example.com#FruitName : title: Click here to show   
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=076477 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #076477 : 12 Pz : TRUE : Company_7    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=077575 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #077575 : 12 Pz : TRUE : COMPANY_2    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=6538773 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #6538773 : 12 Pz : Discontinued : COMPANY_3    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=090548 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #090548 : 12 Pz : TRUE : COMPANY_4    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091226 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #091226 : 12 Pz : TRUE : COMPANY_5    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091624 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #091624 : 12 Pz : TRUE : COMPANY_6    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091650 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #091650 : 12 Pz : TRUE : COMPANY_1
Tomatoes| NaN

I tried:

df2 = pd.DataFrame(df.ColA.tolist(), index=df.ColB).stack().reset_index(level=1, drop=True).reset_index(name='ColB')[['ColA','ColB']]

and:

df['ColB'] = df['ColB'].str.extract('\b\n\b', expand=True)
df

Update

After trying Abdou's approach, I still get the same result:

In:

df1 = df.ColB.astype(str).str.split('\n(?=http)', expand=True).stack().reset_index(drop=True, level=1).to_frame()
df2 = df1.merge(df[['ColA']], how='left', right_index=True, left_index = True)
df2.columns = ['ColB', 'ColA']
print(df2[['ColA','ColB']])

Out:

          ColA                                               ColB
0       Lemons                                                nan
1  Oranges.txt  https://www.example.com#fruitN : title: Click ...
2  Tomatos.txt                                                nan

Best answer

Try using the .str.split method on ColB and expanding the result into a DataFrame, which you can then merge back with the main DataFrame:

df1 = df.ColB.astype(str).str.split('\n(?=http)', expand=True).stack().reset_index(drop=True, level=1).to_frame()

df2 = df1.merge(df[['ColA']], how='left', right_index=True, left_index = True)

df2.columns = ['ColB', 'ColA']

print(df2[['ColA','ColB']])

#       ColA                                               ColB
# 0   Lemons                                                nan
# 1  Oranges  https://www.example.com#fruitN : title: Click ...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/#FruitName2 : tit...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 2  Tomatos                                                nan

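Note that the pattern used to split the values is my own interpretation of how they should be split. You can modify it to match the pattern you want.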
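Before changing the pattern, it can help to check what actually separates the entries in a cell. The sketch below uses only the standard `re` module on two made-up cell values (the strings and the `split_entries` helper are hypothetical, not from the question) to show how the lookahead `(?=http)` behaves with a newline separator versus a whitespace separator.

```python
import re


def split_entries(cell, separator=r"\n"):
    """Split a cell wherever `separator` is immediately followed by 'http'.

    The lookahead (?=http) is zero-width, so 'http' is not consumed and
    stays at the start of the next piece.
    """
    return re.split(separator + r"(?=http)", cell)


# Hypothetical cell values: one separated by a newline, one by spaces.
newline_cell = "https://www.example.com#a : title: one\nhttps://www.example.com#b : title: two"
spaced_cell = "https://www.example.com#a : title: one   https://www.example.com#b : title: two"

# Check which separator is really present before choosing a pattern.
print("\n" in newline_cell, "\n" in spaced_cell)

print(split_entries(newline_cell))         # splits at the newline
print(split_entries(spaced_cell))          # no newline, so nothing is split
print(split_entries(spaced_cell, r"\s+"))  # splits at the run of spaces
```

Printing `repr(df['ColB'][1])` on the real data is a quick way to see whether the separators are `\n` characters or plain spaces.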

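Edit:

As noted above, the pattern used for splitting matters a great deal here. Judging from your sample data, the values appear to be separated by whitespace rather than newlines. So you could perhaps get df1 with: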

df1 = df.ColB.astype(str).str.split(r'\s(?=http)', expand=True).stack().reset_index(drop=True, level=1).to_frame()
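For a self-contained illustration, here is a minimal sketch on a made-up three-row frame (the sample strings are hypothetical). It swaps the stack/merge steps for `DataFrame.explode`, which assumes pandas 0.25 or newer, and uses `\s+` so that a run of several spaces counts as a single separator. This is an alternative to the one-liner above, not the same method.

```python
import pandas as pd

# Hypothetical miniature of the question's data: entries in ColB are
# separated by runs of spaces, and some cells are missing.
df = pd.DataFrame({
    "ColA": ["Lemons", "Oranges", "Tomatos"],
    "ColB": [
        None,
        "https://www.example.com#a : title: one   https://www.example.com#b : title: two",
        None,
    ],
})

# Split each cell into a list of entries; missing cells stay missing.
entries = df["ColB"].str.split(r"\s+(?=http)")

# explode() gives every list element its own row and repeats ColA next to it.
result = df.assign(ColB=entries).explode("ColB").reset_index(drop=True)
print(result)
```

Because the missing cells are never converted to strings, the Lemons and Tomatos rows keep a real missing value in ColB instead of the text 'nan'.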

I hope this helps.

Regarding "python - How can I split each cell of a pandas DataFrame onto new rows, based on a character?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43326337/
