python - How can I split each cell of a pandas DataFrame into new rows based on a character?

I have a large pandas DataFrame like this (this is the data):

In:

df = pd.read_csv('/Users/user/Desktop/example.csv', sep = '|')
df

Out:

    ColA   ColB
0   Lemons  NaN
1   Oranges https://www.example.com#fruitN : title: Click ...
2   Tomatos NaN

In:

df['ColB'][1]

Out:

'https://www.example.com#fruitN : title: Click here to show   https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=200465 : title: Click to view ORANGES OK AND TOMATOES FRESH (ORANGES OK; TOMATOES FRESH) : ID  #200465 : 12 Pz : TRUE : COMPANY_5    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=203874 : title: Click to view ORANGES OK AND TOMATOES FRESH (ORANGES OK; TOMATOES FRESH) : ID  #203874 : 12 Pz : TRUE : COMPANY_1    https://www.example.com#fruitName : title: Click here to show   https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=076477 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #076477 : 12 Pz : TRUE : Company_7    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=077575 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #077575 : 12 Pz : TRUE : COMPANY_2    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=6538773 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #6538773 : 12 Pz : Discontinued : COMPANY_3    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=090548 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #090548 : 12 Pz : TRUE : COMPANY_4    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091226 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #091226 : 12 Pz : TRUE : COMPANY_5    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091624 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #091624 : 12 Pz : TRUE : COMPANY_6    https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091650 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #091650 : 12 Pz : TRUE : COMPANY_1    '

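Every cell in ColB contains newline characters (\n). How can I expand each cell into separate rows, one per newline-separated piece, without losing the ColA value that the pieces belong to? Something like this: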

ColA    | ColB
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Lemons  | NaN
Oranges | https://www.example.com#fruitN : title: Click here to show
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=200465 : title: Click to view ORANGES OK AND TOMATOES FRESH (ORANGES OK; TOMATOES FRESH) : ID  #200465 : 12 Pz : TRUE : COMPANY_5    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=203874 : title: Click to view ORANGES OK AND TOMATOES FRESH (ORANGES OK; TOMATOES FRESH) : ID  #203874 : 12 Pz : TRUE : COMPANY_1    
Oranges | https://www.example.com#FruitName : title: Click here to show   
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=076477 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #076477 : 12 Pz : TRUE : Company_7    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=077575 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #077575 : 12 Pz : TRUE : COMPANY_2    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=6538773 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #6538773 : 12 Pz : Discontinued : COMPANY_3    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=090548 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #090548 : 12 Pz : TRUE : COMPANY_4    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091226 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #091226 : 12 Pz : TRUE : COMPANY_5    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091624 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #091624 : 12 Pz : TRUE : COMPANY_6    
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091650 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID  #091650 : 12 Pz : TRUE : COMPANY_1
Tomatos | NaN

I tried:

df2 = pd.DataFrame(df.ColA.tolist(), index=df.ColB).stack().reset_index(level=1, drop=True).reset_index(name='ColB')[['ColA','ColB']]

and:

df['ColB'] = df['ColB'].str.extract('\b\n\b', expand=True)
df

Update

After trying Abdou's approach, I still get the same result:

In:

df1 = df.ColB.astype(str).str.split('\n(?=http)', expand=True).stack().reset_index(drop=True, level=1).to_frame()
df2 = df1.merge(df[['ColA']], how='left', right_index=True, left_index = True)
df2.columns = ['ColB', 'ColA']
print(df2[['ColA','ColB']])

Out:

          ColA                                               ColB
0       Lemons                                                nan
1  Oranges.txt  https://www.example.com#fruitN : title: Click ...
2  Tomatos.txt                                                nan

Best answer

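Try the .str.split method on ColB with expand=True, so that the pieces come back as a DataFrame, and then merge that result back into the main DataFrame: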

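# Split each ColB cell at every newline that is followed by "http",
# then stack the pieces so that each one gets its own row.
df1 = df.ColB.astype(str).str.split(r'\n(?=http)', expand=True).stack().reset_index(drop=True, level=1).to_frame()

# Bring ColA back by joining on the original row index.
df2 = df1.merge(df[['ColA']], how='left', right_index=True, left_index=True)

df2.columns = ['ColB', 'ColA']

print(df2[['ColA','ColB']])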

#       ColA                                               ColB
# 0   Lemons                                                nan
# 1  Oranges  https://www.example.com#fruitN : title: Click ...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/#FruitName2 : tit...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 1  Oranges  https://www.example.com/ceuerindex.cfm?event=o...
# 2  Tomatos                                                nan

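Note that the pattern used for splitting is only my own interpretation of where the values should be split. You can modify it to match the pattern you actually want.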
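On pandas 0.25 or later, a shorter route to the same shape is to split each cell into a list and call DataFrame.explode. The following is only a sketch under assumptions: the three-row frame and its URLs are made up to stand in for the question's data, and the cells are assumed to really contain newline characters. Unlike the version above it skips astype(str), so missing cells stay NaN instead of becoming the string 'nan'.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data; the URLs are made up.
df = pd.DataFrame({
    "ColA": ["Lemons", "Oranges", "Tomatos"],
    "ColB": [
        np.nan,
        "https://www.example.com#a : title: one\n"
        "https://www.example.com#b : title: two\n"
        "https://www.example.com#c : title: three",
        np.nan,
    ],
})

# Split each cell into a list of pieces (NaN cells stay NaN), then let
# explode() give every piece its own row. ColA and the original index
# label are repeated for each piece.
out = df.assign(ColB=df["ColB"].str.split(r"\n(?=http)")).explode("ColB")

print(out)
```

Because the original index labels are repeated, out.reset_index(drop=True) can be appended if a fresh 0..n-1 index is preferred.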

Edit:

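As noted above, the pattern used for splitting matters a great deal here. Judging from your sample data, the values appear to be separated by spaces rather than newline characters. So perhaps you can get df1 with the following instead: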

df1 = df.ColB.astype(str).str.split(r'\s(?=http)', expand=True).stack().reset_index(drop=True, level=1).to_frame()
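To check which separator a cell really contains before choosing a pattern, it can help to look at its repr() and to test for "\n" directly. The sketch below does that on a made-up one-row frame (the cell text and URLs are hypothetical). It then splits on the whole run of whitespace before each URL with r'\s+(?=http)', a small variation on the pattern above, so that no padding is left on the pieces.

```python
import pandas as pd

# Hypothetical cell: three records separated by runs of spaces, no newlines.
cell = (
    "https://www.example.com#a : title: one   "
    "https://www.example.com#b : title: two    "
    "https://www.example.com#c : title: three    "
)
df = pd.DataFrame({"ColA": ["Oranges"], "ColB": [cell]})

# repr() shows escape sequences explicitly, so a real newline would appear as \n.
print(repr(df["ColB"][0]))
has_newline = df["ColB"].str.contains("\n", regex=False)
print(has_newline.any())

# Split on the full run of whitespace that precedes each URL, stack the
# pieces into rows, and strip any padding left at the ends.
pieces = (
    df["ColB"]
    .str.split(r"\s+(?=http)", expand=True)
    .stack()
    .str.strip()
    .reset_index(level=1, drop=True)
    .rename("ColB")
)

# Join on the original row index so every piece keeps its ColA value.
out = df[["ColA"]].join(pieces)
print(out)
```

If has_newline.any() turns out to be True for the real data, the newline-based pattern is the one to use.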

I hope this helps.

Regarding "python - How can I split each cell of a pandas DataFrame into new rows based on a character?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43326337/
