I have a large pandas DataFrame like this:
In:
df = pd.read_csv('/Users/user/Desktop/example.csv', sep = '|')
df
Out:
ColA ColB
0 Lemons NaN
1 Oranges https://www.example.com#fruitN : title: Click ...
2 Tomatos NaN
In:
df['ColB'][1]
Out:
'https://www.example.com#fruitN : title: Click here to show https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=200465 : title: Click to view ORANGES OK AND TOMATOES FRESH (ORANGES OK; TOMATOES FRESH) : ID #200465 : 12 Pz : TRUE : COMPANY_5 https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=203874 : title: Click to view ORANGES OK AND TOMATOES FRESH (ORANGES OK; TOMATOES FRESH) : ID #203874 : 12 Pz : TRUE : COMPANY_1 https://www.example.com#fruitName : title: Click here to show https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=076477 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #076477 : 12 Pz : TRUE : Company_7 https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=077575 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #077575 : 12 Pz : TRUE : COMPANY_2 https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=6538773 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #6538773 : 12 Pz : Discontinued : COMPANY_3 https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=090548 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #090548 : 12 Pz : TRUE : COMPANY_4 https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091226 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #091226 : 12 Pz : TRUE : COMPANY_5 https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091624 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #091624 : 12 Pz : TRUE : COMPANY_6 https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091650 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #091650 : 12 Pz : TRUE : COMPANY_1 '
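The entries inside each ColB cell are separated by a newline character (\n). How can I expand each cell into one row per newline-separated entry, without losing the ColA value that each entry belongs to? Something like this: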
ColA | ColB
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Lemons | NaN
Oranges | https://www.example.com#fruitN : title: Click here to show
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=200465 : title: Click to view ORANGES OK AND TOMATOES FRESH (ORANGES OK; TOMATOES FRESH) : ID #200465 : 12 Pz : TRUE : COMPANY_5
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=203874 : title: Click to view ORANGES OK AND TOMATOES FRESH (ORANGES OK; TOMATOES FRESH) : ID #203874 : 12 Pz : TRUE : COMPANY_1
Oranges | https://www.example.com#FruitName : title: Click here to show
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=076477 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #076477 : 12 Pz : TRUE : Company_7
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=077575 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #077575 : 12 Pz : TRUE : COMPANY_2
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=6538773 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #6538773 : 12 Pz : Discontinued : COMPANY_3
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=090548 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #090548 : 12 Pz : TRUE : COMPANY_4
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091226 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #091226 : 12 Pz : TRUE : COMPANY_5
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091624 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #091624 : 12 Pz : TRUE : COMPANY_6
Oranges | https://www.example.com/ceuerindex.cfm?event=overview.proc&ApplNo=091650 : title: Click to view TOMATOES FRESH (TOMATOES FRESH) : ID #091650 : 12 Pz : TRUE : COMPANY_1
Tomatoes| NaN
I tried:
df2 = pd.DataFrame(df.ColA.tolist(), index=df.ColB).stack().reset_index(level=1, drop=True).reset_index(name='ColB')[['ColA','ColB']]
And:
df['ColB'] = df['ColB'].str.extract('\b\n\b', expand=True)
df
Update
After trying Abdou's approach, I still get the same result:
In:
df1 = df.ColB.astype(str).str.split('\n(?=http)', expand=True).stack().reset_index(drop=True, level=1).to_frame()
df2 = df1.merge(df[['ColA']], how='left', right_index=True, left_index = True)
df2.columns = ['ColB', 'ColA']
print(df2[['ColA','ColB']])
Out:
ColA ColB
0 Lemons nan
1 Oranges.txt https://www.example.com#fruitN : title: Click ...
2 Tomatos.txt nan
Best answer
Try using the .str.split method on ColB and expanding the result into a DataFrame, which you can then merge back onto the main DataFrame:
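# Split each ColB cell on a newline that is followed by "http"; stack() then yields one row per piece
df1 = df.ColB.astype(str).str.split(r'\n(?=http)', expand=True).stack().reset_index(drop=True, level=1).to_frame()
# Bring ColA back by joining on the original row index
df2 = df1.merge(df[['ColA']], how='left', right_index=True, left_index=True)
df2.columns = ['ColB', 'ColA']
print(df2[['ColA', 'ColB']])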
# ColA ColB
# 0 Lemons nan
# 1 Oranges https://www.example.com#fruitN : title: Click ...
# 1 Oranges https://www.example.com/ceuerindex.cfm?event=o...
# 1 Oranges https://www.example.com/ceuerindex.cfm?event=o...
# 1 Oranges https://www.example.com/#FruitName2 : tit...
# 1 Oranges https://www.example.com/ceuerindex.cfm?event=o...
# 1 Oranges https://www.example.com/ceuerindex.cfm?event=o...
# 1 Oranges https://www.example.com/ceuerindex.cfm?event=o...
# 1 Oranges https://www.example.com/ceuerindex.cfm?event=o...
# 1 Oranges https://www.example.com/ceuerindex.cfm?event=o...
# 1 Oranges https://www.example.com/ceuerindex.cfm?event=o...
# 1 Oranges https://www.example.com/ceuerindex.cfm?event=o...
# 2 Tomatos nan
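Note that the pattern used for splitting reflects my own interpretation of how the values should be split. You can modify it to match the pattern you actually need.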
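On pandas 0.25 or later, a more compact route to the same row expansion is str.split followed by DataFrame.explode. This is only a sketch, not the method used in the answer: the toy frame below is made up for illustration, and it assumes the entries really are separated by newline characters.

```python
import pandas as pd

# Toy data (hypothetical): entries in ColB are separated by real newline characters
df = pd.DataFrame({
    "ColA": ["Lemons", "Oranges", "Tomatos"],
    "ColB": [
        None,
        "https://www.example.com#a : x\nhttps://www.example.com/b : y\nhttps://www.example.com/c : z",
        None,
    ],
})

# Split each cell into a list of entries; missing values stay missing
parts = df["ColB"].str.split(r"\n(?=http)")

# explode() emits one row per list element and repeats ColA (and the index) for each
expanded = df.assign(ColB=parts).explode("ColB")
print(expanded)
```

Unlike the astype(str) route, this keeps missing cells as missing values instead of turning them into the string 'nan'. The original row index is repeated for every piece, so reset_index(drop=True) can be chained on if a fresh index is preferred.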
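Edit:
As noted above, the pattern used for splitting matters a great deal here. Judging from your sample data, the values appear to be separated by spaces rather than newlines. So perhaps you can get df1 with the following: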
df1 = df.ColB.astype(str).str.split(r'\s(?=http)', expand=True).stack().reset_index(drop=True, level=1).to_frame()
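One way to find out which separator the data really contains is to inspect a cell with repr() and to test for a literal newline. The sketch below uses a made-up cell value, and the final split relies on Series.explode (pandas 0.25+), which is an assumption about the installed version rather than part of the original answer.

```python
import pandas as pd

# Hypothetical cell whose entries are separated by a single space, not "\n"
cell = "https://www.example.com#a : x https://www.example.com/b : y"
s = pd.Series([cell])

# repr() makes invisible characters explicit: a real newline would show up as \n
print(repr(s[0]))

# Count the cells that contain an actual newline character
has_newline = s.str.contains("\n", regex=False)
print(has_newline.sum())

# \s matches any whitespace (space, tab, newline), so this pattern splits in either case
pieces = s.str.split(r"\s(?=http)").explode().tolist()
print(pieces)
```

If the newline count comes back as zero, a pattern built on \n has nothing to match and every cell stays in one piece, which would explain an unchanged result.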
I hope this helps.
Source: "How to separate each cell of a pandas DataFrame into new rows, based on a character?" on Stack Overflow: https://stackoverflow.com/questions/43326337/