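I have a DataFrame whose columns contain bold text (wrapped in <strong> tags) that I need to extract. It has 53,000 rows and 27 columns that contain bold text.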
array(['Candidate initial submission',
'The Candidate Status has now been updated from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong> to <strong>Client CV Review</strong> and <strong>Feedback Awaiting</strong>Candidate initial submission',
'The Candidate Status has now been updated from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong> to <strong>Interview 1</strong> and <strong>Scheduled</strong> with Stage Date 02 August, 2018, 12:00 am IST - UTC +05:30'],
dtype=object)
Best answer
Use pandas.Series.str.extractall:
import pandas as pd
lst = ['Candidate initial submission',
'The Candidate Status has now been updated from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong> to <strong>Client CV Review</strong> and <strong>Feedback Awaiting</strong>Candidate initial submission',
'The Candidate Status has now been updated from <strong>CV Submitted</strong> and <strong>Feedback Pending</strong> to <strong>Interview 1</strong> and <strong>Scheduled</strong> with Stage Date 02 August, 2018, 12:00 am IST - UTC +05:30']
df = pd.DataFrame(data=lst, columns=['text'])
# Capture the text inside every <strong>...</strong> pair; one output row per match
result = df.text.str.extractall(r'<strong>(.+?)</strong>')
print(result)
Output
                         0
  match
1 0           CV Submitted
  1       Feedback Pending
  2       Client CV Review
  3      Feedback Awaiting
2 0           CV Submitted
  1       Feedback Pending
  2            Interview 1
  3              Scheduled
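Since the question also asks about appending to or replacing the cell, here is a minimal sketch of one way to fold the per-match rows back into one value per original row. The column name `bold` and the `", "` separator are assumptions chosen for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "Candidate initial submission",
    "updated from <strong>CV Submitted</strong> to <strong>Client CV Review</strong>",
]})

# extractall yields one row per match, indexed by (original row, match number);
# [0] selects the single capture group as a Series
matches = df["text"].str.extractall(r"<strong>(.+?)</strong>")[0]

# Group by the original row label and join that row's matches into one string.
# "bold" is a hypothetical column name; rows with no <strong> tag end up as NaN.
df["bold"] = matches.groupby(level=0).agg(", ".join)

print(df["bold"])
```

To replace the original cell instead of appending a new column, something like `df["text"] = df["bold"].fillna(df["text"])` would keep the untouched text for rows that have no bold parts.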
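The regular expression pattern '<strong>(.+?)</strong>' matches everything between <strong> and </strong>, capturing as little text as possible (a non-greedy match). To learn more about regular expressions, see here.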
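To see why the "as little text as possible" part matters, the following sketch compares the non-greedy `.+?` with a greedy `.+` using Python's standard `re` module. The sample string is a shortened, made-up version of the data above.

```python
import re

s = "from <strong>CV Submitted</strong> to <strong>Interview 1</strong>"

# Non-greedy: each match stops at the nearest closing tag
lazy = re.findall(r"<strong>(.+?)</strong>", s)

# Greedy: a single match runs on to the last closing tag in the string
greedy = re.findall(r"<strong>(.+)</strong>", s)

print(lazy)    # ['CV Submitted', 'Interview 1']
print(greedy)  # ['CV Submitted</strong> to <strong>Interview 1']
```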
Regarding "python - How to extract strong tags from a column in a DataFrame and append to or replace that cell?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58640352/