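I'm trying to "append" new information to a CSV file. The problem is that the information is not in a dataframe structure; it is extracted from text using regular expressions. A sample of the text is the following: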
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam id diam posuere, eleifend diam at, condimentum justo. Pellentesque mollis a diam id consequat.
TITLE-SDFSD-DFDS-SFDS-01-01: This is the title 1 that
is split into two lines with a blank line in the middle
Conditions Pellentesque blandit scelerisque pellentesque. Sed nec quam purus. Quisque nec tellus sed neque accumsan lacinia sit amet sit amet tellus. Etiam venenatis nibh vel pellentesque elementum. Nullam eget tortor quam. Morbi sed leo et arcu aliquet luctus.
Opening date 15 Apr 2021
Deadline 26 Aug 2021
Indicative budget: The total indicative budget for the topic is EUR 20.00 million.
TITLE-SDFSD-DFDS-SFDS-01-02; This is the title2 in one single line
Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.
Opening date 15 March 2021
Deadline 17 Aug 2021
Indicative budget: The total indicative budget for the topic is EUR 15.00 million.
TITLE-SDFSD-DFDS-SFDS-01-03: This is the title3 that is too long and takes two lines
Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.
Opening date 15 May 2021
Deadline 26 Sep 2021
Indicative budget: The total indicative budget for the topic is EUR 5.00 million.
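To extract everything I need, I have to iterate over the text several times. I know it could be done in a single iteration by splitting the text into the groups I need, but I'm struggling to find a regex that works. Instead, I use several of them: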
import re
import csv
with open('doubt2.txt', 'r', encoding="utf-8") as f:
    f_contents = f.read()
regexHOR =r'\n(TITLE-\S+-\d{2}-\d{2})[:|;](.*?)^Conditions'
regexOD = r'^Opening date\s+(\d{1,2} \w+ \d{4})\s*?'
regexDL =r'^Deadline\s+(\d+ \w+ \d+)'
patternHOR = re.compile(regexHOR, re.MULTILINE | re.DOTALL)
patternOD = re.compile(regexOD, re.MULTILINE | re.DOTALL)
patternDL = re.compile(regexDL, re.MULTILINE | re.DOTALL)
matchesHOR = patternHOR.finditer(f_contents)
matchesOD = patternOD.finditer(f_contents)
matchesDL = patternDL.finditer(f_contents)
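matchesHOR finds two groups, while each of the other matches has only one group. Once I have the matches, I need to export them to a CSV file, which I do with the following code: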
with open("result.csv", "w", newline='') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
    for match in matchesHOR:
        csvfile.writerow([match.group(1), match.group(2).replace('\n', ' '), '', ''])
    for match in matchesOD:
        csvfile.writerow(['', '', match.group(1), ''])
    for match in matchesDL:
        csvfile.writerow(['', '', '', match.group(1)])
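The problem is that whatever I write after the matchesHOR loop ends up in new rows below it instead of on the same row as its title, as you can see in this table: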
Any other suggestions for identifying the groups without running several separate iterations are also welcome.
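For illustration, here is a minimal in-memory sketch of that symptom. The IDs, titles and dates are made-up stand-ins for the three separate match lists, not values taken from doubt2.txt. Each writerow call emits one complete row, so three consecutive loops produce three stacked groups of rows rather than one row per topic.

```python
import csv
import io

# Made-up values standing in for the results of the three separate regexes.
titles = [('ID-01', 'First title'), ('ID-02', 'Second title')]
openings = ['15 Apr 2021', '15 March 2021']
deadlines = ['26 Aug 2021', '17 Aug 2021']

out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')

# Same shape as the loops above: every writerow() call starts a new row.
for code, title in titles:
    writer.writerow([code, title, '', ''])
for opening in openings:
    writer.writerow(['', '', opening, ''])
for deadline in deadlines:
    writer.writerow(['', '', '', deadline])

staggered = out.getvalue()
print(staggered)
```

With two topics this writes six rows instead of two, with the dates in rows of their own below the titles.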
Best answer
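You need to rearrange things slightly so that all the items for a topic are written to one row at the same time. The approach here is to use match_hor to find the start of each title, then use the end of that match as the starting point for match_od, which in turn is used as the starting point for match_dl.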
import re
import csv

with open('doubt2.txt', 'r', encoding="utf-8") as f:
    f_contents = f.read()

regexHOR = r'\n(TITLE-\S+-\d{2}-\d{2})[:|;](.*?)^Conditions'
regexOD = r'^Opening date\s+(\d{1,2} \w+ \d{4})\s*?'
regexDL = r'^Deadline\s+(\d+ \w+ \d+)'
patternHOR = re.compile(regexHOR, re.MULTILINE | re.DOTALL)
patternOD = re.compile(regexOD, re.MULTILINE | re.DOTALL)
patternDL = re.compile(regexDL, re.MULTILINE | re.DOTALL)

with open("result.csv", "w", newline='') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
    for match_hor in patternHOR.finditer(f_contents):
        # Topic ID and title; a multi-line title is joined into one line.
        code, title = [match_hor.group(1), match_hor.group(2).replace('\n', ' ')]
        # Look for the opening date only in the text after this title.
        offset = match_hor.end()
        match_od = patternOD.search(f_contents[offset:])
        offset += match_od.end()
        opening = match_od.group(1)
        # Look for the deadline only in the text after the opening date.
        match_dl = patternDL.search(f_contents[offset:])
        offset += match_dl.end()
        deadline = match_dl.group(1)
        # Write all four fields of this topic as a single row.
        csvfile.writerow([code, title.strip(), opening, deadline])
This gives you a result.csv containing:
Topic ID,Title,Opening date,Deadline
TITLE-SDFSD-DFDS-SFDS-01-01,This is the title 1 that is split into two lines with a blank line in the middle,15 Apr 2021,26 Aug 2021
TITLE-SDFSD-DFDS-SFDS-01-02,This is the title2 in one single line,15 March 2021,17 Aug 2021
TITLE-SDFSD-DFDS-SFDS-01-03,This is the title3 that is too long and takes two lines,15 May 2021,26 Sep 2021
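The question also asks whether one iteration with several groups could do the job. The following is only a sketch of that idea, not part of the answer above: the combined pattern RECORD, the helper extract_rows and the shortened sample text are all hypothetical, and the pattern assumes every topic block has a Conditions line, an Opening date line and a Deadline line, in that order.

```python
import csv
import io
import re

# Hypothetical combined pattern: one match per topic block.
RECORD = re.compile(
    r'^(?P<code>TITLE-\S+-\d{2}-\d{2})[:;]'                # topic ID, then ':' or ';'
    r'(?P<title>.*?)'                                      # title, may span several lines
    r'^Conditions.*?'                                      # skip the conditions text
    r'^Opening date\s+(?P<opening>\d{1,2} \w+ \d{4}).*?'   # opening date
    r'^Deadline\s+(?P<deadline>\d{1,2} \w+ \d{4})',        # deadline
    re.MULTILINE | re.DOTALL,
)

def extract_rows(text):
    """Return one [code, title, opening, deadline] list per topic block."""
    rows = []
    for m in RECORD.finditer(text):
        # Collapse line breaks and repeated whitespace inside the title.
        title = ' '.join(m.group('title').split())
        rows.append([m.group('code'), title, m.group('opening'), m.group('deadline')])
    return rows

# Shortened stand-in for the contents of doubt2.txt.
sample = (
    "Lorem ipsum dolor sit amet.\n"
    "TITLE-SDFSD-DFDS-SFDS-01-01: This is the title 1 that\n"
    "\n"
    "is split into two lines\n"
    "Conditions Pellentesque blandit.\n"
    "Opening date 15 Apr 2021\n"
    "Deadline 26 Aug 2021\n"
    "Indicative budget: EUR 20.00 million.\n"
    "TITLE-SDFSD-DFDS-SFDS-01-02; This is the title2 in one single line\n"
    "Conditions Cras egestas.\n"
    "Opening date 15 March 2021\n"
    "Deadline 17 Aug 2021\n"
)

rows = extract_rows(sample)

# Write to an in-memory buffer here; a real file opened with newline='' works the same way.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
writer.writerows(rows)
print(out.getvalue())
```

One trade-off to keep in mind: because the gaps are matched lazily with DOTALL, a block that is missing its Opening date or Deadline line would let the match run on into the next block. The chained searches above fail in a similar way, so either version may need extra checks on real data.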
Regarding "python - How to add information found with regular expressions to a CSV file", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/66048123/