python - How to add information found with regular expressions to a CSV file

Tags: python regex csv regex-group

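I am trying to "append" new information to a CSV file. The problem is that the information is not in a dataframe structure; it is extracted from text with regular expressions. A sample of the text follows: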

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam id diam posuere, eleifend diam at, condimentum justo. Pellentesque mollis a diam id consequat.

TITLE-SDFSD-DFDS-SFDS-01-01: This is the title 1 that

is split into two lines with a blank line in the middle

Conditions Pellentesque blandit scelerisque pellentesque. Sed nec quam purus. Quisque nec tellus sed neque accumsan lacinia sit amet sit amet tellus. Etiam venenatis nibh vel pellentesque elementum. Nullam eget tortor quam. Morbi sed leo et arcu aliquet luctus.

Opening date 15 Apr 2021

Deadline 26 Aug 2021

Indicative budget: The total indicative budget for the topic is EUR 20.00 million.

TITLE-SDFSD-DFDS-SFDS-01-02; This is the title2 in one single line

Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.

Opening date 15 March 2021

Deadline 17 Aug 2021

Indicative budget: The total indicative budget for the topic is EUR 15.00 million.

TITLE-SDFSD-DFDS-SFDS-01-03: This is the title3 that is too long and takes two lines

Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.

Opening date 15 May 2021

Deadline 26 Sep 2021

Indicative budget: The total indicative budget for the topic is EUR 5.00 million.

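To extract all the information I have to make several passes over the text. I know it could be done in a single iteration by splitting the match into the groups I need, but I am struggling to find a regex that works. Instead, I use several of them: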

import re
import csv
    
with open('doubt2.txt','r', encoding="utf-8") as f:
    f_contents = f.read()

regexHOR =r'\n(TITLE-\S+-\d{2}-\d{2})[:|;](.*?)^Conditions'
regexOD = r'^Opening date\s+(\d{1,2} \w+ \d{4})\s*?'
regexDL =r'^Deadline\s+(\d+ \w+ \d+)'

patternHOR = re.compile(regexHOR, re.MULTILINE | re.DOTALL)
patternOD = re.compile(regexOD, re.MULTILINE | re.DOTALL)
patternDL = re.compile(regexDL, re.MULTILINE | re.DOTALL)

matchesHOR = patternHOR.finditer(f_contents)
matchesOD = patternOD.finditer(f_contents)
matchesDL = patternDL.finditer(f_contents)

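matchesHOR finds two groups per match, while each of the other patterns finds only one. Once I have the matches, I need to export them to a CSV file, which I do with the following code: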

with open("result.csv", "w",newline='') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
    for match in matchesHOR:
        csvfile.writerow([match.group(1), match.group(2).replace('\n', ' '),'',''])
    for match in matchesOD:
        csvfile.writerow(['','',match.group(1),''])
    for match in matchesDL:
        csvfile.writerow(['','','',match.group(1)])

The problem is that the rows I write after the matchesHOR rows are placed below them instead of alongside them, as you can see in this table:

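| Code  | Title   | Opening   | Deadline   |
|-------|---------|-----------|------------|
| Code1 | Title 1 |           |            |
| Code2 | Title 2 |           |            |
| Code3 | Title 3 |           |            |
|       |         | Opening 1 |            |
|       |         | Opening 2 |            |
|       |         | Opening 3 |            |
|       |         |           | Deadline 1 |
|       |         |           | Deadline 2 |
|       |         |           | Deadline 3 |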

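Any other suggestions for replacing these separate iterations with a single pass that identifies several groups are welcome.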

Best answer

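You need to rearrange things slightly so that all the items for a record are written to one row at the same time. The approach here is to use match_hor to find the start of each title, then use the end of that match as the starting point for match_od, which in turn serves as the starting point for match_dl: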

import re
import csv
    
with open('doubt2.txt','r', encoding="utf-8") as f:
    f_contents = f.read()

regexHOR = r'\n(TITLE-\S+-\d{2}-\d{2})[:|;](.*?)^Conditions'
regexOD = r'^Opening date\s+(\d{1,2} \w+ \d{4})\s*?'
regexDL =r'^Deadline\s+(\d+ \w+ \d+)'

patternHOR = re.compile(regexHOR, re.MULTILINE | re.DOTALL)
patternOD = re.compile(regexOD, re.MULTILINE | re.DOTALL)
patternDL = re.compile(regexDL, re.MULTILINE | re.DOTALL)

with open("result.csv", "w",newline='') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
    
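    for match_hor in patternHOR.finditer(f_contents):
        # Group 1 is the topic ID, group 2 the (possibly multi-line) title
        code, title = match_hor.group(1), match_hor.group(2).replace('\n', ' ')
        offset = match_hor.end()

        # Look for the opening date only in the text after this title
        match_od = patternOD.search(f_contents[offset:])
        offset += match_od.end()
        opening = match_od.group(1)

        # Look for the deadline only in the text after the opening date
        match_dl = patternDL.search(f_contents[offset:])
        offset += match_dl.end()
        deadline = match_dl.group(1)

        # Write all four fields of one record as a single row
        csvfile.writerow([code, title.strip(), opening, deadline])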
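
A variation on the loop above, as a minimal sketch rather than a definitive version: compiled patterns accept a start position, so `pattern.search(text, pos)` can replace the string slicing, and checking for `None` keeps a record with no following date from raising `AttributeError`. The helper name `extract_rows` and the lower-case pattern names are assumptions made for illustration; the regexes and flags are the ones from the question. As in the original loop, a date missing from one record would be picked up from the next record if one exists.

```python
import re

# Same regexes and flags as in the question
pattern_hor = re.compile(r'\n(TITLE-\S+-\d{2}-\d{2})[:|;](.*?)^Conditions',
                         re.MULTILINE | re.DOTALL)
pattern_od = re.compile(r'^Opening date\s+(\d{1,2} \w+ \d{4})\s*?',
                        re.MULTILINE | re.DOTALL)
pattern_dl = re.compile(r'^Deadline\s+(\d+ \w+ \d+)',
                        re.MULTILINE | re.DOTALL)


def extract_rows(text):
    """Return one [code, title, opening, deadline] list per title (hypothetical helper)."""
    rows = []
    for match_hor in pattern_hor.finditer(text):
        code = match_hor.group(1)
        title = match_hor.group(2).replace('\n', ' ').strip()

        # Start searching where the title match ended, without slicing the text
        match_od = pattern_od.search(text, match_hor.end())
        # Only look for a deadline if an opening date was found
        match_dl = pattern_dl.search(text, match_od.end()) if match_od else None

        rows.append([
            code,
            title,
            match_od.group(1) if match_od else '',
            match_dl.group(1) if match_dl else '',
        ])
    return rows
```

The rows can then be written in one call with `csvfile.writerows(extract_rows(f_contents))` after the header row.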

This gives you a result.csv containing:

Topic ID,Title,Opening date,Deadline
TITLE-SDFSD-DFDS-SFDS-01-01,This is the title 1 that  is split into two lines with a blank line in the middle,15 Apr 2021,26 Aug 2021
TITLE-SDFSD-DFDS-SFDS-01-02,This is the title2 in one single line,15 March 2021,17 Aug 2021
TITLE-SDFSD-DFDS-SFDS-01-03,This is the title3 that is too long and takes two lines,15 May 2021,26 Sep 2021
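
The question also asks whether one regex with several groups could do the job in a single iteration. The following is a hedged sketch of that idea, not part of the original answer: the pattern `record`, the named groups, and the helper `parse_records` are all assumptions for illustration, and the pattern assumes every record contains a `Conditions` line, an `Opening date` line, and a `Deadline` line in that order. Unlike the code above, it collapses all whitespace inside a title to single spaces.

```python
import re

# One pattern per record: topic ID, title, then the two dates further down.
# DOTALL lets ".*?" skip across lines; MULTILINE anchors "^" at line starts.
record = re.compile(
    r'^(?P<code>TITLE-\S+-\d{2}-\d{2})[:;]\s*(?P<title>.*?)\s*'
    r'^Conditions.*?'
    r'^Opening date\s+(?P<opening>\d{1,2} \w+ \d{4}).*?'
    r'^Deadline\s+(?P<deadline>\d{1,2} \w+ \d{4})',
    re.MULTILINE | re.DOTALL,
)


def parse_records(text):
    """Return one [code, title, opening, deadline] list per record (hypothetical helper)."""
    rows = []
    for m in record.finditer(text):
        # Join a title that spans several lines into one line
        title = ' '.join(m.group('title').split())
        rows.append([m.group('code'), title, m.group('opening'), m.group('deadline')])
    return rows
```

With this helper, the writing step reduces to the header row followed by `csvfile.writerows(parse_records(f_contents))`. If a record lacks one of the expected lines, the lazy `.*?` will run on into the next record, so the separate searches shown earlier are easier to guard.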

Regarding python - How to add information found with regular expressions to a CSV file, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/66048123/
