python - 在 Python 中使用正则表达式定位回车符和换行符的问题

标签 python regex python-2.7 newline carriage-return

我有一些代码独立于一个更大的程序工作,但现在在更大的程序中它似乎不起作用——即它没有执行所需的操作。

问题出现在第 4 步(见下文),经过反射(reflection),我在字符类中的预期逻辑(即“除回车外的所有内容”)似乎没有正确编码(但我没有知道如何“表达”逻辑)。

我的目标只是用段落标签包裹每一行或每一段。

Python代码

import re

# 1.  open the html file in read mode
html_file = open('test.html', 'r')

# 2.  convert to string
html_file_as_string = html_file.read()

# 3.  close the html file
html_file.close()

# 4.  replace carriage returns with closing and opening paragraph tags
html_file_as_string = re.sub('([^\r]*)\r', r'\1</p>\n<p>', html_file_as_string)

# 5.  remove time and date
html_file_as_string = re.sub(r'(Lorem ipsum \d*/\d*/\d*, \d*:\d* [a-z]{2})', r"", html_file_as_string)

# 6.  remove the white space after the opening paragraph tags
html_file_as_string = re.sub('<p>\n*\s*', r"<p>", html_file_as_string)

# 7.  remove the white space before the closing paragraph tags
html_file_as_string = re.sub('\s*</p>', r"</p>", html_file_as_string)

# 8.  open the file in write mode to clear
html_file = open('test.html', 'w')

# 9.  write the new contents to file
html_file.write(html_file_as_string)

# 10.  print to screen so we can see what is happening
print html_file_as_string

# 11.  close the html file
html_file.close()

这是 HTML 文件的内容:

<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
   Lorem ipsum..consectetur adipiscing elit.
   Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
   Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit."Lorem ipsum dolor sit amet", consectetur adipisc'ing elit.Lorem ipsum dolor...sit amet, consectetur adipiscing elit..
   Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.
   Lorem ipsum dolor sit amet, consectetur adipiscing elit..
   .....Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum 01/01/05, 05:00 am</p>

这是在 SciTE 编辑器中查看的文件内容(因此可以看到空格、回车符和换行符)。

enter image description here

编辑:

我根据下面的建议更改了正则表达式,然后将替换加倍两次(从第 4 步中可见的原始代码的更改和第 4 步之前的第 6 步的复制)。

工作代码:

import re

# 1.  open the html file in read mode
html_file = open('test.html', 'r')

# 2.  convert to string
html_file_as_string = html_file.read()

# 3.  close the html file
html_file.close()

# 6(added).  remove the white space after the opening paragraph tags
html_file_as_string = re.sub('<p>\n*\s*', r"<p>", html_file_as_string)

# 4(changed).  replace carriage returns with closing and opening paragraph tags
html_file_as_string = re.sub('([^\r\n]*)(\r\n?|\n)', r'\1</p>\2<p>', html_file_as_string)

# 5.  remove time and date
html_file_as_string = re.sub(r'(Lorem ipsum \d*/\d*/\d*, \d*:\d* [a-z]{2})', r"", html_file_as_string)

# 6.  remove the white space after the opening paragraph tags
html_file_as_string = re.sub('<p>\n*\s*', r"<p>", html_file_as_string)

# 7.  remove the white space before the closing paragraph tags
html_file_as_string = re.sub('\s*</p>', r"</p>", html_file_as_string)

# 8.  open the file in write mode to clear
html_file = open('test.html', 'w')

# 9.  write the new contents to file
html_file.write(html_file_as_string)

# 10.  print to screen so we can see what is happening
print html_file_as_string

# 11.  close the html file
html_file.close()

编辑 2:

上面的代码在代码的其他部分过于激进,做了太多的修改,回到绘图板。

最佳答案

我认为您最好遍历字符串,而不是像 Python 一样尝试使用正则表达式来做一些事情,您可以通过几个步骤完成此操作

import re
parsed_html = []

# Open the file and close it after being read
with open('test.html', 'r') as html_file:
    lines = html_file.readlines()

# Iterate through each line using \r\n, \r or \n as separators 
for line in lines:
    # Remove whitespace chars before and after the content (6 & 7)
    line = line.strip()

    # Skip empty lines (you could merge this and latter if)
    if not line:
         continue

    # Skip lines with only <p> or </p> (check output and remove if not needed)
    if re.match("^</?p>$", line):
         continue

    # Remove the date line (using + instead of * as * matches 0 or more entries)
    line = re.sub(r'Lorem ipsum \d+/\d+/\d+, \d+:\d+ (am|pm)', '', line)

    # Make sure we add p tag and CL/LR to the end of each line
    line = "<p>{0}</p>\r\n".format(line)

    # Append current line to a list that we will use to write the file
    parsed_html.append(line)

# Write contents of the parsed html to the file
with open('test.html', 'w') as html_file:
    html_file.writelines(parsed_html)

关于python - 在 Python 中使用正则表达式定位回车符和换行符的问题,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/17131353/

相关文章:

python - matplotlib 中的 3D 离散热图

java - 检查特殊字符 "$"的正则表达式

python : Using regex to get episode

python - 使用 python os 或 subprocess 模块在后台运行命令行程序

python - 排序/分组中的 Lambda 函数

python-2.7 - 适用于 Windows 的 Dbus-Python

javascript - 如何为网页 <select> 框创建复杂的数据绑定(bind)?

python - 印加开放实验 Python

python - PySpark 插入到覆盖

java - 正则表达式匹配除特定模式之外的任何内容