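I am trying to parse the content of two different tags in a txt file. I get every instance of the first tag, "p", but none of the second tag, "l". Is the "or" the problem?
Thanks for your help. Here is the code I am using: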
with open('standardA00456.txt','w') as output_file:
    with open('standardA00456.txt','r') as open_file:
        the_whole_file = open_file.read()
        start_position = 0
        while True:
            start_position = the_whole_file.find('<p>' or '<l>', start_position)
            end_position = the_whole_file.find('</p>' or '</l>', start_position)
            data = the_whole_file[start_position:end_position+5]
            output_file.write(data + "\n")
            start_position = end_position
Best answer
'<p>' or '<l>' will always be equal to '<p>', because it tells Python to use '<l>' only if '<p>' is None, False, numerically zero, or empty. Since the string '<p>' is never any of those, '<l>' is always skipped:
>>> '<p>' or '<l>'
'<p>'
>>> None or '<l>'
'<l>'
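If you would rather keep a str.find loop, one option is to look for each opening tag separately and take whichever comes first. The following is only a minimal sketch, not code from the question or the answer: extract_tagged and the sample string are hypothetical names introduced for illustration, and it assumes the tags are not nested and carry no attributes.

```python
def extract_tagged(text, tags=("p", "l")):
    """Return every <tag>...</tag> span for the given tag names, in document order."""
    results = []
    position = 0
    while True:
        # Find the next opening tag of each kind, skipping kinds that no longer occur.
        candidates = []
        for tag in tags:
            start = text.find("<" + tag + ">", position)
            if start != -1:
                candidates.append((start, tag))
        if not candidates:
            break  # no opening tag left: stop instead of looping forever
        start, tag = min(candidates)  # the earliest opening tag wins
        closing = "</" + tag + ">"
        end = text.find(closing, start)
        if end == -1:
            break  # unclosed tag: stop rather than slice with -1
        end += len(closing)  # include the closing tag itself
        results.append(text[start:end])
        position = end
    return results


# Hypothetical sample text for illustration.
sample = "<p>first</p> noise <l>second</l>\n<p>third</p>"
print(extract_tagged(sample))
```

Unlike the loop in the question, this version stops when find returns -1, and it uses the real length of the closing tag instead of a fixed offset.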
Instead, you can simply use re.findall:
import re

# Read the input first; opening the same path with 'w' truncates it immediately.
with open('standardA00456.txt', 'r') as open_f:
    text = open_f.read()

p_or_ls = re.findall(r'(?:<p>.*?</p>)|(?:<l>.*?</l>)',
                     text,
                     flags=re.DOTALL)  # DOTALL lets . match newline characters too

with open('standardA00456.txt', 'w') as out_f:
    for p_or_l in p_or_ls:
        out_f.write(p_or_l + "\n")
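To see what the pattern matches without touching a file, here is a small sketch that runs it on an in-memory string. The sample text is made up for illustration. Because .*? is non-greedy, each match stops at the first closing tag, and re.DOTALL allows a match to span line breaks.

```python
import re

# Hypothetical sample text; the <l> element spans two lines.
sample = "<p>one</p>junk<l>two\nlines</l><p>three</p>"

# Same pattern as above: non-capturing groups, so findall returns whole matches.
pattern = re.compile(r"(?:<p>.*?</p>)|(?:<l>.*?</l>)", flags=re.DOTALL)
matches = pattern.findall(sample)

for match in matches:
    print(repr(match))
```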
However, parsing tagged files (such as HTML and XML) with regular expressions is not a good idea. Using a module such as BeautifulSoup is safer:
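from bs4 import BeautifulSoup

# Read the input first; opening the same path with 'w' truncates it immediately.
with open('standardA00456.txt', 'r') as open_f:
    soup = BeautifulSoup(open_f.read(), 'html.parser')  # name the parser explicitly

with open('standardA00456.txt', 'w') as out_f:
    for p_or_l in soup.find_all(["p", "l"]):
        out_f.write(str(p_or_l) + "\n")  # a Tag must be converted to str before concatenation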
For more on parsing tags in a txt file with Python, see the similar question on Stack Overflow: https://stackoverflow.com/questions/25230041/