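I want to read the text between two markers ("#*" and "#@") from a file. My file contains thousands of records in the format shown below. I tried the following code, but it does not return the desired output.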
import re
start = '#*'
end = '#@'
myfile = open('lorem.txt')
for line in myfile:
    line = line.rstrip()
    print(line[line.find(start)+len(start):line.rfind(end)])
myfile.close()
My input:
#*OQL[C++]: Extending C++ with an Object Query Capability
#@José A. Blakeley
#t1995
#cModern Database Systems
#index0
#*Transaction Management in Multidatabase Systems
#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz
#t1995
#cModern Database Systems
#index1
My output:
51103
OQL[C++]: Extending C++ with an Object Query Capability
t199
cModern Database System
index
...
Expected output:
OQL[C++]: Extending C++ with an Object Query Capability
Transaction Management in Multidatabase Systems
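Best answer
You are reading the file line by line, but your matches span multiple lines: "#*" and "#@" never appear on the same line. You need to read the whole file in and process it with a regular expression that can match any character, including line breaks: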
import re
start = '#*'
end = '#@'
# Escape special chars and build the pattern dynamically;
# the group captures only the text between the two markers
rx = r'{}(.*?){}'.format(re.escape(start), re.escape(end))
with open('lorem.txt') as myfile:
    contents = myfile.read()  # Read the whole file into a variable
for match in re.findall(rx, contents, re.S):  # Note re.S will make . match line breaks, too
    # Process each match individually
    print(match.strip())
See the regex demo.
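To try the approach without a file, here is a minimal sketch that runs the same kind of pattern on an in-memory copy of the sample input from the question. The helper name `extract_between` and the `sample` string are assumptions made for illustration, not part of the original answer. The capture group keeps the markers themselves out of the results.

```python
import re


def extract_between(contents, start="#*", end="#@"):
    """Return the text found between each start/end marker pair.

    Hypothetical helper for illustration only.
    """
    # re.escape protects characters such as '*' that are special in a regex;
    # (.*?) lazily captures everything up to the nearest end marker.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    # re.S lets '.' match newlines, so a match may span several lines.
    return [m.strip() for m in re.findall(pattern, contents, flags=re.S)]


# In-memory stand-in for the file contents shown in the question.
sample = (
    "#*OQL[C++]: Extending C++ with an Object Query Capability\n"
    "#@José A. Blakeley\n"
    "#t1995\n"
    "#cModern Database Systems\n"
    "#index0\n"
    "#*Transaction Management in Multidatabase Systems\n"
    "#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz\n"
    "#t1995\n"
    "#cModern Database Systems\n"
    "#index1\n"
)

for title in extract_between(sample):
    print(title)
```

Because the whole input is held in memory at once, this is probably fine for a few thousand records; for very large files a line-based scan may be a better fit.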
Regarding regex - how to extract text between two substrings from a file in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57143822/