I am reading a large text file in Python that looks like the excerpt below (it contains many Code and Description entries).
Over-ride Flag for Site/Laterality/Morphology (Interfield Edit 42)
This field is used to identify whether a case was reviewed and coding confirmed
for paired-organ primary
site cases with an in situ behavior and the laterality is not coded right,
left, or one side involved, right or left
origin not specified.
Code Description
Blank Not reviewed, or reviewed and corrected
1 Reviewed and confirmed as reported: A patient had behavior
code of in situ and laterality is not
stated as right: origin of primary; left: origin of primary; or only one side
involved, right or left
origin not specified
This field is used to identify whether a case was reviewed and coding confirmed
for cases with a non-
specific laterality code.
Code Description
Blank1 Not reviewed
11 A patient had laterality
coded non-specifically and
extension coded specifically
This field, new for 2018, indicates whether a case was reviewed and coding
............
From the free text above, I only need to store the code and description values in two lists, like this:
code = ["Blank", "1", "Blank1", "11"]
des = ["Not reviewed, or reviewed and corrected", "Reviewed and confirmed as reported: A patient had behavior code of in situ and laterality is not stated as right: origin of primary; left: origin of primary; or only one side involved, right or left origin not specified", "Not reviewed", "A patient had laterality coded non-specifically and extension coded specifically"]
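How can I do this in Python?
Note:
A Code can be the keyword "Blank" (or "Blank1") or a numeric value. Sometimes a Description is split across multiple lines. In the example above, each Code and Description block contains two codes and two descriptions. However, a Code and Description block can contain one or more codes and descriptions.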
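Best answer
This can be solved with a simple algorithm / state machine. The code below opens a file named "datafile.txt" in the same directory as the Python script, parses it, and prints the result. The algorithm relies on two assumptions: blank lines occur only between two fields, and any line that starts a description we want to record separates its code from its description by three or more spaces. As far as I can tell from your file snippet, these assumptions always hold.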
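index = -1                 # position of the most recently recorded description
record = False             # True while inside a Code/Description block
description_block = False  # True once a continuation line has been seen
codes = []
descriptions = []

with open("datafile.txt", "r") as file:
    for line in file:
        # Split on three-space runs (the column gap assumed above) and trim each piece.
        # A blank line becomes [""], so line[0] is falsy for it.
        line = [portion.strip() for portion in line.split("   ") if portion != ""]
        if record:
            if len(line) == 2:
                # "code   description" row: start a new entry.
                index += 1
                codes.append(line[0])
                descriptions.append(line[1])
            else:
                if line[0]:
                    description_block = True
                if description_block:
                    if not line[0]:
                        # Blank line after continuation lines ends the block.
                        description_block = False
                        record = False
                        continue
                    else:
                        # Continuation line: append it to the current description.
                        descriptions[index] += " " + line[0]
        if line[0] == "Code":
            # "Code   Description" header: start recording from the next line.
            record = True

print("codes:", codes)
print("descriptions:", descriptions)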
Result:
codes: ['Blank', '1', 'Blank1', '11']
descriptions: ['Not reviewed, or reviewed and corrected', 'Reviewed and confirmed as reported: A patient had behavior code of in situ and laterality is not stated as right: origin of primary; left: origin of primary; or only one side involved, right or left origin not specified', 'Not reviewed', 'A patient had laterality coded non-specifically and extension coded specifically']
Tested with Python 3.8.2.
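For experimenting without a file on disk, the same state-machine idea can be sketched as a function over any iterable of lines. This is a minimal sketch, not the answer's original code: the function name `parse_code_blocks` and the sample lines are hypothetical, and it assumes the column gap is three or more whitespace characters and that a blank line ends each block.

```python
import re

def parse_code_blocks(lines):
    """Collect (codes, descriptions) from an iterable of text lines."""
    codes, descriptions = [], []
    recording = False
    for raw in lines:
        # Assumption: columns are separated by three or more whitespace characters.
        parts = [p for p in re.split(r"\s{3,}", raw.strip()) if p]
        if parts[:2] == ["Code", "Description"]:
            recording = True                      # header line opens a block
        elif not recording:
            continue                              # free text outside a block
        elif not parts:
            recording = False                     # blank line closes the block
        elif len(parts) >= 2:
            codes.append(parts[0])                # new "code   description" row
            descriptions.append(" ".join(parts[1:]))
        elif descriptions:
            descriptions[-1] += " " + parts[0]    # wrapped description line
    return codes, descriptions

# Hypothetical sample mimicking the layout described in the question.
sample = [
    "This field is used to identify whether a case was reviewed",
    "",
    "Code   Description",
    "Blank   Not reviewed, or reviewed and corrected",
    "1       Reviewed and confirmed as reported: A patient had behavior",
    "code of in situ and laterality is not",
    "",
    "This field is used to identify whether a case was reviewed",
    "Code   Description",
    "Blank1   Not reviewed",
    "11       A patient had laterality",
    "coded non-specifically and",
    "extension coded specifically",
    "",
    "This field, new for 2018, indicates whether a case was reviewed",
]

codes, descriptions = parse_code_blocks(sample)
print(codes)  # ['Blank', '1', 'Blank1', '11']
```

Passing an open file object instead of `sample` works the same way, since a file iterates line by line.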
Edit:
Updated the code to reflect the entire data file provided in the comments.
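# Three spaces, following the "three or more spaces" assumption stated above;
# adjust this to the actual column gap in your file.
column_separator = "   "

index = -1
record = False            # True while inside a Code/Description block
block_exit = False
break_on_newline = False  # True after one blank line inside a wrapped description
codes = []
descriptions = []
templine = ""             # buffered continuation text not yet attached to a description

def add(line):
    """Record a new code/description pair."""
    global index, block_exit
    index += 1
    block_exit = False
    codes.append(line[0])
    descriptions.append(line[1])

def split_columns(text):
    """Split a raw line into trimmed columns; a blank line yields [""]."""
    return [portion.strip() for portion in text.split(column_separator) if portion != ""]

with open("test", "r", encoding="utf-8") as file:
    while True:
        line = file.readline()
        if not line:
            break  # end of file
        if record:
            line = split_columns(line)
            if len(line) > 1:
                add(line)
            else:
                if block_exit:
                    record = False
                    block_exit = False
                else:
                    if line[0]:
                        # Continuation of the current description.
                        descriptions[index] += " " + line[0]
                    else:
                        # Blank line: look ahead to decide whether the block continues.
                        while True:
                            line = split_columns(file.readline())
                            if not line:
                                break  # end of file
                            if len(line) > 1:
                                # Another code row: flush buffered text, then record it.
                                if templine:
                                    descriptions[index] += templine
                                    templine = ""
                                add(line)
                                break
                            else:
                                if line[0] and "Instructions" not in line[0]:
                                    templine += " " + line[0]
                                else:
                                    if break_on_newline:
                                        # Second blank/"Instructions" line: the block is over.
                                        break_on_newline = False
                                        record = False
                                        templine = ""
                                        break
                                    else:
                                        templine += " " + line[0]
                                        break_on_newline = True
        else:
            # Header of a block, with any amount of whitespace between the two words.
            if line.split() == ["Code", "Description"]:
                record = True

print("codes:", codes)
print("\n")
print("descriptions:", descriptions)

# for i in range(len(codes)):
#     print(codes[i] + "\t\t", descriptions[i])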
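Once the two lists are filled, each code can be paired with its description using `zip`. The sketch below uses hypothetical hard-coded lists standing in for the parser's output. It keeps a list of tuples rather than a dict, on the assumption that the same code (for example "Blank") may appear in more than one block of the full file and would otherwise be overwritten.

```python
# Hypothetical stand-ins for the lists produced by the parser.
codes = ["Blank", "1", "Blank1", "11"]
descriptions = [
    "Not reviewed, or reviewed and corrected",
    "Reviewed and confirmed as reported",
    "Not reviewed",
    "A patient had laterality coded non-specifically and extension coded specifically",
]

# Pair each code with the description at the same position.
pairs = list(zip(codes, descriptions))

for code, description in pairs:
    print(f"{code}\t{description}")
```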
Source: the Stack Overflow question "python - How to extract information from fragmented text in Python?": https://stackoverflow.com/questions/62034739/