python - Removing empty records from a file that start with specific characters

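I have a file containing the DBLP dataset, which consists of bibliographic data in computer science. I want to delete records that are missing some information. For example, I want to delete records that are missing the venue. In this dataset, the venue is marked by "#c".

In this code, I split the document by manuscript title ("#*"). Now I am trying to delete the records that have no venue name.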

Input data:

#*Toward Connectionist Parsing.

#@Steven L. Small,Garrison W. Cottrell,Lokendra Shastri

#t1982

#c

#index14997


#*A Framework for Reinforcement Learning on Real Robots.

#@William D. Smart,Leslie Pack Kaelbling

#t1998

#cAAAI/IAAI

#index14998

#*Efficient Goal-Directed Exploration.

#@Yury V. Smirnov,Sven Koenig,Manuela M. Veloso,Reid G. Simmons

#t1996

#cAAAI/IAAI, Vol. 1

#index14999

My code:

inFile = open('lorem.txt','r')
Data = inFile.read()
data = Data.split("#*")
ouFile = open('testdata.txt','w')
for idx, word in enumerate(data):
    print("i = ", idx)
    if not('#!' in data[idx]):
        del data[idx]
        idx = idx - 1
    else:
        ouFile.write("#*" + data[idx])
ouFile.close()
inFile.close()

Expected output:

#*A Framework for Reinforcement Learning on Real Robots.

#@William D. Smart,Leslie Pack Kaelbling

#t1998

#cAAAI/IAAI

#index14998

#*Efficient Goal-Directed Exploration.

#@Yury V. Smirnov,Sven Koenig,Manuela M. Veloso,Reid G. Simmons

#t1996

#cAAAI/IAAI, Vol. 1

#index14999

Actual output: an empty output file

Best answer

str.find returns the index of the substring, or -1 if the substring is not present.

DOCUMENT_SEP = '#*'

with open('lorem.txt') as in_file:
    documents = in_file.read().split(DOCUMENT_SEP)

with open('testdata.txt', 'w') as out_file:
    for document in documents:
        i = document.find('#c')
        if i < 0:  # no "#c"
            continue
        # "#c" exists, but no trailing venue information
        if not document[i+2:i+3].strip():
            continue
        out_file.write(DOCUMENT_SEP)
        out_file.write(document)
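
The two checks in the loop above can be tried in isolation on short strings. The following is only a sketch: the helper name has_venue and the sample records are made up for illustration and are not part of the original answer.

```python
def has_venue(document: str) -> bool:
    """Return True if "#c" is present and the character right after it is not whitespace."""
    i = document.find('#c')
    if i < 0:
        # str.find returned -1: there is no "#c" tag at all
        return False
    # Slicing never raises IndexError, even if "#c" is the very end of the string
    return bool(document[i + 2:i + 3].strip())


print('abc'.find('#c'))                      # -1
print('#t1982\n#c\n#index1'.find('#c'))      # 7
print(has_venue('#t1982\n#c\n#index14997\n'))       # False
print(has_venue('#t1998\n#cAAAI/IAAI\n#index1\n'))  # True
```

Note that only the single character after "#c" is inspected, so a venue written with a leading space (for example "#c AAAI") would be treated as empty by this check.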

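Instead of closing the files manually, I used the with statement.
There is no need to use an index; deleting an item in the middle of the loop complicates the index calculation.
Using a regular expression such as #c[A-Z].. would make the code simpler.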
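
A regular-expression variant might look like the sketch below. The pattern used here is an assumption and differs from the #c[A-Z].. suggested above: it accepts any non-whitespace character after "#c" on the same line. The names VENUE_RE and keep_records_with_venue and the shortened sample text are hypothetical. It builds a new list of records to keep rather than deleting from the list being iterated.

```python
import re

DOCUMENT_SEP = '#*'
# Assumed pattern: "#c" at the start of a line, optional spaces or tabs,
# then at least one non-whitespace character on that same line.
VENUE_RE = re.compile(r'^#c[ \t]*\S', re.MULTILINE)


def keep_records_with_venue(text: str) -> str:
    """Return the text with every record lacking a non-empty "#c" line removed."""
    documents = text.split(DOCUMENT_SEP)
    # Filter into a new list; nothing is deleted while iterating
    kept = [doc for doc in documents if VENUE_RE.search(doc)]
    return ''.join(DOCUMENT_SEP + doc for doc in kept)


sample = (
    "#*Toward Connectionist Parsing.\n#t1982\n#c\n#index14997\n"
    "#*A Framework for Reinforcement Learning on Real Robots.\n"
    "#t1998\n#cAAAI/IAAI\n#index14998\n"
)
result = keep_records_with_venue(sample)
print(result)
```

To apply it to files, the text read from lorem.txt could be passed to the function and the returned string written to testdata.txt.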

A similar question on Stack Overflow: https://stackoverflow.com/questions/57285784/
