I have 2 files:
hyp.txt
It is a guide to action which ensures that the military always obeys the commands of the party
he read the book because he was interested in world history
ref.txt
It is a guide to action that ensures that the military will forever heed Party commands
he was interested in world history because he read the book
I have a function that does some computation to compare the lines of the texts, e.g. line 1 of hyp.txt against line 1 of ref.txt.
def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
    """
    :type list_of_tokenized_hyp: iter(iter(str))
    :type list_of_tokenized_ref: iter(iter(str))
    """
    for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
        # do something with the iter(str)
    return score
This function cannot be changed. I can, however, control what I feed into it. Currently I am feeding the files into the function like this:
with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    # List comprehensions: every line is read and split up front.
    hyp = [line.split() for line in hypfin]
    ref = [line.split() for line in reffin]
    scorer(hyp, ref)
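But by doing this, I have already loaded the entire files, and all the split strings, into memory before feeding them into scorer().
Knowing that scorer() processes the files line by line, is there a way to avoid materializing the split strings before feeding them in, without changing the scorer() function?
Is there a way to feed in some sort of generator instead?
I have tried this: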
with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = (h.split() for h in hypfin)
    ref = (r.split() for r in reffin)
    scorer(hyp, ref)
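But I am not sure whether h.split() has already been materialized at that point. If it has, why? If it has not, why not?
If I could change the scorer() function, I could simply add these two lines after the for: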
def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
    for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
        hypline = hypline.split()
        refline = refline.split()
        # do something with the iter(str)
    return score
But that is not possible in my case, because I cannot change the function.
Best answer
Yes, your example defines two generators:
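with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    # Generator expressions: no line is read or split yet.
    hyp = (h.split() for h in hypfin)
    ref = (r.split() for r in reffin)
    # Call scorer() inside the with block, while the files are still open.
    scorer(hyp, ref)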
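The split, together with the read of the next line from each file, is performed once per iteration of the for loop inside scorer(), so only the current pair of lines is held in tokenized form at any time.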
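To see the laziness directly, here is a minimal sketch rather than the original poster's code. It assumes a hypothetical stand-in scorer() that just counts the tokens shared by each line pair, uses io.StringIO objects in place of the real files, and replaces the generator expressions with an equivalent generator function, tokenize(), so that each split can be logged. All of these names and the sample data are illustrative assumptions.

```python
import io


def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
    # Hypothetical stand-in for the real scorer: counts shared tokens per line pair.
    score = 0
    for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
        score += len(set(hypline) & set(refline))
    return score


split_log = []  # records (label, line number) at the moment each line is split


def tokenize(lines, label):
    # Generator function: the body does not run until the consumer asks for an item.
    for lineno, line in enumerate(lines, start=1):
        split_log.append((label, lineno))
        yield line.split()


# In-memory stand-ins for hyp.txt and ref.txt (assumed sample data).
hypfin = io.StringIO("a b c\nd e f\n")
reffin = io.StringIO("a b x\nd y f\n")

hyp = tokenize(hypfin, "hyp")
ref = tokenize(reffin, "ref")

log_before = list(split_log)  # snapshot taken before scorer() consumes anything
score = scorer(hyp, ref)

print(log_before)
print(split_log)
print(score)
```

The snapshot taken before the call is empty, and the log afterwards alternates between the two inputs, which shows that each line is read and split only when zip() inside scorer() asks for it.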
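Regarding "python - How to use a generator from a file for tokenization rather than materializing a list of strings?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34582587/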