python - How can I tokenize with a generator over a file instead of materializing a list of strings?

I have 2 files:

hyp.txt

It is a guide to action which ensures that the military always obeys the commands of the party
he read the book because he was interested in world history

ref.txt

It is a guide to action that ensures that the military will forever heed Party commands
he was interested in world history because he read the book

I have a function that does some calculation to compare the lines of the two texts pairwise, e.g. line 1 of hyp.txt against line 1 of ref.txt.

def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
   """
   :type list_of_tokenized_hyp: iter(iter(str))
   :type list_of_tokenized_ref: iter(iter(str))
   """   
   for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
       # do something with the iter(str)
   return score

This function cannot be changed. I can, however, control what I feed into it. So currently I'm passing the files to the function like this:

with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = [line.split() for line in hypfin]
    ref = [line.split() for line in reffin]
    scorer(hyp, ref)

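But by doing so, I have loaded both files in full, along with all the split strings, into memory before feeding them to scorer().

Knowing that scorer() processes the files line by line, is there a way to avoid materializing the data before feeding it in, without changing the scorer() function?

Is there a way to feed in some sort of generator instead?

I have tried this: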

with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = (h.split() for h in hypfin)
    ref = (r.split() for r in reffin)
    scorer(hyp, ref)

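But I'm not sure whether h.split() has already been materialized at that point. If it has, why? If not, why not?

If I could change the scorer() function, I could simply add these lines after the for: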

def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
   for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
       hypline = hypline.split()
       refline = refline.split()
       # do something with the iter(str)
   return score

But that is not possible in my case, since I cannot change the function.

Best answer

Yes, your example defines two generators:

with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = (h.split() for h in hypfin)
    ref = (r.split() for r in reffin)
    scorer(hyp, ref)

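The split, along with the corresponding read of the next line, is performed on each iteration of the for loop inside scorer(), so nothing is tokenized ahead of time.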
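To see the laziness for yourself, here is a minimal sketch, assuming Python 3 (where zip is lazy; in Python 2, zip builds a full list, so itertools.izip would be needed for the same effect). The scorer body, the tokenize helper, and the in-memory files are hypothetical stand-ins for illustration, not the original function or data. The helper behaves like the generator expression (line.split() for line in f), but records each split so the order of work is visible.

```python
import io


def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
    # Hypothetical stand-in for the real scorer: counts the distinct
    # tokens shared by each pair of lines.
    score = 0
    for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
        score += len(set(hypline) & set(refline))
    return score


def tokenize(lines, log, tag):
    # Same behaviour as (line.split() for line in lines), plus a log
    # entry each time a line is actually read and split.
    for line in lines:
        log.append(tag)
        yield line.split()


# In-memory files so the sketch runs without hyp.txt / ref.txt on disk.
hypfin = io.StringIO("he read the book\nit is a guide\n")
reffin = io.StringIO("he read a book\nit is the guide\n")

log = []
hyp = tokenize(hypfin, log, "hyp")
ref = tokenize(reffin, log, "ref")

# Creating the generators has not read or split anything yet.
assert log == []

score = scorer(hyp, ref)
print(log)    # ['hyp', 'ref', 'hyp', 'ref']
print(score)  # 6
```

The log alternates between the two files, which shows that each line is read and split only when the for loop in scorer() asks for the next pair. At most one tokenized line per file is alive at a time, rather than the whole list.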

This question and answer originally appeared on Stack Overflow: https://stackoverflow.com/questions/34582587/
