python - Count line occurrences and divide by the total number of lines - unix/python

I have a text file, test.in:

english<tab>walawala
foo bar<tab>laa war
foo bar<tab>laa war
hello world<tab>walo lorl
hello world<tab>walo lorl
foo bar<tab>laa war

The desired output is:

english<tab>walawala<tab>0.1666
foo bar<tab>laa war<tab>0.5
hello world<tab>walo lorl<tab>0.3333

The new column is the number of times the line occurs divided by the total number of lines.

Currently I am doing this:

cat test.in | uniq -c | awk '{print $2"\t"$3"\t"$1}' > test.out

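But this only gives me the line counts, not the probabilities. Also, my file is very large: about 1,000,000,000 lines, with at least 20 characters in each column.

How can I get the desired output correctly and quickly?

Is there a Pythonic solution that is just as fast?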

Best answer

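Note that uniq only counts consecutive repeated lines, so it must be preceded by sort to account for all lines in the file. As a replacement for sort | uniq -c, the following code using collections.Counter is more efficient, because it never has to sort the whole file (only the distinct lines are sorted for output):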

from collections import Counter

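with open('test.in') as inf:
    # Count every distinct line; only the distinct lines are sorted, not the whole file.
    counts = sorted(Counter(line.strip('\r\n') for line in inf).items())
    # float() keeps the division below a true division on Python 2 as well.
    total_lines = float(sum(i[1] for i in counts))
    for line, freq in counts:
        print("{}\t{:.4f}".format(line, freq / total_lines))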

This script outputs

english<tab>walawala<tab>0.1667
foo bar<tab>laa war<tab>0.5000
hello world<tab>walo lorl<tab>0.3333

for the input given in your description.
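For reuse or testing, the same idea can be wrapped in a function that accepts any iterable of lines. This is only a sketch: the name line_frequencies is hypothetical and not part of the original answer, and it assumes the distinct lines fit in memory (the Counter holds one entry per distinct line, not one per input line).

```python
from collections import Counter


def line_frequencies(lines):
    """Return sorted (line, fraction) pairs for an iterable of lines.

    Hypothetical helper for illustration; not from the original answer.
    """
    # Strip line endings so "a\n" and "a\r\n" count as the same line.
    counts = Counter(line.rstrip('\r\n') for line in lines)
    # float() keeps the division a true division on Python 2 as well.
    total = float(sum(counts.values()))
    return [(line, n / total) for line, n in sorted(counts.items())]


# The sample input from the question, with real tab characters.
sample = [
    "english\twalawala\n",
    "foo bar\tlaa war\n",
    "foo bar\tlaa war\n",
    "hello world\twalo lorl\n",
    "hello world\twalo lorl\n",
    "foo bar\tlaa war\n",
]

for line, fraction in line_frequencies(sample):
    print("{}\t{:.4f}".format(line, fraction))
```

Note that {:.4f} rounds, so 1/6 is printed as 0.1667, whereas the expected output in the question shows the truncated value 0.1666.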

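However, if you only want to merge consecutive identical lines, as uniq -c does, note that any solution using Counter produces the output shown in your question, whereas your uniq -c approach does not. The output of uniq -c would be: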

  1 english<tab>walawala
  2 foo bar<tab>laa war
  2 hello world<tab>walo lorl
  1 foo bar<tab>laa war

and not

  1 english<tab>walawala
  3 foo bar<tab>laa war
  2 hello world<tab>walo lorl

If that is the behaviour you want, you can use itertools.groupby:

from itertools import groupby

with open('test.in') as inf:
    # groupby only merges runs of identical consecutive lines, like uniq.
    grouper = groupby(line.strip('\r\n') for line in inf)
    # Count the length of each run.
    items = [(k, sum(1 for j in i)) for (k, i) in grouper]
    total_lines = float(sum(i[1] for i in items))
    for line, freq in items:
        print("{}\t{:.4f}".format(line, freq / total_lines))

The difference is that if test.in has the contents you specified, the uniq pipeline will not produce the output you gave in your example. Instead you get:

english<tab>walawala<tab>0.1667
foo bar<tab>laa war<tab>0.3333
hello world<tab>walo lorl<tab>0.3333
foo bar<tab>laa war<tab>0.1667

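Since that is not what your example shows, you probably cannot solve your problem with uniq unless you run sort first. In that case you need to fall back on my first example, and Python will definitely be faster than your Unix command line.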
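The difference between the two counting strategies can be seen side by side on the sample data. The following is a minimal sketch for illustration only; the variable names are made up here, and the lines are held in a list rather than read from test.in.

```python
from collections import Counter
from itertools import groupby

# The sample input from the question, line endings already removed.
lines = [
    "english\twalawala",
    "foo bar\tlaa war",
    "foo bar\tlaa war",
    "hello world\twalo lorl",
    "hello world\twalo lorl",
    "foo bar\tlaa war",
]

# uniq -c style: one entry per run of identical consecutive lines.
runs = [(key, sum(1 for _ in group)) for key, group in groupby(lines)]

# sort | uniq -c style: one entry per distinct line, in sorted order.
totals = sorted(Counter(lines).items())

print(runs)
print(totals)
```

With groupby, "foo bar" appears twice (a run of 2 and a run of 1), while Counter reports it once with a count of 3.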

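By the way, the Counter and groupby scripts work the same way on all Python versions newer than 2.6 (collections.Counter requires Python 2.7 or 3.1+).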

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/25156747/
