python - Counting distinct words in a text file: different results in Shell and Python

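I ran the following two scripts, one in Shell and one in Python, to count the number of unique words in a text file. However, the results are very different (123,832 in Python versus 185,948 in Shell). Could you help me understand what causes the difference, and how to make the Shell command return the same result as Python?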

The Python code is as follows:

def count_vocab(text):

    # Normalize the text and get the vocabulary size
    tokens = list(set(text.lower().split()))

    # Remove all tokens that are not alphabetic
    words = [word for word in tokens if word.isalpha()]

    vocab_size = len(words)

    return vocab_size

I followed the answer here to run the command in Shell.

tr -cd "[:alpha:][:space:]-'" < <text_file> \
| tr ' [:upper:]' '\n[:lower:]' \
| tr -s '\n' \
| sed "s/^['-]*//;s/['-]$//" \
| sort \
| uniq -c \
| wc -l > <num_words.txt>

I also tried the following two variants, but their results are still far from the Python result.

tr ' [:upper:]' '\n[:lower:]' < <text_file> \
| tr -s '\n' \
| tr -cd "[:alpha:]\n'" \
| sort \
| uniq -c \
| wc -l > <num_words.txt>

tr -cd "[:alpha:][:space:]\n'" < <text_file> \
| tr ' [:upper:]' '\n[:lower:]' \
| tr -s '\n' \
| sort \
| uniq -c \
| wc -l > <num_words.txt>

Thank you very much for your help!

Best answer

The problem in the shell script (assuming you want it to behave like the Python code) lies in the first command of the pipeline you gave.

Consider the input

apple cherry bone0 cherry

In the step that removes words containing non-alphabetic characters, the Python function turns this into

apple cherry cherry

whereas your shell script simply produces

apple cherry bone cherry

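This is because the first line of the shell script only deletes the digits (judging from a quick standalone test of it), so bone0 survives as bone instead of being dropped. Instead, you want the first line to be something like grep -wo -E '[a-zA-Z]+', which rejects any word that does not match the regular expression as a whole (that is, any word containing something other than letters).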
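
To make the difference concrete, here is a minimal Python sketch of the two behaviors on the sample input. The helper names are hypothetical: delete_then_count only imitates the character-deleting tr -cd step (it ignores the apostrophe and hyphen handling of the original pipeline), while filter_then_count follows the same logic as the Python function in the question.

```python
def delete_then_count(text):
    # Imitates `tr -cd "[:alpha:][:space:]"`: non-letters are deleted,
    # so "bone0" survives as "bone".
    # Assumption: apostrophes and hyphens are ignored in this sketch.
    kept = "".join(ch for ch in text.lower() if ch.isalpha() or ch.isspace())
    return len(set(kept.split()))


def filter_then_count(text):
    # Same logic as count_vocab in the question: any token containing a
    # non-letter is dropped entirely.
    return len({tok for tok in text.lower().split() if tok.isalpha()})


sample = "apple cherry bone0 cherry"
print(delete_then_count(sample))  # 3: apple, cherry, bone
print(filter_then_count(sample))  # 2: apple, cherry
```

Every token that mixes letters with digits or punctuation adds a spurious "word" in the first variant, which is consistent with the Shell count being the larger of the two.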

Also, credit where it is due: I got the fix from here.

So the fixed shell script is (in a nice function form)

function count_vocab() {
    grep -wo -E '[a-zA-Z]+' |
        tr ' [:upper:]' '\n[:lower:]' |
        tr -s '\n' |
        sed "s/^['-]*//;s/['-]$//" |
        sort |
        uniq -c |
        wc -l
}

Call it like this (after you have defined the function)

count_vocab < INPUT_TEXT_FILE > COUNT_FILE
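
As a sanity check, the grep-based pipeline can be approximated in Python and compared with the function from the question. This is only a sketch: the regular expression below is an approximation of grep -wo -E '[a-zA-Z]+' for ASCII text, and count_vocab_grep_like is a hypothetical name. The two counts can still differ when a word only ever appears attached to punctuation (for example cherry,), because grep -w treats punctuation as a word boundary while str.isalpha() rejects the whole token.

```python
import re

# Approximation of `grep -wo -E '[a-zA-Z]+'`: a run of ASCII letters that is
# not adjacent to another letter, digit, or underscore.
WORD = re.compile(r"(?<![A-Za-z0-9_])[A-Za-z]+(?![A-Za-z0-9_])")


def count_vocab_grep_like(text):
    # Lowercase each match and count the distinct ones.
    return len({match.lower() for match in WORD.findall(text)})


def count_vocab(text):
    # Same logic as the Python function in the question.
    return len({tok for tok in text.lower().split() if tok.isalpha()})


print(count_vocab_grep_like("apple cherry bone0 cherry"))  # 2
print(count_vocab("apple cherry bone0 cherry"))            # 2
print(count_vocab_grep_like("apple, cherry bone0"))        # 2: apple, cherry
print(count_vocab("apple, cherry bone0"))                  # 1: cherry only
```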

Regarding "python - Counting distinct words in a text file: different results in Shell and Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57366801/
