I want to count the frequencies of the three words before and after a specific word in a text file that has already been tokenized.
import nltk
from collections import Counter
from nltk.util import ngrams

with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
    text_data = textfile.read().replace('\n', ' ').lower()

tokens = nltk.word_tokenize(text_data)
text = nltk.Text(tokens)
grams = ngrams(tokens, 4)
freq = Counter(grams)
freq.most_common(20)
I don't know how to filter on the string "dracula" as the target word. I have also tried:
text.collocations(num=100)
text.concordance('dracula')
The desired output would look like this, with counts: three words before "dracula", sorted by count
(('and', 'he', 'saw', 'dracula'), 4),
(('one', 'cannot', 'see', 'dracula'), 2)
Three words after "dracula", sorted by count
(('dracula', 'and', 'he', 'saw'), 4),
(('dracula', 'one', 'cannot', 'see'), 2)
Trigrams containing "dracula" in the middle, sorted by count
(('count', 'dracula', 'saw'), 4),
(('count', 'dracula', 'cannot'), 2)
Thanks in advance for your help.

Best answer
Once you have the frequency information in tuple form (as you already do), you can simply filter for the word you're looking for with an if statement. This uses Python's list comprehension syntax:
import nltk
from collections import Counter
from nltk.util import ngrams

# pulled text from here: https://archive.org/details/draculabr00stokuoft/page/n6
with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
    text_data = textfile.read().replace('\n', ' ').lower()

tokens = nltk.word_tokenize(text_data)
text = nltk.Text(tokens)
grams = ngrams(tokens, 4)
freq = Counter(grams)

# keep only the 4-grams with 'dracula' in a given position
dracula_last = [item for item in freq.most_common() if item[0][3] == 'dracula']
dracula_first = [item for item in freq.most_common() if item[0][0] == 'dracula']
dracula_second = [item for item in freq.most_common() if item[0][1] == 'dracula']
# etc.
This produces lists of 4-grams containing "dracula" in different positions. dracula_last looks like this:
[(('the', 'castle', 'of', 'dracula'), 3),
(("'s", 'journal', '243', 'dracula'), 1),
(('carpathian', 'moun-', '2', 'dracula'), 1),
(('of', 'the', 'castle', 'dracula'), 1),
(('named', 'by', 'count', 'dracula'), 1),
(('disease', '.', 'count', 'dracula'), 1),
...]
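The same filtering idea covers the trigram case from the question: build 3-grams instead of 4-grams and test index 1 for the middle position. A minimal sketch, using an inline sample string in place of dracula.txt so it runs on its own:

```python
from collections import Counter
from nltk.util import ngrams

# stand-in for the tokenized novel text (hypothetical sample)
tokens = ("count dracula saw the castle "
          "count dracula saw the sea "
          "count dracula cannot sleep").lower().split()

# trigrams whose middle token is 'dracula', sorted by count
trigrams = Counter(ngrams(tokens, 3))
dracula_middle = [item for item in trigrams.most_common()
                  if item[0][1] == 'dracula']
print(dracula_middle)
# [(('count', 'dracula', 'saw'), 2), (('count', 'dracula', 'cannot'), 1)]
```

For the real file, keep `nltk.word_tokenize(text_data)` as in the answer above; `split()` here just avoids needing the punkt tokenizer data.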
Regarding python - counting ngram word frequencies with text collocations, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54471926/