python-2.7 - How to find the most frequent words before and after a given word in a given text in Python?

Tags: python-2.7 nlp nltk n-gram

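I have a large text, and I am trying to get the words that occur most frequently before and after a given word in it.

For example:

I would like to know which words occur most frequently after "lake". Ideally I would get something like: (word 1, # occurrences), (word 2, # occurrences), ...

The same goes for the words that come before it...

I tried NLTK bigrams, but they seem to find only the most common n-grams overall... Is it possible to fix one of the words and find the most common n-grams that contain that fixed word?

Thanks for your help!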

Best answer

Are you looking for something like this?

text = """
A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
Etymology, meaning, and usage of "lake"[edit]
Oeschinen Lake in the Swiss Alps
Lake Tahoe on the border of California and Nevada
The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
it partially or totally fills one or several basins connected by straits[4]
has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
it does not have regular intrusion of sea water[4]
a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
""".split()

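from collections import Counter
from nltk import bigrams

# Pair every token with its successor, then keep the pairs that start with 'lake'.
bgs = bigrams(text)
lake_bgs = filter(lambda item: item[0] == 'lake', bgs)

# Count the second element of each pair, i.e. the word that follows 'lake'.
c = Counter(map(lambda item: item[1], lake_bgs))
print c.most_common()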

This outputs:

[('is', 4), ('("lake,', 1), ('or', 1), ('comes', 1), ('are', 1)]

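Note that if your text is very long, you may want to use ifilter, imap, etc. from itertools, so that the intermediate lists are never built in memory.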
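As a rough illustration of that lazy approach, here is a minimal sketch that counts the followers of a word from any iterable of tokens without materializing the list of pairs. It uses only the standard library instead of NLTK, and the names `pairs`, `following_counts`, and `sample` are made up for this example. The `izip` fallback is an assumption to keep it lazy on Python 2 as well as Python 3.

```python
from collections import Counter
from itertools import tee

try:
    from itertools import izip as zip  # Python 2: lazy version of zip
except ImportError:
    pass  # Python 3: the built-in zip is already lazy


def pairs(tokens):
    """Lazily yield (token, next_token) pairs from any iterable."""
    first, second = tee(tokens)
    next(second, None)  # shift the second iterator one token ahead
    return zip(first, second)


def following_counts(tokens, target):
    """Count the words that directly follow `target`."""
    return Counter(nxt for cur, nxt in pairs(tokens) if cur == target)


sample = "the lake is calm and the lake is deep but a lake can freeze".split()
print(following_counts(iter(sample), "lake").most_common())
# prints [('is', 2), ('can', 1)]
```

Because the generator expression feeds `Counter` one pair at a time, memory use depends on the number of distinct following words rather than on the length of the text.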

Edit: here is the code for the words both before and after 'lake'.

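from collections import Counter
from nltk import trigrams

# Keep the trigrams whose middle token is 'lake'.
tgs = trigrams(text)
lake_tgs = filter(lambda item: item[1] == 'lake', tgs)

# In Python 2, filter returns a list, so it can be traversed twice.
before_lake = map(lambda item: item[0], lake_tgs)
after_lake = map(lambda item: item[2], lake_tgs)

# Neighbours on both sides are counted together in a single Counter.
c = Counter(before_lake + after_lake)
print c.most_common()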

Note that this can also be done with bigrams :)
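A possible shape for the bigram variant is sketched below. It is not the answerer's code: the helper name `neighbour_counts` and the sample sentence are invented for illustration, and `zip(tokens, tokens[1:])` stands in for `nltk.bigrams` so the snippet runs without NLTK (both yield the same consecutive pairs). Unlike the combined Counter above, this sketch keeps the "before" and "after" counts separate.

```python
from collections import Counter


def neighbour_counts(tokens, target):
    """Return (before, after) Counters for the words adjacent to `target`."""
    before, after = Counter(), Counter()
    # Each consecutive pair is inspected once for both roles of `target`.
    for left, right in zip(tokens, tokens[1:]):
        if right == target:
            before[left] += 1   # `left` precedes the target
        if left == target:
            after[right] += 1   # `right` follows the target
    return before, after


sample = "a lake is a body of water and the lake is deep".split()
before, after = neighbour_counts(sample, "lake")
print(before.most_common())
print(after.most_common())
```

One practical difference from the trigram version: a trigram with 'lake' in the middle cannot exist when 'lake' is the very first or very last token, whereas the bigram loop still records the one neighbour that is present.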

Original question on Stack Overflow: https://stackoverflow.com/questions/20020506/
