python - Impression-based N-gram analysis in Python

Tags: python text text-manipulation



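My goal is to find the number of impressions associated with one-word, two-word, three-word, four-word, five-word, and six-word phrases. I used to run an N-gram algorithm, but it only returns counts. Here is my current n-gram code.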

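import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def find_ngrams(text, n):
    word_vectorizer = CountVectorizer(ngram_range=(n,n), analyzer='word')
    sparse_matrix = word_vectorizer.fit_transform(text)
    frequencies = sum(sparse_matrix).toarray()[0]
    # The right-hand side of this assignment was lost in the source. A DataFrame with a
    # 'frequency' column indexed by n-gram is assumed, since the next line sorts by it.
    ngram = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])
    ngram = ngram.sort_values(by=['frequency'], ascending=[False])
    return ngram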

one = find_ngrams(df['query'],1)
bi = find_ngrams(df['query'],2)
tri = find_ngrams(df['query'],3)
quad = find_ngrams(df['query'],4)
pent = find_ngrams(df['query'],5)
hexx = find_ngrams(df['query'],6)

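I think what I need to do is:
1. Split each query into one- to six-word n-grams.
2. Attach the query's impressions to each of its n-grams.
3. Regroup all the n-grams and sum the impressions.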


For example, the query "dog common diseases and how to treat them" splits into:
(1) 1-gram: dog, common, diseases, and, how, to, treat, them;
(2) 2-gram: dog common, common diseases, diseases and, and how, how to, to treat, treat them;
(3) 3-gram: dog common diseases, common diseases and, diseases and how, and how to, how to treat, to treat them;
(4) 4-gram: dog common diseases and, common diseases and how, diseases and how to, and how to treat, how to treat them;
(5) 5-gram: dog common diseases and how, common diseases and how to, diseases and how to treat, and how to treat them;
(6) 6-gram: dog common diseases and how to, common diseases and how to treat, diseases and how to treat them;
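
The three steps listed above can be sketched in plain Python roughly as follows. This is only a minimal illustration, not the asker's code: the helper names word_ngrams and impressions_by_ngram, the (query, impressions) row layout, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict

def word_ngrams(query, n):
    """Return the contiguous n-word sequences in a query, in order."""
    words = query.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def impressions_by_ngram(rows, max_n=6):
    """Sum impressions per (n, n-gram) for n = 1..max_n.

    rows is an iterable of (query, impressions) pairs; this layout is an
    assumption made for the sketch.
    """
    totals = defaultdict(int)
    for query, impressions in rows:
        for n in range(1, max_n + 1):
            for gram in word_ngrams(query, n):    # step 1: split
                totals[(n, gram)] += impressions  # steps 2 and 3: attach and sum
    return dict(totals)

# Hypothetical data, for illustration only.
rows = [("dog common diseases", 500), ("dog diseases", 120)]
totals = impressions_by_ngram(rows)
print(totals[(1, "dog")])         # 620
print(totals[(2, "dog common")])  # 500
```

In this sketch an n-gram that occurs twice within one query receives that query's impressions twice; deduplicating with set(word_ngrams(query, n)) would count it once per query instead.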


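Here is one way! It is not the most efficient, but let's not optimize prematurely. The idea is to use apply to get a new pd.DataFrame with one column per n-gram, join it with the original dataframe, and then do some stacking and grouping.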

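import pandas as pd

df = pd.DataFrame({
    "squery": ["how to feed a dog", "dog habits", "to cat or not to cat", "dog owners"],
    "count": [1000, 200, 100, 150]
})

def n_grams(txt):
    # Collect every contiguous word sequence (all n-gram lengths) in the query.
    grams = list()
    words = txt.split(' ')
    for i in range(len(words)):
        for k in range(1, len(words) - i + 1):
            grams.append(" ".join(words[i:i+k]))
    return pd.Series(grams)

# One row per query, one column per n-gram position, joined back to the counts.
counts = df.squery.apply(n_grams).join(df)

# Reshape to long form (one row per n-gram occurrence), then sum the counts per n-gram.
# The rename/dropna/reset_index step is a reconstruction; the chain as scraped fails without it.
counts.drop("squery", axis=1).set_index("count").unstack()\
    .rename("ngram").dropna().reset_index()\
    .drop("level_0", axis=1).groupby("ngram")["count"].sum()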

The last expression returns a pd.Series like the following.

a                       1000
a dog                   1000
cat                      200
cat or                   100
cat or not               100
cat or not to            100
cat or not to cat        100
dog                     1350
dog habits               200
dog owners               150
feed                    1000
feed a                  1000
feed a dog              1000
habits                   200
how                     1000
how to                  1000
how to feed             1000
how to feed a           1000
how to feed a dog       1000
not                      100
not to                   100
not to cat               100
or                       100
or not                   100
or not to                100
or not to cat            100
owners                   150
to                      1200
to cat                   200
to cat or                100
to cat or not            100
to cat or not to         100
to cat or not to cat     100
to feed                 1000
to feed a               1000
to feed a dog           1000
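
As a possible variation on the approach above, the same long-form reshape can be sketched with DataFrame.explode (available in pandas 0.25 and later). This is not the answerer's code: n_grams here returns a plain list rather than a pd.Series, and the names long_form and totals are introduced only for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "squery": ["how to feed a dog", "dog habits", "to cat or not to cat", "dog owners"],
    "count": [1000, 200, 100, 150],
})

def n_grams(txt):
    """Return every contiguous word sequence in txt as a plain list."""
    words = txt.split(" ")
    return [" ".join(words[i:i + k])
            for i in range(len(words))
            for k in range(1, len(words) - i + 1)]

long_form = (
    df.assign(ngram=df["squery"].map(n_grams))  # one list of n-grams per query
      .explode("ngram")                          # one row per n-gram occurrence
)
totals = long_form.groupby("ngram")["count"].sum()
print(totals["dog"])  # 1350
print(totals["to"])   # 1200
```

Because explode keeps one row per occurrence, "to" in "to cat or not to cat" contributes its 100 impressions twice, matching the totals listed above.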


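This one is probably a bit more efficient, but it still materializes the dense n-gram matrix from CountVectorizer. It multiplies each query's row of n-gram counts by that query's impressions, then sums each column to get the total impressions per n-gram. For n-grams of up to five words it gives the same result as above, except that CountVectorizer's default token pattern drops single-character words such as "a", so n-grams around those words differ. One thing to note is that a query containing a repeated n-gram is also counted twice for that n-gram.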

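import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Count 1- to 5-word n-grams per query (rows = queries, columns = n-grams).
cv = CountVectorizer(ngram_range=(1, 5))
ngrams = cv.fit_transform(df.squery)
# Repeat each query's impression count across all n-gram columns.
mask = np.repeat(df['count'].values.reshape(-1, 1), repeats = len(cv.vocabulary_), axis = 1)
# N-gram labels ordered by their column index in the matrix.
index = list(map(lambda x: x[0], sorted(cv.vocabulary_.items(), key = lambda x: x[1])))
# Weight the dense counts by impressions and sum each column.
pd.Series(np.multiply(mask, ngrams.toarray()).sum(axis = 0), name = "counts", index = index)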

Regarding "python - Impression-based N-gram analysis in Python", a similar question can be found on Stack Overflow.

