python - TfidfVectorizer - Get features with weights from transform

Suppose I fit the vectorizer on a single document:

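from sklearn.feature_extraction.text import TfidfVectorizer

text = "bla agao haa"
# my_tokenizer is a custom preprocessing function (its definition is not shown)
singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              preprocessor=my_tokenizer, max_features=100).fit([text])

single = singleTFIDF.transform([text])
query = singleTFIDF.transform(["new coming document"])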

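If I understand correctly, transform simply applies the weights learned during fit. So for a new document, query contains a weight for each feature found in that document. It looks like [[0, 0, 0.13, 0.4, 0]].

Since I am using n-grams, I would also like to get the features of this new document, so that I know the weight of each feature in the new document.

Edit:

In my case, I get the following arrays for single and query: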

single
[[0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
  0.10721125 0.10721125 0.10721125]]
query
[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.57735027 0.57735027 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.57735027 0.         0.
  0.         0.         0.        ]]

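But this is strange, because in the learned corpus (single) every feature has a weight of 0.10721125. So how can a feature of the new document have a weight of 0.57735027?

Best answer

For details on how scikit-learn computes tf-idf, see the scikit-learn documentation. Below is an example using word n-grams.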

import numpy as np  # used for the normalisation check in the output
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Train the vectorizer
text = "this is a simple example"
singleTFIDF = TfidfVectorizer(ngram_range=(1, 2)).fit([text])
singleTFIDF.vocabulary_  # maps each term to its column index in the matrix

# Analyse the training string - text
single = singleTFIDF.transform([text])
single.toarray()  # displays the resulting matrix - all values are equal because every term occurs once with the same idf

# Analyse three new strings with the trained vectorizer
doc_1 = ['is this example working', 'hopefully it is a good example', 'no matching words here']

query = singleTFIDF.transform(doc_1)
query.toarray()  # displays the resulting matrix - only matched terms have non-zero values

# Compute the cosine similarity between text and doc_1 - the second string has only two matching terms, therefore it has a lower similarity value
# cosine_similarity accepts sparse input, so no dense conversion is needed
cos_similarity = cosine_similarity(single, query)

Output:

singleTFIDF.vocabulary_ 
Out[297]: 
{'this': 5,
 'is': 1,
 'simple': 3,
 'example': 0,
 'this is': 6,
 'is simple': 2,
 'simple example': 4}

single.toarray()
Out[299]: 
array([[0.37796447, 0.37796447, 0.37796447, 0.37796447, 0.37796447,
        0.37796447, 0.37796447]])

query.toarray()
Out[311]: 
array([[0.57735027, 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        ],
       [0.70710678, 0.70710678, 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ]])

np.sum(np.square(query.toarray()), axis=1) # note how all rows with non-zero scores have been normalised to 1.
Out[3]: array([1., 1., 0.])
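
The normalisation noted above also explains the numbers in the question. The following is a minimal sketch, assuming scikit-learn's default settings (smooth_idf=True, norm='l2'); the helper l2_normalise is a hypothetical name introduced only for illustration. With a single training document, every term gets the same idf, so a row's weights depend only on how many known terms the document contains.

```python
import math

def l2_normalise(values):
    """Scale a vector to unit Euclidean length (all-zero vectors stay zero)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm if norm else 0.0 for v in values]

# With one training document, every vocabulary term has df = 1 and n = 1,
# so the smoothed idf ln((1 + n) / (1 + df)) + 1 equals 1 for all terms.
n_docs, doc_freq = 1, 1
idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Training text: 7 vocabulary terms, each occurring once.
training_row = l2_normalise([1 * idf] * 7)
# First query string: 3 of the 7 terms match, each occurring once.
query_row = l2_normalise([1 * idf] * 3 + [0.0] * 4)

print(round(training_row[0], 8))     # 0.37796447, i.e. 1 / sqrt(7)
print(round(query_row[0], 8))        # 0.57735027, i.e. 1 / sqrt(3)
print(round(1 / math.sqrt(87), 8))   # 0.10721125, i.e. 87 equally weighted features
```

The arrays in the question have 87 columns, which is consistent with 0.10721125 being 1 / sqrt(87), assuming each n-gram occurs once in the training text. The new document matches only three of them, so each gets 1 / sqrt(3) = 0.57735027. The weights are therefore not copied from fit; each row is renormalised on its own.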

cos_similarity
Out[313]: array([[0.65465367, 0.53452248, 0.        ]])

The original question, "python - TfidfVectorizer - Get features with weights from transform", can be found on Stack Overflow: https://stackoverflow.com/questions/54314538/
