Gensim LDA topic assignment

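I want to use LDA to assign each document to a single topic. I realize that what LDA gives you is a distribution over topics. However, as you can see from the last line of the code below, I assign each document to its most probable topic.

My question is this: I have to run lda[corpus] a second time to get these topics. Is there some other built-in gensim function that gives me this topic assignment vector directly? Since the LDA algorithm has already passed over the documents, it might have saved these topic assignments.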

    from gensim import corpora, models

    # stem, STOPWORDS and cleanDF are defined elsewhere (not shown in the question)

    # Get the Dictionary and BoW of the corpus after some stemming/cleansing
    texts = [[stem(word) for word in document.split() if word not in STOPWORDS] for document in cleanDF.text.values]
    dictionary = corpora.Dictionary(texts)
    dictionary.filter_extremes(no_below=5, no_above=0.9)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # The actual LDA component
    lda = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=30, chunksize=10000, passes=10, workers=4)

    # Assign each document to its most probable topic, kept as a (topic_id, probability) pair
    lda_topic_assignment = [max(p, key=lambda item: item[1]) for p in lda[corpus]]

Best answer

No, there is no other built-in Gensim function that gives the topic assignment vector directly.

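Your question is valid: the LDA algorithm has indeed passed over the documents. However, the implementation updates the model in chunks (sized by the chunksize parameter), so it never holds the whole corpus in memory and does not keep the per-document topic assignments.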
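To make the chunk-wise behaviour concrete, here is a simplified sketch of how a streamed corpus can be consumed in chunks. It is an illustration only, not Gensim's actual training code; the helper name iter_chunks and the toy corpus are made up for this example.

```python
from itertools import islice


def iter_chunks(corpus, chunksize):
    """Yield successive lists of at most `chunksize` documents from any iterable."""
    it = iter(corpus)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


# Toy stand-in for a streamed bag-of-words corpus: 7 one-word documents.
toy_corpus = [[(i, 1)] for i in range(7)]

chunk_sizes = []
for chunk in iter_chunks(toy_corpus, chunksize=3):
    # A trainer working this way would infer per-document topic weights here,
    # fold them into model-level statistics, and then drop them with the chunk.
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # [3, 3, 1]
```

Only one chunk is alive at a time, which is why nothing per-document is left over once training ends.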

You therefore have to use lda[corpus] or the method lda.get_document_topics().
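Either call yields, for each document, a list of (topic_id, probability) pairs. A minimal sketch of turning that output into a plain assignment vector follows. The helper names dominant_topic and assign_topics are hypothetical, and the sample data is hand-written to stand in for model output, so the snippet runs without a trained model.

```python
def dominant_topic(doc_topics):
    """Return the topic id with the highest probability for one document.

    `doc_topics` is a list of (topic_id, probability) pairs. It can be empty
    if every topic fell below the model's minimum-probability threshold.
    """
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])[0]


def assign_topics(doc_topic_stream):
    """Make one pass over per-document distributions and keep only the ids."""
    return [dominant_topic(doc_topics) for doc_topics in doc_topic_stream]


# Hand-written sample distributions standing in for model output.
sample = [
    [(0, 0.10), (3, 0.85), (7, 0.05)],
    [(2, 0.60), (5, 0.40)],
    [],
]
assignments = assign_topics(sample)
print(assignments)  # [3, 2, None]
```

With the trained model from the question, something like assign_topics(lda.get_document_topics(corpus)) should give one topic id per document. Storing that list means inference over the corpus only has to run once after training.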

Source: a similar question about Gensim LDA topic assignment on Stack Overflow: https://stackoverflow.com/questions/39969919/
