python - Using Latent Dirichlet Allocation with Gensim

Tags: python lda gensim

I'm working on a project where I want to use Latent Dirichlet Allocation to extract topics from a large number of articles.

My code looks like this:

import gensim
import csv
import json
import glob
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from time import gmtime, strftime

tokenizer = RegexpTokenizer(r'\w+')
cachedStopWords = set(stopwords.words("english"))
body = []
processed = []

with open('/…/file.json') as j:
    data = json.load(j)

for i in range(0,len(data)):
    body.append(data[i]['text'].lower())

for entry in body:
    row = tokenizer.tokenize(entry)
    processed.append([word for word in row if word not in cachedStopWords])

dictionary = corpora.Dictionary(processed)
corpus = [dictionary.doc2bow(text) for text in processed]
lda = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50, update_every=1, passes=1)
topics = lda.show_topics(num_topics=50, num_words=8)

other_doc = "After being jailed for life in 1964, Nelson Mandela became a worldwide symbol of resistance to apartheid. But his opposition to racism began many years before."
print lda[other_doc]

Running the last line raises this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 714, in __getitem__
    gamma, _ = self.inference([bow])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 361, in inference
    ids = [id for id, _ in doc]
ValueError: need more than 1 value to unpack

I also tried using LdaMulticore in three different ways:

lda = gensim.models.LdaMulticore(corpus, id2word=dictionary, num_topics=100, workers=3)
lda = gensim.models.ldamodel.LdaMulticore(corpus, id2word=dictionary, num_topics=100, workers=3)
lda = models.LdaMulticore(corpus, id2word=dictionary, num_topics=100, workers=3)

Each time I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'LdaMulticore'

Any ideas?

Thanks in advance.

Best answer

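You have to convert the new document into the same vector space the model was trained on, that is, a bag-of-words vector built with the training dictionary. Indexing the model with a raw string makes gensim iterate over single characters, which is why the unpacking in inference() fails.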

http://radimrehurek.com/gensim/tut3.html#similarity-interface
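To see why the raw string triggers the ValueError, here is a minimal pure-Python sketch that imitates the list comprehension shown in the traceback. The helper name `extract_ids` and the sample inputs are made up for illustration; this is not gensim's actual code, only the same unpacking pattern.

```python
# Imitates the line from the traceback: ids = [id for id, _ in doc]
# (extract_ids is a hypothetical helper, not part of gensim).
def extract_ids(doc):
    """Return the token ids from a list of (token_id, count) pairs."""
    return [token_id for token_id, _ in doc]

# A proper bag-of-words vector unpacks cleanly into (id, count).
bow = [(0, 1), (3, 2)]
ids_ok = extract_ids(bow)

# Iterating a string yields one-character strings, which cannot be
# unpacked into two names, so a ValueError is raised.
try:
    extract_ids("After being jailed")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(ids_ok)
print(error_message)
```

The exact wording of the error message differs between Python 2 and Python 3, but the cause is the same in both.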

vec_bow = dictionary.doc2bow(other_doc.lower().split())  # bag-of-words vector from the training dictionary
vec_lda = lda[vec_bow]  # convert the query to LDA topic space

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/26977042/