python - How to predict test data with Gensim topic modeling

Tags: python jupyter-notebook gensim topic-modeling mallet

I have built a topic model with Gensim's LdaMallet. How can I take a sample paragraph and use the pre-trained model to get its topics?

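import gensim
from gensim import corpora
from gensim.models import CoherenceModel

# Build the bigram model (t_preprocess and dataset are not shown in the question)
bigram = gensim.models.Phrases(t_preprocess(dataset.data), min_count=5, threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)

def make_bigrams(texts):
    # Apply the trained bigram model to every tokenized document
    return [bigram_mod[doc] for doc in texts]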

data_words_bigrams = make_bigrams(t_preprocess(dataset.data))

# Create Dictionary
id2word = corpora.Dictionary(data_words_bigrams)

# Create Corpus
texts = data_words_bigrams

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

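# Path to the local Mallet binary; gensim.models.wrappers exists in Gensim 3.x only (removed in 4.0)
mallet_path = '/home/riteshjain/anaconda3/mallet/mallet2.0.8/bin/mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=12, id2word=id2word, random_seed=0)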

coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=texts, dictionary=id2word, coherence='c_v')

a = "When Honda builds a hybrid, you've got to be sure it’s a marvel. And an Accord Hybrid is when technology surpasses the known and takes a leap of faith into tomorrow. This is the next generation Accord, the ninth generation to be precise."

How can I use this text (a) to get its topics from the pre-trained model? Please help.

Best answer

You need to process "a" the same way you processed the training set:

import gensim
import pandas as pd

# import a new data set to be passed through the pre-trained LDA

data_new = pd.read_csv('YourNew.csv', encoding="ISO-8859-1")
data_new = data_new.dropna()
# .copy() avoids pandas' SettingWithCopyWarning when the index column is added
data_text_new = data_new[['Your Target Column']].copy()
data_text_new['index'] = data_text_new.index

documents_new = data_text_new

# process the new data set through the lemmatization and stopword functions
# (substitute the exact preprocessing used for training, e.g. t_preprocess and bigram_mod from the question)

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            # lemmatize_stemming is a helper that is not shown in this answer
            result.append(lemmatize_stemming(token))
    return result

processed_docs_new = documents_new['Your Target Column'].map(preprocess)

# define the bow_corpus with the TRAINING dictionary (id2word), not a new one:
# a dictionary built from the new data would assign different ids to the same words,
# and the ids would no longer match the vocabulary the model was trained on
bow_corpus_new = [id2word.doc2bow(doc) for doc in processed_docs_new]
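
For a single text such as the "a" from the question, the same steps can be sketched as one small helper. This is a minimal sketch and not part of the original answer: infer_topics and its tokenize parameter are hypothetical names, and the sketch assumes that phraser, dictionary, and model are the bigram_mod, id2word, and ldamallet objects from training, and that model[bow] returns (topic_id, weight) pairs for one document, as LdaMallet does in Gensim 3.x.

```python
def infer_topics(text, model, dictionary, phraser, tokenize):
    """Return (topic_id, weight) pairs for one raw text, highest weight first.

    All arguments are assumed to come from the training run:
    model      -- the trained topic model (e.g. ldamallet)
    dictionary -- the training dictionary (e.g. id2word)
    phraser    -- the trained bigram model (e.g. bigram_mod)
    tokenize   -- the same tokenizer that was used for the training texts
    """
    tokens = tokenize(text)            # raw text -> list of tokens
    tokens = phraser[tokens]           # merge frequent word pairs, as in training
    bow = dictionary.doc2bow(tokens)   # tokens -> (word_id, count) with training ids
    topics = model[bow]                # (topic_id, weight) pairs for this document
    return sorted(topics, key=lambda pair: pair[1], reverse=True)
```

Assuming gensim.utils.simple_preprocess matches the tokenization used in training (the question's t_preprocess is not shown), a call could look like infer_topics(a, ldamallet, id2word, bigram_mod, gensim.utils.simple_preprocess). The first pair in the result is then the most likely topic for the text.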

Then you can pass bow_corpus_new through the trained model:

# doc_topics[i] is a list of (topic_id, weight) pairs for document i
doc_topics = ldamallet[bow_corpus_new]
b = data_text_new

topic_0 = []
topic_1 = []
topic_2 = []

# collect the weights of the first three topics only
for i in doc_topics:
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])

d = {'Your Target Column': b['Your Target Column'].tolist(),
     'topic_0': topic_0,
     'topic_1': topic_1,
     'topic_2': topic_2}

df = pd.DataFrame(data=d)
# mode='a' appends to YourAllocated.csv if the file already exists
df.to_csv("YourAllocated.csv", index=True, mode='a')
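
The loop above hard-codes three topics, while the model in the question has 12. A more general variant might look like the following. This is only a sketch under assumptions: topic_rows and the column names text and dominant_topic are hypothetical, and the input is assumed to be one list of (topic_id, weight) pairs per document, which is what ldamallet[bow_corpus_new] yields.

```python
def topic_rows(texts, doc_topics):
    """Build one dict per document with a column for every topic weight."""
    rows = []
    for text, topics in zip(texts, doc_topics):
        row = {"text": text}
        for topic_id, weight in topics:
            row[f"topic_{topic_id}"] = weight
        # The topic with the largest weight; None if the model returned nothing
        row["dominant_topic"] = (
            max(topics, key=lambda pair: pair[1])[0] if topics else None
        )
        rows.append(row)
    return rows
```

The resulting list can be passed straight to pd.DataFrame(rows) and written out with to_csv as before.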

I hope this helps :)

Regarding python - How to predict test data with Gensim topic modeling, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55789477/
