topic-modeling - Mallet topic model - inconsistent results with a serialized file

Tags: topic-modeling mallet

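I trained a topic model with Mallet and want to serialize it for later use. I ran it on two test documents, then deserialized it and ran the loaded model on the same documents, and the results were completely different.

Is there anything wrong with the way I save and load it (code attached)?

Thanks!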

List<Pipe> pipeList = initPipeList();
// Begin by importing documents from text to feature sequences

InstanceList instances = new InstanceList(new SerialPipes(pipeList));

for (String document : documents) {
    Instance inst = new Instance(document, "","","");
    instances.addThruPipe(inst);
}

ParallelTopicModel model = new ParallelTopicModel(numTopics, alpha_t * numTopics, beta_w);
model.addInstances(instances);
model.setNumThreads(numThreads);
model.setNumIterations(numIterations);
model.estimate();

printProbabilities(model, "doc 1"); // I replaced the contents of the docs due to copyright issues
printProbabilities(model, "doc 2");

model.write(new File("model.bin"));
model = ParallelTopicModel.read(new File("model.bin"));

printProbabilities(model, "doc 1");
printProbabilities(model, "doc 2");

Definition of printProbabilities():

public void printProbabilities(ParallelTopicModel model, String doc) {

    List<Pipe> pipeList = initPipeList();

    InstanceList instances = new InstanceList(new SerialPipes(pipeList));
    instances.addThruPipe(new Instance(doc, "", "", ""));

    double[] probabilities = model.getInferencer().getSampledDistribution(instances.get(0), 10, 1, 5);

    for (int i = 0; i < probabilities.length; i++) {
        double probability = probabilities[i];
        if (probability > 0.01) {
            System.out.println("Topic " + i + ", probability: " + probability);
        }
    }
}

Best answer

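You must use the same pipe for training and for classification. During training, the pipe's data alphabet is updated with each training instance. Calling new SerialPipes(pipeList) does not reproduce that pipe, because its data alphabet starts out empty. Save and load the pipe, or the instance list that holds it, together with the model, and use that pipe to add the test instances.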
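
To make the alphabet point concrete, here is a minimal sketch in plain Java. ToyAlphabet and toFeatures are hypothetical, simplified stand-ins written for this illustration; they are not Mallet classes and only mimic the idea of an alphabet that assigns the next free index to each new word. The sketch shows that the same test document is encoded differently depending on whether the training alphabet or a fresh one is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AlphabetMismatchDemo {

    /** Hypothetical stand-in for a growing word-to-index alphabet (not Mallet's Alphabet). */
    static class ToyAlphabet {
        private final Map<String, Integer> indexOf = new HashMap<>();
        private final List<String> words = new ArrayList<>();

        /** Returns the word's index, assigning the next free index if the word is new. */
        int lookupIndex(String word) {
            Integer existing = indexOf.get(word);
            if (existing != null) {
                return existing;
            }
            int index = words.size();
            indexOf.put(word, index);
            words.add(word);
            return index;
        }

        int size() {
            return words.size();
        }
    }

    /** Encodes a whitespace-tokenized document as feature indices using the given alphabet. */
    static int[] toFeatures(ToyAlphabet alphabet, String document) {
        String[] tokens = document.trim().split("\\s+");
        int[] features = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            features[i] = alphabet.lookupIndex(tokens[i]);
        }
        return features;
    }

    public static void main(String[] args) {
        // "Training": the alphabet grows as documents pass through it.
        ToyAlphabet training = new ToyAlphabet();
        toFeatures(training, "cats chase mice"); // cats=0, chase=1, mice=2
        toFeatures(training, "dogs chase cats"); // dogs=3

        String testDoc = "dogs chase mice";

        // Reusing the training alphabet keeps the indices the model was trained on.
        int[] reused = toFeatures(training, testDoc);
        // A fresh alphabet numbers words from zero, so "dogs" collides with "cats".
        int[] fresh = toFeatures(new ToyAlphabet(), testDoc);

        System.out.println("training alphabet: " + Arrays.toString(reused)); // [3, 1, 2]
        System.out.println("fresh alphabet:    " + Arrays.toString(fresh));  // [0, 1, 2]
    }
}
```

With the fresh alphabet, "dogs" is encoded as index 0, which the trained model associates with "cats", so any per-word statistics are looked up for the wrong word. In Mallet terms, the corresponding change to the question's code would presumably be to build the test InstanceList from the training list's pipe (instances.getPipe()) instead of a new SerialPipes, and to persist that instance list (for example with InstanceList's save and load methods) next to model.bin.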

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/26852237/
