python - Implementing alternative forms of LDA

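I am using Latent Dirichlet Allocation on a corpus of news data from six different sources. I am interested in topic evolution and emergence, and I want to compare how the sources are alike and different from each other over time. I know that there are a number of modified LDA algorithms, such as the Author-Topic model, Topics Over Time, and so on.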

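My issue is that very few of these alternative model specifications are implemented in any standard format. A few are available in Java, but most exist only as conference papers. What is the best way to go about implementing some of these algorithms on my own? I am fairly proficient in R and JAGS, and can stumble around in Python when given long enough. I am willing to write the code, but I don't really know where to start, and I don't know C or Java. Can I build a model in JAGS or Python with only the formulas from the manuscript? If so, can someone point me to an example of doing this? Thanks.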

Best answer

My friend's response is below; please pardon the language.

First I wrote up a Python implementation of the collapsed Gibbs sampler seen here (http://www.pnas.org/content/101/suppl.1/5228.full.pdf+html) and fleshed out here (http://cxwangyi.files.wordpress.com/2012/01/llt.pdf). This was slow as balls.
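For anyone wondering what "a model built from the paper's formulas alone" can look like, here is a minimal sketch of a collapsed Gibbs sampler for plain LDA in pure Python. It is not the code described above. The function name, the default `alpha` and `beta` values, and the toy corpus are all assumptions made for illustration, and documents are assumed to be already converted to lists of integer word ids. Each token's topic is resampled with weight proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta), where the counts exclude the token being resampled.

```python
import random


def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA (hypothetical helper).

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (doc_topic, topic_word, assignments) count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # n_kw
    topic_total = [0] * num_topics                               # n_k
    assignments = []

    # Random initialisation of every token's topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    topics = range(num_topics)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                k_old = assignments[d][i]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1

                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in topics
                ]
                k_new = rng.choices(topics, weights=weights)[0]

                # Add the token back under its new topic.
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1

    return doc_topic, topic_word, assignments


# Toy corpus: word ids 0-2 and 3-5 tend to appear in different documents.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [5, 4, 3, 3]]
doc_topic, topic_word, assignments = lda_collapsed_gibbs(
    docs, num_topics=2, vocab_size=6, iterations=50)
```

The triple loop over iterations, documents, and tokens runs entirely in interpreted Python, which is one plausible reason a straightforward implementation like this is slow on a real corpus. Variants such as Topics Over Time change the sampling weights but keep the same remove, sample, add-back structure.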

Then I used a Python wrapping of a C implementation of this paper (http://books.nips.cc/papers/files/nips19/NIPS2006_0511.pdf), which is fast as f*ck, but the results are not as good as what one would see with NMF.

But the NMF implementations I've seen, with scikits and even with the recently released, scipy-sparse-compatible NIMFA library, all blow the f*ck up on any sizable corpus. My new white whale is a sliced, distributed implementation of the thing. This'll be non-trivial.
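For reference, NMF factors a non-negative document-term matrix V into W * H with both factors non-negative. Below is a minimal sketch using the classic multiplicative update rules for the Frobenius objective, in pure Python on dense lists. It is an assumption-laden toy, not the scikits or NIMFA implementation: the helper names, the `eps` guard, and the toy matrix are all made up for illustration.

```python
import random


def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Illustrative NMF via multiplicative updates (hypothetical helper)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H."""
    approx = matmul(w, h)
    return sum((x - y) ** 2
               for row_v, row_a in zip(v, approx)
               for x, y in zip(row_v, row_a)) ** 0.5


# Toy document-term matrix with two obvious blocks of vocabulary.
v = [[2, 1, 1, 0, 0, 0],
     [1, 2, 1, 0, 0, 0],
     [0, 0, 0, 1, 2, 1],
     [0, 0, 0, 2, 1, 1]]
w, h = nmf(v, rank=2)
error_after_fit = frobenius_error(v, w, h)
w1, h1 = nmf(v, rank=2, iterations=1)
error_after_one_step = frobenius_error(v, w1, h1)
```

Every update here forms dense products over the full matrix, so memory and time grow with the number of documents times the vocabulary size. That is one plausible reason dense implementations fall over on a large corpus, and it is why slicing the matrix across workers is an attractive (if non-trivial) direction.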

Regarding python - Implementing alternative forms of LDA, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10112500/

