python - Converting a Wikipedia dump to text with python -m gensim.scripts.make_wiki

Tags: python wikipedia gensim

I want to use gensim's python -m gensim.scripts.make_wiki script to convert a Wikipedia dump to plain text.

I run it as:

python -m gensim.scripts.make_wiki ./enwiki-latest-pages-articles.xml.bz2 ./results

It eventually fails with this error:

2016-04-06 20:43:46,471 : INFO : storing corpus in Matrix Market format to ./results/_bow.mm
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/scripts/make_wiki.py", line 88, in <module>
    MmCorpus.serialize(outp + '_bow.mm', wiki, progress_cnt=10000) # another ~9h
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/corpora/indexedcorpus.py", line 89, in serialize
    offsets = serializer.save_corpus(fname, corpus, id2word, progress_cnt=progress_cnt, metadata=metadata)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/corpora/mmcorpus.py", line 49, in save_corpus
    return matutils.MmWriter.write_corpus(fname, corpus, num_terms=num_terms, index=True, progress_cnt=progress_cnt, metadata=metadata)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/matutils.py", line 486, in write_corpus
    mw = MmWriter(fname)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/matutils.py", line 436, in __init__
    self.fout = utils.smart_open(self.fname, 'wb+') # open for both reading and writing
  File "build/bdist.linux-x86_64/egg/smart_open/smart_open_lib.py", line 111, in smart_open
NotImplementedError: unknown file mode wb+

Does anyone know what is going on here?

Best answer

I'm not sure about the command-line script, but the following worked for me:

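import logging

from gensim.corpora import WikiCorpus

logger = logging.getLogger(__name__)

def parse_wiki(wiki_bz_file):
    """Write every article in the dump to ./wiki_text_dump.txt, one per line."""
    i = 0
    # Passing an empty dict as the dictionary: the vocabulary is not needed here.
    wiki = WikiCorpus(wiki_bz_file, lemmatize=False, dictionary={})
    with open('./wiki_text_dump.txt', 'w') as output:
        for text in wiki.get_texts():
            # Each text is a list of tokens; joining them with spaces is an
            # assumption about what the original listToStr helper did.
            output.write(' '.join(text) + '\n')
            i += 1
            if i % 50000 == 0:
                logger.info("Saved %d articles", i)
    logger.info("Finished; saved %d articles", i)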
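
The writing loop in the answer can also be separated from gensim, which makes it easy to try on a handful of token lists before committing to a run over the full dump. The following is only a sketch: write_texts and its log_every parameter are hypothetical names, not part of gensim, and the bytes handling assumes that tokens may arrive either as UTF-8 bytes or as text, depending on the gensim and Python versions in use.

```python
import io
import logging

logger = logging.getLogger(__name__)


def write_texts(texts, out_path, log_every=50000):
    """Write each token list in `texts` as one space-separated line.

    `texts` is any iterable of token lists (hypothetical helper, not a
    gensim function). Returns the number of articles written.
    """
    count = 0
    with io.open(out_path, "w", encoding="utf-8") as out:
        for tokens in texts:
            # Assumption: tokens may be UTF-8 bytes or text; normalise to text.
            words = [t.decode("utf-8") if isinstance(t, bytes) else t
                     for t in tokens]
            out.write(u" ".join(words) + u"\n")
            count += 1
            if count % log_every == 0:
                logger.info("Saved %d articles", count)
    logger.info("Finished; saved %d articles", count)
    return count
```

With a WikiCorpus built as in the answer, this would presumably be called as write_texts(wiki.get_texts(), './wiki_text_dump.txt'). Because the file is opened in a with block, it is closed even if the iteration fails partway through the dump.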

Source: a similar question on Stack Overflow about converting a Wikipedia dump to text with python -m gensim.scripts.make_wiki: https://stackoverflow.com/questions/36462394/
