python - Natural language processing: text corpus format for word2vec

Tags: python c++ rest nlp word2vec

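I found a tutorial on using word2vec on a large Wikipedia dataset: http://danielfrg.github.io/blog/2013/09/21/word2vec-yhat/
I would like to build a yhat REST API similar to the one Daniel demonstrates in his tutorial.

Today I collected some Spanish newspaper articles that I want to analyze. The website I retrieved the data from formats its articles very consistently, so I have stored 1,000 articles as strings, for example: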

"Otros se dan a conocer por la simpleza, como Sonya Cortés, 
quien expresó que atesora compartir en familia y gozar de salud.   
En el ambiente del reggaeton, Khriz, del dúo Ángel & Khriz, 
aprovechará para estrenar su nueva piscina ya que por su agenda 
de trabajo no ha podido darse un chapuzón todavía. Mientras, 
Daddy Yankee se tomará un descanso con la familia luego de una larga gira."

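I am comfortable with Python and would like to use the Python wrapper listed in the tutorial: https://github.com/danielfrg/word2vec

How do I load my corpus into word2vec? Right now I have an array of strings.

At the moment my corpus fits in memory. Is word2vec still the right tool?

Best answer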

If by

Right now I have an array of strings

you mean that it is already tokenized, you can train a model with gensim as follows:

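import gensim  # LineSentence reads a text file: one sentence per line, tokens separated by whitespace
sentences = gensim.models.word2vec.LineSentence(path_to_corpus)
# gensim >= 4.0 names the dimensionality parameter `vector_size`; older releases call it `size`
model = gensim.models.Word2Vec(sentences, min_count=10, vector_size=500, window=10, sg=1, workers=4)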
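Since LineSentence expects a file with one sentence per line and whitespace-separated tokens, already tokenized sentences held in memory have to be written out in that layout first. A minimal sketch, assuming a hypothetical helper `write_line_sentence_file` and a temporary file name chosen only for illustration (neither is part of gensim):

```python
import os
import tempfile

def write_line_sentence_file(tokenized_sentences, path):
    """Write one sentence per line, tokens joined by single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in tokenized_sentences:
            f.write(" ".join(tokens) + "\n")

# Illustrative data: two already tokenized sentences.
tokenized = [
    ["daddy", "yankee", "se", "tomará", "un", "descanso"],
    ["khriz", "estrenará", "su", "nueva", "piscina"],
]

# Hypothetical location for the corpus file.
path_to_corpus = os.path.join(tempfile.gettempdir(), "corpus.txt")
write_line_sentence_file(tokenized, path_to_corpus)
```

The resulting `path_to_corpus` is the kind of path that LineSentence reads.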

Each sentence must be a list of strings (tokens), so the corpus as a whole is a list of such lists, i.e.:

[ ['this', 'is' , 'my', 'first', 'sentence'], ['this', 'is', 'the', 'second']]
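If the array still holds raw article text rather than tokens, it has to be converted into that shape first. A rough sketch using only the standard library, assuming a hypothetical helper `to_sentences`, a naive sentence split on `.`, `!` or `?`, and lowercase word tokens (a proper Spanish tokenizer would likely do better):

```python
import re

def to_sentences(articles):
    """Turn raw article strings into a list of token lists, one per sentence."""
    sentences = []
    for text in articles:
        # Naive split: break after '.', '!' or '?' when followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            # \w+ is Unicode-aware for str patterns in Python 3, so accented letters are kept.
            tokens = re.findall(r"\w+", sentence.lower())
            if tokens:
                sentences.append(tokens)
    return sentences

articles = [
    "En el ambiente del reggaeton, Khriz, del dúo Ángel & Khriz, "
    "aprovechará para estrenar su nueva piscina. "
    "Mientras, Daddy Yankee se tomará un descanso con la familia."
]
sentences = to_sentences(articles)
```

A list built this way can be passed directly as the first argument to `gensim.models.Word2Vec` instead of a LineSentence object. Note that with a small corpus a setting such as `min_count=10` discards every word that occurs fewer than ten times.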

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/20276264/
