machine-learning - Problem with training text in AdaGram.jl

Tags: machine-learning julia word2vec

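I am new to the Julia programming language. I am trying to install the Adaptive Skip-gram (AdaGram) model on my machine and have run into the following problem. Before training the model, we need a tokenized file and a dictionary file. My question is: what input should be given to tokenize.sh and dictionary.sh? Please tell me how the output files are actually generated and which extensions they should have.

This is the site I am referring to: https://github.com/sbos/AdaGram.jl. It is just like https://code.google.com/p/word2vec/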

Best answer

The package provides some shell scripts to preprocess the data and fit the model. You have to call them from the shell, i.e., outside Julia.

# Install the package
julia -e 'Pkg.clone("https://github.com/sbos/AdaGram.jl.git")'
julia -e 'Pkg.build("AdaGram")'

# Download some text
wget http://www.gutenberg.org/ebooks/100.txt.utf-8

# Tokenize the text, and count the words
~/.julia/v0.3/AdaGram/utils/tokenize.sh 100.txt.utf-8 text.txt
~/.julia/v0.3/AdaGram/utils/dictionary.sh text.txt dictionary.txt

# Train the model
~/.julia/v0.3/AdaGram/train.sh text.txt dictionary.txt model
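To make the inputs and outputs of the two preprocessing steps more concrete, here is a minimal sketch built only from standard Unix tools. It is not the package's tokenize.sh or dictionary.sh. The formats it produces (space-separated lowercase tokens, and one `word count` pair per line) are an assumption about what such steps typically emit, and the sample sentence and file names are made up for illustration.

```shell
# Hypothetical stand-in for the preprocessing steps; NOT the package's scripts.
export LC_ALL=C
workdir=$(mktemp -d)

# Raw input: any plain-text file (here a one-line sample).
printf 'To be, or not to be: that is the question.\n' > "$workdir/raw.txt"

# "Tokenize": lowercase, then squeeze every run of non-letters into one space.
tr '[:upper:]' '[:lower:]' < "$workdir/raw.txt" \
  | tr -cs 'a-z' ' ' > "$workdir/text.txt"

# "Dictionary": one token per line, drop empty lines, count occurrences,
# and print each entry as "word count" (assumed format).
tr ' ' '\n' < "$workdir/text.txt" \
  | sed '/^$/d' | sort | uniq -c \
  | awk '{print $2, $1}' > "$workdir/dictionary.txt"

cat "$workdir/dictionary.txt"
```

In the commands above, the file names are simply passed as arguments (text.txt, dictionary.txt, and an extensionless model), so the extension appears to be a free choice rather than something the scripts require.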

You can then use the model from Julia:

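using AdaGram
# Load the trained model and its dictionary from the file written by train.sh
vm, dict = load_model("model");
# Expected prior probability of each sense of the word "hamlet"
expected_pi(vm, dict.word2id["hamlet"])
# 10 nearest neighbors of the first sense of "hamlet"
nearest_neighbors(vm, dict, "hamlet", 1, 10)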

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/30002329/