nlp - How to tune a machine translation model with a huge language model?

Tags: nlp n-gram machine-translation moses language-model

Moses is the software for building machine translation models, and KenLM is the de facto language model software that Moses uses.

I have a text file with 16 GB of text, which I used to build a language model as follows:

bin/lmplz -o 5 <text > text.arpa

The resulting file (text.arpa) is 38 GB. Then I binarized the language model with:
bin/build_binary text.arpa text.binary

The binarized language model (text.binary) grew to 71 GB.
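As an aside, the binary can be made considerably smaller at build time. A sketch, assuming KenLM's build_binary supports the trie data structure with quantization (the -q/-b flags are taken from its help output; exact sizes will vary):

```shell
# Build a quantized trie instead of the default probing hash table.
# -q 8 stores probabilities in 8 bits, -b 8 quantizes backoff weights.
# This trades a little accuracy for a much smaller binary.
bin/build_binary -q 8 -b 8 trie text.arpa text.quantized.binary
```

This only shrinks the model on disk and in memory; it does not reduce the vocabulary, so tuning speed may still be an issue.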

In moses, after training the translation model, you are supposed to tune the model's weights using the MERT algorithm. This can be done simply with https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/mert-moses.pl.
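For concreteness, a typical invocation looks something like the following (the dev.src/dev.ref file names are hypothetical; the positional arguments are tuning input, references, decoder binary, and moses.ini):

```shell
# Sketch of a MERT tuning run against a held-out dev set.
scripts/training/mert-moses.pl \
    dev.src dev.ref \            # tuning source and reference translations
    bin/moses model/moses.ini \  # decoder binary and trained config
    --mertdir bin/ \             # where the mert/extractor binaries live
    --working-dir mert-work      # scratch directory for tuning runs
```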

MERT works fine with a small language model, but with a big language model it takes days to finish.

A Google search led me to KenLM's filter, which promises to filter a language model down to a smaller size: https://kheafield.com/code/kenlm/filter/

But I am clueless as to how to make it work. The command's help gives:
$ ~/moses/bin/filter
Usage: /home/alvas/moses/bin/filter mode [context] [phrase] [raw|arpa] [threads:m] [batch_size:m] (vocab|model):input_file output_file

copy mode just copies, but makes the format nicer for e.g. irstlm's broken
    parser.
single mode treats the entire input as a single sentence.
multiple mode filters to multiple sentences in parallel.  Each sentence is on
    a separate line.  A separate file is created for each sentence by appending
    the 0-indexed line number to the output file name.
union mode produces one filtered model that is the union of models created by
    multiple mode.

context means only the context (all but last word) has to pass the filter, but
    the entire n-gram is output.

phrase means that the vocabulary is actually tab-delimited phrases and that the
    phrases can generate the n-gram when assembled in arbitrary order and
    clipped.  Currently works with multiple or union mode.

The file format is set by [raw|arpa] with default arpa:
raw means space-separated tokens, optionally followed by a tab and arbitrary
    text.  This is useful for ngram count files.
arpa means the ARPA file format for n-gram language models.

threads:m sets m threads (default: conccurrency detected by boost)
batch_size:m sets the batch size for threading.  Expect memory usage from this
    of 2*threads*batch_size n-grams.

There are two inputs: vocabulary and model.  Either may be given as a file
    while the other is on stdin.  Specify the type given as a file using
    vocab: or model: before the file name.  

For ARPA format, the output must be seekable.  For raw format, it can be a
    stream i.e. /dev/stdout

But when I tried the following, it got stuck and did nothing:
$ ~/moses/bin/filter union lm.en.binary lm.filter.binary
Assuming that lm.en.binary is a model file
Reading lm.en.binary
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100

What should one do with the language model after binarization? Are there any other steps to manipulate large language models so as to reduce the computational load when tuning?


What is the usual way to tune on large LM files?

How do I use KenLM's filter?

(More details at https://www.mail-archive.com/moses-support@mit.edu/msg12089.html)

Best Answer

Answering how to use the filter command of KenLM:

cat small_vocabulary_one_word_per_line.txt \
  | filter single \
         "model:LM_large_vocab.arpa" \
          output_LM_small_vocab

Note: single can be replaced with union or copy. Read more in the help that gets printed if you run the filter binary without arguments.
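Putting this together for the tuning use case: the filter reads the ARPA (or raw count) format, not the binarized model, so filter first and re-binarize afterwards. A sketch, assuming the tuning set is dev.en and the original ARPA file is still available (all paths hypothetical):

```shell
# 1. Build a vocabulary from the tuning set, one token per line.
tr ' ' '\n' < dev.en | sort -u > dev.vocab

# 2. Filter the large ARPA model down to that vocabulary
#    (vocabulary on stdin, model given as a file, per the help text).
~/moses/bin/filter single model:lm.en.arpa lm.filtered.arpa < dev.vocab

# 3. Re-binarize the much smaller filtered model for decoding/tuning.
~/moses/bin/build_binary lm.filtered.arpa lm.filtered.binary
```

The filtered model only covers the tuning set's vocabulary, so use it for MERT and keep the full model for general decoding.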

(This question originally appeared on Stack Overflow: https://stackoverflow.com/questions/29869607/)
