nlp - How do I create my own model in the Stanford POS tagger?

Tags: nlp stanford-nlp pos-tagger

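I want to add new tagged words (local words used in our region) and create a new model. I created a .prop file from the command line, but how do I create the .tagger file?

When I try to create such a file as described on the Stanford website, I get an error like this: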

"No model specified"

What is the -model argument? Is it the corpus? How do I add my new tagged words to it?

And how do I then train the tagger?

The Stanford site says:

You need to start with a .props file which contains options for the tagger to use. The .props files we used to create the sample taggers are included in the models directory; you can start from whichever one seems closest to the language you want to tag.

For example, to train a new English tagger, start with the left3words tagger props file. To train a tagger for a western language other than English, you can consider the props files for the German or the French taggers, which are included in the full distribution. For languages using a different character set, you can start from the Chinese or Arabic props files. Or you can use the -genprops option to MaxentTagger, and it will write a sample properties file, with documentation, for you to modify. It writes it to stdout, so you'll want to save it to some file by redirecting output (usually with >). The # at the start of the line makes things a comment, so you'll want to delete the # before properties you wish to specify.
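The last step quoted above (deleting the # in front of the properties you want to set) can also be scripted. The following is only a minimal sketch: `SAMPLE_PROPS` is a made-up excerpt, not the real -genprops output, and `enable_property` is a hypothetical helper name.

```python
import re

# Hypothetical excerpt of a generated properties file; the real -genprops
# output is longer and its exact wording may differ.
SAMPLE_PROPS = """\
## Sample properties file (illustrative excerpt)
# model =
# trainFile =
# encoding = UTF-8
"""


def enable_property(props_text: str, key: str, value: str) -> str:
    """Uncomment the first `# key = ...` line and set it to `key = value`."""
    pattern = re.compile(rf"^#[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    # A function replacement avoids backslash-escape surprises in `value`.
    new_text, count = pattern.subn(lambda m: f"{key} = {value}", props_text, count=1)
    if count == 0:
        # The key was not present as a commented-out property, so append it.
        new_text = props_text.rstrip("\n") + f"\n{key} = {value}\n"
    return new_text


if __name__ == "__main__":
    text = enable_property(SAMPLE_PROPS, "model", "my-language.tagger")
    text = enable_property(
        text, "trainFile", "format=TSV,wordColumn=0,tagColumn=1,train.tsv"
    )
    print(text)
```

Editing the file by hand works just as well; the sketch only makes explicit that a property takes effect once its line no longer starts with #.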

Best answer

Here are two links that may help; both give step-by-step instructions for creating (training) a tagger:

  1. https://medium.com/@klintcho/training-a-swedish-pos-tagger-for-stanford-corenlp-546e954a8ee7
  2. http://www.florianboudin.org/wiki/doku.php?id=nlp_tools_related&DokuWiki=9d6b70b2ee818e600edc0359e3d7d1e8

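Note that in the .conf file you should point to your treebank (that is, real-world sentences annotated with POS tags and, for a dependency treebank, dependency relations). On the same line you should also specify its format:

  1. TEXT // a tokenized plain-text file of tagged words
  2. TSV // a tab-separated file, such as a CoNLL file
  3. TREES // a file in PTB (Penn Treebank) format

In my case I used a CoNLL file, which is in tab-separated values (TSV) format. I have to admit that I could not find clear documentation and had to resort to the source code.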
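To make the difference between the column-based and the plain-text layouts concrete, here is a rough sketch that turns CoNLL-style rows into word_tag text. It assumes zero-based column indices, blank lines between sentences, and `_` as the separator; `tsv_to_tagged_text` and the sample rows are invented for illustration and are not part of the tagger.

```python
from typing import Iterable, List


def tsv_to_tagged_text(lines: Iterable[str], word_col: int, tag_col: int,
                       separator: str = "_") -> List[str]:
    """Convert column-based lines into one `word<sep>tag ...` string per sentence.

    Assumes zero-based column indices and blank lines between sentences.
    """
    sentences: List[str] = []
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        cols = line.split("\t")
        current.append(f"{cols[word_col]}{separator}{cols[tag_col]}")
    if current:
        # Flush the last sentence when the input has no trailing blank line.
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    # Made-up rows in a CoNLL-like layout: ID, FORM, LEMMA, CPOS, POS, ...
    sample = [
        "1\tO\to\tDET\tDET\t_\t2\tdet",
        "2\tgato\tgato\tNOUN\tNOUN\t_\t3\tnsubj",
        "3\tdorme\tdormir\tVERB\tVERB\t_\t0\troot",
        "",
    ]
    for sentence in tsv_to_tagged_text(sample, word_col=1, tag_col=4):
        print(sentence)  # prints: O_DET gato_NOUN dorme_VERB
```

With a TSV training file no such conversion should be needed, since the word and tag columns are named directly in the training-file setting; the sketch is only meant to show how the same annotation looks in each layout.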

My configuration:

model = portuguese.tagger
arch = left3words,naacl2003unknowns,allwordshapes(-1,1)
trainFile = format=TSV,wordColumn=1,tagColumn=4,C:\\path\\universal-dev.conll
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
tagSeparator = _
# that's because I based my config on spanish!
encoding = utf-8
iterations = 100
lang = spanish
learnClosedClassTags = false
minFeatureThresh = 2
openClassTags = 
rareWordMinFeatureThresh = 10
rareWordThresh = 5
search = qn
sgml = false
sigmaSquared = 0.0
regL1 = 0.75
tokenize = true
tokenizerOptions = asciiQuotes
verbose = false
verboseResults = false
veryCommonWordThresh = 250
xmlInput = null
outputFormat = slashTags
nthreads = 16
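With a properties file like this, training is started by passing the file to MaxentTagger. The sketch below checks that `model` and `trainFile` are set and then builds the command line, on the assumption that the "No model specified" error in the question came from a missing `model` entry. The jar name `stanford-postagger.jar` and the helper names are assumptions, and `parse_props` handles only plain `key = value` lines, not the full Java properties syntax.

```python
from typing import Dict, List

TAGGER_CLASS = "edu.stanford.nlp.tagger.maxent.MaxentTagger"


def parse_props(text: str) -> Dict[str, str]:
    """Parse simple `key = value` lines, skipping blanks and '#' comment lines.

    Simplified on purpose: no line continuations or escape handling.
    """
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, because values such as trainFile
        # contain '=' themselves.
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def training_command(props_path: str, props: Dict[str, str],
                     jar: str = "stanford-postagger.jar") -> List[str]:
    """Build the argument list for a training run; the jar name is an assumption."""
    missing = [key for key in ("model", "trainFile") if not props.get(key)]
    if missing:
        raise ValueError(f"props file is missing: {', '.join(missing)}")
    return ["java", "-cp", jar, TAGGER_CLASS, "-props", props_path]


if __name__ == "__main__":
    example = (
        "model = portuguese.tagger\n"
        "trainFile = format=TSV,wordColumn=1,tagColumn=4,universal-dev.conll\n"
    )
    command = training_command("portuguese.props", parse_props(example))
    print(" ".join(command))
```

The resulting list can be handed to `subprocess.run`. Here `model` is the path where the trained .tagger file is written, not the corpus; the corpus is whatever `trainFile` points to.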

Regarding "nlp - How do I create my own model in the Stanford POS tagger?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27086260/
