nlp - How do I create my own model in the Stanford POS tagger?

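I want to add new tagged words (local words used in our region) and create a new model. I created the .prop file from the command line, but how do I create the .tagger file?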

When I tried to create such a file as described on the Stanford website, it showed an error like this:

"No model specified"

What is the -model argument? Is it the corpus? How do I add my new tagged words to it?

So how do I train the tagger?

You need to start with a .props file which contains options for the tagger to use. The .props files we used to create the sample taggers are included in the models directory; you can start from whichever one seems closest to the language you want to tag.

For example, to train a new English tagger, start with the left3words tagger props file. To train a tagger for a western language other than English, you can consider the props files for the German or the French taggers, which are included in the full distribution. For languages using a different character set, you can start from the Chinese or Arabic props files. Or you can use the -genprops option to MaxentTagger, and it will write a sample properties file, with documentation, for you to modify. It writes it to stdout, so you'll want to save it to some file by redirecting output (usually with >). The # at the start of the line makes things a comment, so you'll want to delete the # before properties you wish to specify.
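
The uncommenting step described above can also be scripted. Below is a minimal Python sketch, assuming the generated file lists optional properties as lines of the form `# key = value`. The helper name `enable_props` and the sample excerpt are hypothetical and are not part of the tagger distribution.

```python
import re

def enable_props(props_text, overrides):
    """Uncomment (or replace) the given keys in the text of a .props file.

    overrides maps property names to the values to set. Other properties
    and ordinary comment lines are left untouched.
    """
    out = []
    seen = set()
    for line in props_text.splitlines():
        # Match "key = ..." with an optional single leading "#".
        m = re.match(r"^\s*#?\s*([A-Za-z][\w.]*)\s*=", line)
        if m and m.group(1) in overrides:
            key = m.group(1)
            if key not in seen:
                out.append(f"{key} = {overrides[key]}")
                seen.add(key)
            continue  # drop repeated definitions of a key we already set
        out.append(line)
    # Append requested keys that the template did not mention at all.
    for key, value in overrides.items():
        if key not in seen:
            out.append(f"{key} = {value}")
    return "\n".join(out) + "\n"

# Hypothetical excerpt, only meant to resemble a generated template.
sample = (
    "## Sample properties (hypothetical excerpt)\n"
    "# model = \n"
    "# trainFile = \n"
    "# iterations = 100\n"
)
result = enable_props(sample, {"model": "my.tagger", "trainFile": "train.txt"})
print(result)
```

Properties that are not listed in `overrides` stay commented out, so the tagger would presumably fall back to its defaults for them.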

Best answer

Here are two links that may help you, with step-by-step instructions on how to create (train) a tagger:

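Note that in the .conf file you should point to your treebank (that is, real-world sentences parsed in a dependency-tree format, with POS tags and dependency relations). On the same line, you should specify your format: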

TEXT   // a tokenized file separated by text
TSV    // a tsv file, such as a conll file
TREES  // a file in PTB format

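In my case, I used a CoNLL file, which is in tab-separated values (TSV) format. I must admit that I could not find clear documentation and had to resort to the source code.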
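
To check which columns `wordColumn` and `tagColumn` should point at, it can help to preview the word/tag pairs before training. The following is a rough Python sketch, assuming the column indices are zero-based and that blank lines separate sentences. The helper names and the sample rows are made up for illustration, and skipping `#` lines is a choice of this helper, not a documented behavior of the tagger.

```python
def read_tsv_sentences(lines, word_column=1, tag_column=4):
    """Collect (word, tag) pairs per sentence from tab-separated lines."""
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # this helper ignores comment lines
        cols = line.split("\t")
        current.append((cols[word_column], cols[tag_column]))
    if current:
        sentences.append(current)
    return sentences

def to_text_format(sentences, tag_separator="_"):
    """Render each sentence as one line of word<sep>tag tokens."""
    return [
        " ".join(f"{word}{tag_separator}{tag}" for word, tag in sentence)
        for sentence in sentences
    ]

# Made-up rows in a CoNLL-like layout (ID, form, lemma, coarse tag, tag, ...).
sample = [
    "1\tO\to\tDET\tDET\t_\n",
    "2\tgato\tgato\tNOUN\tNOUN\t_\n",
    "\n",
    "1\tDorme\tdormir\tVERB\tVERB\t_\n",
]
sentences = read_tsv_sentences(sample)
text_lines = to_text_format(sentences)
print(text_lines)
```

If the printed pairs show lemmas or IDs instead of words and tags, the column numbers are off by one and need adjusting in the `trainFile` line.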

My configuration:

model = portuguese.tagger
arch = left3words,naacl2003unknowns,allwordshapes(-1,1)
trainFile = format=TSV,wordColumn=1,tagColumn=4,C:\\path\\universal-dev.conll
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
tagSeparator = _
# that's because I based my config on spanish!
encoding = utf-8
iterations = 100
lang = spanish
learnClosedClassTags = false
minFeatureThresh = 2
openClassTags = 
rareWordMinFeatureThresh = 10
rareWordThresh = 5
search = qn
sgml = false
sigmaSquared = 0.0
regL1 = 0.75
tokenize = true
tokenizerOptions = asciiQuotes
verbose = false
verboseResults = false
veryCommonWordThresh = 250
xmlInput = null
outputFormat = slashTags
nthreads = 16
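
With a configuration like this saved to a file, training is started by passing that file to `MaxentTagger` with `-props`; as far as I can tell, the `model` property is the path the trained .tagger file gets written to. Below is a small Python sketch that checks the two properties and assembles the command. The jar name, the file name `portuguese.props`, and the helper names are assumptions, and the parser is a simplification that does not handle the backslash escapes of real Java properties files.

```python
def parse_props(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # Split on the first "=" only, since values may contain "=" too.
        key, sep, value = stripped.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props

def training_command(props_path, props, jar="stanford-postagger.jar"):
    """Build the java command line; the jar name is an assumption."""
    missing = [key for key in ("model", "trainFile") if not props.get(key)]
    if missing:
        raise ValueError("props file is missing: " + ", ".join(missing))
    return [
        "java", "-classpath", jar,
        "edu.stanford.nlp.tagger.maxent.MaxentTagger",
        "-props", props_path,
    ]

# Shortened version of the configuration above (path simplified).
config = """model = portuguese.tagger
trainFile = format=TSV,wordColumn=1,tagColumn=4,universal-dev.conll
"""
props = parse_props(config)
command = training_command("portuguese.props", props)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once the jar and the training file are actually in place.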

Regarding "nlp - How do I create my own model in the Stanford POS tagger?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27086260/
