nlp - How do I create my own model in the Stanford POS tagger?

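I want to add new tagged words (local words used in our region) and create a new model. I created the .prop file from the command line, but how do I create the .tagger file?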


"No model specified"

What is the -model argument? Is it the corpus? And how do I add my new tagged words to it?


The Stanford site says:

You need to start with a .props file which contains options for the tagger to use. The .props files we used to create the sample taggers are included in the models directory; you can start from whichever one seems closest to the language you want to tag.

For example, to train a new English tagger, start with the left3words tagger props file. To train a tagger for a western language other than English, you can consider the props files for the German or the French taggers, which are included in the full distribution. For languages using a different character set, you can start from the Chinese or Arabic props files. Or you can use the -genprops option to MaxentTagger, and it will write a sample properties file, with documentation, for you to modify. It writes it to stdout, so you'll want to save it to some file by redirecting output (usually with >). The # at the start of the line makes things a comment, so you'll want to delete the # before properties you wish to specify.




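Note that in the .conf file, the trainFile line should point to your treebank (that is, real-world sentences parsed in a dependency-tree format, with POS tags and dependency relations). On that same line you should also specify the file's format: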

  1. TEXT // represents a tokenized file separated by text
  2. TSV // represents a TSV file, such as a CoNLL file
  3. TREES // represents a file in PTB format

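In my case, I used a CoNLL file, which is in tab-separated values (TSV) format. I must admit I could not find clear documentation for this and had to resort to the source code.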
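
Before training, it can help to check that wordColumn and tagColumn really pick out the word and the POS tag in your file. Below is a minimal Python sketch of such a check. It assumes the columns are counted from zero (which matches the CoNLL-X layout, where column 1 is FORM and column 4 is POSTAG); the function name and the two-token sample are hypothetical, made up for illustration.

```python
def read_tagged_pairs(lines, word_column=1, tag_column=4):
    """Yield (word, tag) pairs from tab-separated CoNLL-style lines.

    Assumption: columns are counted from zero, like wordColumn/tagColumn
    in the trainFile setting. Blank lines (sentence breaks) are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) <= max(word_column, tag_column):
            raise ValueError(f"too few columns: {line!r}")
        yield fields[word_column], fields[tag_column]


# Hypothetical sample in CoNLL-X layout:
# ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL
sample = [
    "1\tO\to\tDET\tDET\t_\t2\tdet\n",
    "2\tgato\tgato\tNOUN\tNOUN\t_\t0\tROOT\n",
    "\n",
]
pairs = list(read_tagged_pairs(sample))
print(pairs)  # [('O', 'DET'), ('gato', 'NOUN')]
```

With a real file you would pass an open file handle instead of the sample list. New words are not added to the model directly: as far as I understand it, you add tagged sentences that contain them to this training file and retrain.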


model = portuguese.tagger
arch = left3words,naacl2003unknowns,allwordshapes(-1,1)
trainFile = format=TSV,wordColumn=1,tagColumn=4,C:\\path\\universal-dev.conll
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
tagSeparator = _
# that's because I based my config on spanish!
encoding = utf-8
iterations = 100
lang = spanish
learnClosedClassTags = false
minFeatureThresh = 2
openClassTags = 
rareWordMinFeatureThresh = 10
rareWordThresh = 5
search = qn
sgml = false
sigmaSquared = 0.0
regL1 = 0.75
tokenize = true
tokenizerOptions = asciiQuotes
verbose = false
verboseResults = false
veryCommonWordThresh = 250
xmlInput = null
outputFormat = slashTags
nthreads = 16
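
To actually produce the .tagger file, this properties file is passed to MaxentTagger for training, and the trained model is written to the path given by model (here portuguese.tagger). In other words, -model names the model file, not the corpus; the corpus is whatever trainFile points to. As far as I can tell from the tagger documentation, the command is `java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props portuguese.conf`. The sketch below assembles the same command in Python; the jar location, the file name portuguese.conf and the 2g heap size are assumptions, so adjust them to your setup.

```python
import shlex
import subprocess


def training_command(jar, props, heap="2g"):
    """Build the java invocation that trains a tagger from a properties file."""
    return [
        "java", f"-Xmx{heap}",
        "-cp", jar,
        "edu.stanford.nlp.tagger.maxent.MaxentTagger",
        "-props", props,
    ]


def train(jar, props):
    """Run training; raises CalledProcessError if java exits with an error."""
    subprocess.run(training_command(jar, props), check=True)


# Hypothetical file names, both assumed to be in the current directory.
cmd = training_command("stanford-postagger.jar", "portuguese.conf")
print(shlex.join(cmd))
```

Calling train("stanford-postagger.jar", "portuguese.conf") would then start the training run. Once it finishes, the resulting portuguese.tagger is the file you pass as -model when tagging new text.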

