machine-learning - How do I form feature vectors for a named entity recognition classifier?

Tags: machine-learning language-agnostic nlp

I have a set of tags (different from the conventional Name, Place, Object, etc.). In my case they are domain-specific, and I call them: Entity, Action, Incident. I want to use these as seeds for extracting more named entities.

I came across this paper: "Efficient Support Vector Classifiers for Named Entity Recognition" by Isozaki et al. While I like the idea of using SVMs for named entity recognition, I am stuck on how to encode the feature vector. Here is what the paper says:

For instance, the words in “President George Herbert Bush said Clinton is . . . ” are classified as follows: “President” = OTHER, “George” = PERSON-BEGIN, “Herbert” = PERSON-MIDDLE, “Bush” = PERSON-END, “said” = OTHER, “Clinton” = PERSON-SINGLE, “is” = OTHER. In this way, the first word of a person’s name is labeled as PERSON-BEGIN. The last word is labeled as PERSON-END. Other words in the name are PERSON-MIDDLE. If a person’s name is expressed by a single word, it is labeled as PERSON-SINGLE. If a word does not belong to any named entities, it is labeled as OTHER. Since IREX defines eight NE classes, words are classified into 33 categories.

Each sample is represented by 15 features because each word has three features (part-of-speech tag, character type, and the word itself), and two preceding words and two succeeding words are also used for context dependence. Although infrequent features are usually removed to prevent overfitting, we use all features because SVMs are robust. Each sample is represented by a long binary vector, i.e., a sequence of 0 (false) and 1 (true). For instance, “Bush” in the above example is represented by a vector x = x[1] ... x[D] described below. Only 15 elements are 1.

x[1] = 0 // Current word is not ‘Alice’ 
x[2] = 1 // Current word is ‘Bush’ 
x[3] = 0 // Current word is not ‘Charlie’

x[15029] = 1 // Current POS is a proper noun 
x[15030] = 0 // Current POS is not a verb

x[39181] = 0 // Previous word is not ‘Henry’ 
x[39182] = 1 // Previous word is ‘Herbert’

I don't quite understand how the binary vector is constructed here. I know I'm missing some subtle point, but can someone help me understand this?

Best Answer

They omit a bag-of-words lexicon-building step.

Basically, you first build a mapping from the (non-rare) words in your training set to indices. Say you have 20,000 unique words in your training set; then every word in the training set maps to an index in [0, 20000).
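For instance, a minimal sketch of that lexicon-building step (the build_index_mapping name and the min_count cutoff are my own illustration, not from the original answer; the paper keeps all features, which corresponds to min_count=1):

from collections import Counter

def build_index_mapping(training_tokens, min_count=1):
  # Map each sufficiently frequent training word to a unique index in [0, vocab_size)
  counts = Counter(training_tokens)
  vocab = sorted(word for word, count in counts.items() if count >= min_count)
  return {word: idx for idx, word in enumerate(vocab)}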

The feature vector is then basically a concatenation of several very sparse vectors: a 1 for the particular word followed by 19,999 0s, then a 1 for the particular POS tag with another 50 or so 0s for the non-active POS tags, and so on. This is commonly called one-hot encoding: http://en.wikipedia.org/wiki/One-hot

def encode_word_feature(word, POStag, char_type,
                        word_index_mapping, POS_index_mapping, char_type_index_mapping):
  # A sparsely encoded vector makes a lot more sense than a dense list,
  # but a dense list is clearer for illustration.
  ret = [0] * (len(word_index_mapping) + len(POS_index_mapping) + len(char_type_index_mapping))
  so_far = 0
  ret[word_index_mapping[word] + so_far] = 1
  so_far += len(word_index_mapping)
  ret[POS_index_mapping[POStag] + so_far] = 1
  so_far += len(POS_index_mapping)
  ret[char_type_index_mapping[char_type] + so_far] = 1
  return ret

def encode_context(context):
  # 5 words x 3 features each = the paper's 15 active features
  ret = []
  for word, POStag, char_type in [
      (context.two_words_ago, context.two_pos_ago, context.two_char_types_ago),
      (context.one_word_ago, context.one_pos_ago, context.one_char_types_ago),
      (context.word, context.pos, context.char_type),
      (context.one_word_ahead, context.one_pos_ahead, context.one_char_types_ahead),
      (context.two_words_ahead, context.two_pos_ahead, context.two_char_types_ahead)]:
    ret += encode_word_feature(word, POStag, char_type,
               word_index_mapping, POS_index_mapping, char_type_index_mapping)
  return ret
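In practice you would not materialize these dense lists: with roughly 100k dimensions per token, you would store only the indices of the 15 active features (for example as scipy.sparse vectors, or in LIBSVM's index:value input format), which SVM implementations consume directly.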

So your feature vector is about 100k-dimensional (5 context positions × ~20k word features, plus a little extra for the POS and character-type tags), and it is almost entirely 0s, apart from the 15 1s at the positions picked out by your feature-to-index mappings.
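To make the bookkeeping concrete, here is a toy end-to-end run of the sketch above with tiny made-up mappings (purely illustrative; a real vocabulary would have ~20k words and all POS and character types):

word_index_mapping = {"President": 0, "George": 1, "Herbert": 2, "Bush": 3,
                      "said": 4, "Clinton": 5, "is": 6}
POS_index_mapping = {"NNP": 0, "VBZ": 1, "VBD": 2}
char_type_index_mapping = {"capitalized": 0, "lowercase": 1}

# The 5-word window around "Bush": two preceding words, "Bush" itself, two succeeding words
window = [("George", "NNP", "capitalized"), ("Herbert", "NNP", "capitalized"),
          ("Bush", "NNP", "capitalized"), ("said", "VBD", "lowercase"),
          ("Clinton", "NNP", "capitalized")]

x = []
for word, POStag, char_type in window:
  x += encode_word_feature(word, POStag, char_type,
           word_index_mapping, POS_index_mapping, char_type_index_mapping)

print(len(x), sum(x))  # 60 dimensions in this toy setup, exactly 15 of them set to 1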

A similar question on machine-learning - How do I form feature vectors for a named entity recognition classifier? can be found on Stack Overflow: https://stackoverflow.com/questions/8219772/
