python-2.7 - Creating a feature dictionary for a Python machine-learning (Naive Bayes) algorithm

Tags: python-2.7 dictionary machine-learning nltk feature-extraction

For example, I want to use last names to predict Chinese vs. non-Chinese ethnicity. In particular, I want to extract three-letter substrings from the last names. So the last name "gao" will give one feature "gao", while "chan" will give two features, "cha" and "han".

The splitting is done successfully in the three_split function below. But as I understand it, to incorporate this as a feature set I need to return the output as a dictionary. Any hints on how to do that? For "chan" the dictionary should return "cha" and "han" as TRUE.
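In other words, for "chan" the feature extractor should return something like this (a sketch of the intended output, not the code below):

{'cha': True, 'han': True}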

from nltk.classify import PositiveNaiveBayesClassifier
import re

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']

nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']

def three_split(word):
    word = word.lower()
    word = word.replace(" ", "_")
    split = 3
    return [word[start:start+split] for start in range(0, len(word)-2)]

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, 
    unlabeled_featuresets)

print three_split("Jim Silva")
print classifier.classify(three_split("Jim Silva"))

Best Answer

Here's a white-box answer:

Running your original code, it outputs:

Traceback (most recent call last):
  File "test.py", line 17, in <module>
    unlabeled_featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/positivenaivebayes.py", line 108, in train
    for fname, fval in featureset.items():
AttributeError: 'list' object has no attribute 'items'

Looking at line 17:

classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, 
    unlabeled_featuresets)

It looks like PositiveNaiveBayesClassifier requires an object that has an '.items()' attribute, which intuitively should be a dict if the NLTK code is Pythonic.
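A minimal illustration of the difference, using a hypothetical featureset for "chan" (sorted() is only there to make the output deterministic):

>>> featureset = {'cha': True, 'han': True}
>>> sorted(featureset.items())
[('cha', True), ('han', True)]
>>> ['cha', 'han'].items()
Traceback (most recent call last):
  ...
AttributeError: 'list' object has no attribute 'items'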

Looking at https://github.com/nltk/nltk/blob/develop/nltk/classify/positivenaivebayes.py#L88, there is no clear explanation of what the positive_featuresets parameter should contain:

:param positive_featuresets: A list of featuresets that are known as positive examples (i.e., their label is True).

Checking the docstring, we see this example:

Example:
    >>> from nltk.classify import PositiveNaiveBayesClassifier
Some sentences about sports:
    >>> sports_sentences = [ 'The team dominated the game',
    ...                      'They lost the ball',
    ...                      'The game was intense',
    ...                      'The goalkeeper catched the ball',
    ...                      'The other team controlled the ball' ]
Mixed topics, including sports:
    >>> various_sentences = [ 'The President did not comment',
    ...                       'I lost the keys',
    ...                       'The team won the game',
    ...                       'Sara has two kids',
    ...                       'The ball went off the court',
    ...                       'They had the ball for the whole game',
    ...                       'The show is over' ]
The features of a sentence are simply the words it contains:
    >>> def features(sentence):
    ...     words = sentence.lower().split()
    ...     return dict(('contains(%s)' % w, True) for w in words)
We use the sports sentences as positive examples, the mixed ones as unlabeled examples:
    >>> positive_featuresets = list(map(features, sports_sentences))
    >>> unlabeled_featuresets = list(map(features, various_sentences))
    >>> classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
    ...                                                 unlabeled_featuresets)

Now we've found the features() function that converts a sentence into features and returns

dict(('contains(%s)' % w, True) for w in words)

Basically, it's something that .items() can be called on. Looking at the dict comprehension, the 'contains(%s)' % w part seems a bit redundant unless it's there for human readability, so you could simply use dict((w, True) for w in words).
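For instance, the docstring's feature extractor could drop the prefix like this (a small sketch, behavior otherwise unchanged):

def features(sentence):
    # map each lowercase word directly to True, without the 'contains(...)' wrapper
    words = sentence.lower().split()
    return dict((w, True) for w in words)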

Also, replacing spaces with underscores may be redundant unless it's needed later.

Finally, the slicing with a bounded range can be replaced by an ngrams function that extracts character ngrams, e.g.

>>> word = 'alexgao'
>>> split=3
>>> [word[start:start+split] for start in range(0, len(word)-2)]
['ale', 'lex', 'exg', 'xga', 'gao']
# With ngrams
>>> from nltk.util import ngrams
>>> ["".join(ng) for ng in ngrams(word,3)]
['ale', 'lex', 'exg', 'xga', 'gao']

Your feature extraction function can thus be simplified as follows:

from nltk.util import ngrams
def three_split(word):
    return dict(("".join(ng, True) for ng in ngrams(word.lower(),3))

[out]:

{'im ': True, 'm s': True, 'jim': True, 'ilv': True, ' si': True, 'lva': True, 'sil': True}
False
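As a quick sanity check against the original question (sorted() is used only so the items print in a stable order):

>>> sorted(three_split("chan").items())
[('cha', True), ('han', True)]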

In fact, NLTK classifiers are versatile enough that you can use tuples of characters as features, so there's no need to join the ngrams back together when extracting features, i.e.:

from nltk.classify import PositiveNaiveBayesClassifier
import re
from nltk.util import ngrams

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']

nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']


def three_split(word):
    # use the character tuples themselves as feature names
    return dict((ng, True) for ng in ngrams(word.lower(), 3))

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))

classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, 
    unlabeled_featuresets)

print three_split("Jim Silva")
print classifier.classify(three_split("Jim Silva"))

[out]:

{('m', ' ', 's'): True, ('j', 'i', 'm'): True, ('s', 'i', 'l'): True, ('i', 'l', 'v'): True, (' ', 's', 'i'): True, ('l', 'v', 'a'): True, ('i', 'm', ' '): True}
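If you want to peek at which character trigrams drive the decision, the trained classifier inherits show_most_informative_features() from NLTK's NaiveBayesClassifier; with the tiny training set above, the exact output will vary:

classifier.show_most_informative_features(5)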

Regarding python-2.7 - Creating a feature dictionary for a Python machine-learning (Naive Bayes) algorithm, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29437467/
