I want to generate n-grams, but letter by letter rather than word by word.
Ordinary (word-level) n-grams:
sentence : He want to watch football match
result:
he, he want, want, want to, to, to watch, watch, watch football, football, football match, match
I want to do the same thing at the letter level:
word : Angela
result:
a, an, n, ng, g, ge, e, el, l, la, a
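To pin down the expected output above, here is a minimal plain-Python sketch. The helper name `char_ngrams` and its `max_n` parameter are hypothetical and not part of sklearn; the function simply emits, for each start position, every substring of length 1 up to `max_n`.

```python
def char_ngrams(word, max_n=2):
    """Return character n-grams of length 1..max_n, grouped by start position.

    Hypothetical helper for illustration only; not an sklearn function.
    """
    text = word.lower()
    grams = []
    for start in range(len(text)):
        for n in range(1, max_n + 1):
            # Skip n-grams that would run past the end of the word.
            if start + n <= len(text):
                grams.append(text[start:start + n])
    return grams


print(char_ngrams("Angela"))
# ['a', 'an', 'n', 'ng', 'g', 'ge', 'e', 'el', 'l', 'la', 'a']
```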
Here is my code using sklearn, but it still produces word-level n-grams instead of letter-level ones:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 100),token_pattern = r"(?u)\b\w+\b")
corpus = ['Angel','Angelica','John','Johnson']
X = vectorizer.fit_transform(corpus)
analyze = vectorizer.build_analyzer()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
print(vectorizer.transform(['Angela']).toarray())
Best answer
There is an 'analyzer' parameter that does what you want.
According to the documentation:
analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable
Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
By default it is set to 'word', but you can change it.
Just do this:
# token_pattern is only used when analyzer='word'; it has no effect with 'char'
vectorizer = CountVectorizer(ngram_range=(1, 100),
                             token_pattern=r"(?u)\b\w+\b",
                             analyzer='char')
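As a quick check of what the 'char' analyzer produces, here is a minimal sketch. It assumes `ngram_range=(1, 2)` purely for illustration, so that the output lines up with the 'Angela' example in the question. Note that input is lowercased by default, and sklearn may list the n-grams in a different order than the question does.

```python
from sklearn.feature_extraction.text import CountVectorizer

# analyzer='char' builds n-grams from characters instead of words.
# ngram_range=(1, 2) is an illustrative choice: single letters and letter pairs.
vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2))

# build_analyzer() returns the function that turns raw text into features.
analyze = vectorizer.build_analyzer()
ngrams = analyze('Angela')
print(ngrams)

# Fitting on the corpus from the question gives one row per name.
corpus = ['Angel', 'Angelica', 'John', 'Johnson']
X = vectorizer.fit_transform(corpus)
print(X.shape)
```

With `analyzer='char_wb'` the n-grams are instead built only inside word boundaries and padded with spaces at the edges, which matters mainly when the input contains several words.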
This question, "python - N-grams for letters in sklearn", is from Stack Overflow: https://stackoverflow.com/questions/53033877/