python-3.x - Using kmeans (sklearn) to predict the cluster of new text?

Tags: python-3.x scikit-learn nlp k-means

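I have a very small list of short strings that I want to (1) cluster and then (2) use the resulting model to predict which cluster a new string belongs to.

The first part runs fine, but getting a prediction for a new string does not.

Part 1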

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# List of short definitions to cluster
documents_lst = ['a small, narrow river',
                'a continuous flow of liquid, air, or gas',
                'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
                'a group in which schoolchildren of the same age and ability are taught',
                '(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
                'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
                'put (schoolchildren) in groups of the same age and ability to be taught together',
                'a natural body of running water flowing on or under the earth']


# 1. Vectorize the text
tfidf_vectorizer  = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)

# 2. Get the number of clusters to make .. (find a better way than random)
num_clusters = 3

# 3. Cluster the definitions
km = KMeans(n_clusters=num_clusters, init='k-means++').fit(tfidf_matrix)

clusters = km.labels_.tolist()

print(clusters)

Output:

tfidf_matrix.shape:  (8, 39)
[0, 1, 0, 2, 1, 0, 2, 0]
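
The comment in step 2 asks for a better way to pick num_clusters than guessing. One common heuristic is to compare silhouette scores across a few candidate values. The following is only a sketch under assumptions not in the original post: the candidate range 2 to 5, the fixed random_state, and the names scores and best_k are all illustrative, and with only eight documents the scores are noisy at best.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

documents_lst = ['a small, narrow river',
                 'a continuous flow of liquid, air, or gas',
                 'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
                 'a group in which schoolchildren of the same age and ability are taught',
                 '(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
                 'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
                 'put (schoolchildren) in groups of the same age and ability to be taught together',
                 'a natural body of running water flowing on or under the earth']

tfidf_matrix = TfidfVectorizer(stop_words='english').fit_transform(documents_lst)
dense_matrix = tfidf_matrix.toarray()  # the data is tiny, so a dense copy is cheap

# Assumption: try k = 2..5; the silhouette score needs 2 <= k <= n_samples - 1.
scores = {}
for k in range(2, 6):
    candidate = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    labels = candidate.fit_predict(tfidf_matrix)
    scores[k] = silhouette_score(dense_matrix, labels)

# Higher silhouette means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print(scores)
print('best k by silhouette:', best_k)
```

The chosen best_k could then replace the hard-coded num_clusters = 3, though for a corpus this small it is worth checking the resulting clusters by eye.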

Part 2

The part that fails:

predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']

tfidf_vectorizer  = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(predict_doc)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)

km.predict(tfidf_matrix)

Error:

ValueError: Incorrect number of features. Got 7 features, expected 39

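FWIW: I more or less understand that, after vectorizing, the training data and the prediction data end up with different numbers of features...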
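
The mismatch can be seen directly by comparing matrix widths. This is a minimal sketch, assuming a three-document subset of the list above; the variable names (fitted, refit_matrix, projected_matrix) are illustrative. Calling fit_transform on the new document learns a fresh vocabulary from that document alone, while calling transform on the already fitted vectorizer maps it onto the training vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: a small subset of the training documents, for brevity.
train_docs = ['a small, narrow river',
              'a continuous flow of liquid, air, or gas',
              'a natural body of running water flowing on or under the earth']
new_doc = ['A stream is a body of water with a current, confined within a bed and banks.']

# Fit once on the training documents.
fitted = TfidfVectorizer(stop_words='english').fit(train_docs)
train_matrix = fitted.transform(train_docs)

# Refitting on the new document builds a different vocabulary (the bug).
refit_matrix = TfidfVectorizer(stop_words='english').fit_transform(new_doc)

# Reusing the fitted vectorizer keeps the training vocabulary (the fix).
projected_matrix = fitted.transform(new_doc)

print('training features:  ', train_matrix.shape[1])
print('refit features:     ', refit_matrix.shape[1])
print('projected features: ', projected_matrix.shape[1])
```

Only the projected matrix has the same number of columns as the training matrix, which is what km.predict expects.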

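I am open to any solution, including switching from kmeans to an algorithm better suited to clustering short texts.

Thanks in advance.

Accepted answer

For completeness, I will answer my own question using the answer from here. It does not answer that question, but it does answer mine. The key point is to fit the vectorizer once on the training text and then only call transform on the new text, so both matrices share the same features.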

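from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

list1 = ["My name is xyz", "My name is pqr", "I work in abc"]
list2 = ["My name is xyz", "I work in abc"]

# charset_error was renamed to decode_error in later scikit-learn releases;
# min_df=1 (the default) replaces min_df=0, which newer releases reject.
vectorizer = TfidfVectorizer(min_df=1, max_df=0.5, stop_words="english", decode_error="ignore", ngram_range=(1, 3))
vec = vectorizer.fit(list1)   # learn the vocabulary and IDF weights from list1 only
vectorized = vec.transform(list1)   # transform list1 using vec

# precompute_distances and n_jobs were removed from KMeans in scikit-learn 1.0
km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=1000, tol=0.0001, verbose=0, random_state=None)

km.fit(vectorized)
list2Vec = vec.transform(list2)  # reuse the fitted vec; do not refit on list2
km.predict(list2Vec)  # one cluster label per string in list2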

Credit goes to @IrshadBhat
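
One way to avoid accidentally refitting the vectorizer is to bundle it with the clusterer. The following is a sketch applied to the documents from the question, not the method from the answer above: the use of make_pipeline, the fixed random_state=0, and the names model, train_labels and new_label are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

documents_lst = ['a small, narrow river',
                 'a continuous flow of liquid, air, or gas',
                 'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
                 'a group in which schoolchildren of the same age and ability are taught',
                 '(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
                 'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
                 'put (schoolchildren) in groups of the same age and ability to be taught together',
                 'a natural body of running water flowing on or under the earth']
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']

# The pipeline fits the vectorizer and KMeans together on the training text.
model = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0),
)
model.fit(documents_lst)

# make_pipeline names each step after its lowercased class name.
train_labels = model.named_steps['kmeans'].labels_

# predict() only transforms the new text with the fitted vocabulary,
# so the feature count always matches what KMeans was trained on.
new_label = model.predict(predict_doc)

print(train_labels.tolist())
print(new_label)
```

Cluster ids are arbitrary, so the label printed for the new document is only meaningful relative to the training labels from the same fitted model.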

Regarding "python-3.x - Using kmeans (sklearn) to predict the cluster of new text?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42825655/
