python - How to convert sklearn tfidf vector pandas output to a meaningful format

Tags: python pandas scikit-learn tf-idf tfidfvectorizer

I have used sklearn to get the tfidf scores for my corpus, but the output is not in the format I want.

Code:

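import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(ngram_range=(1, 3))
tfidf_matrix = vect.fit_transform(df_doc_wholetext['csv_text'])

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(tfidf_matrix.toarray(), columns=vect.get_feature_names_out())

df['filename'] = df.index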

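What I have:

word1, word2 and word3 can be any words in the corpus; I call them word1, word2 and word3 only for the sake of the example.

What I need:

I tried transforming it, but that converts all of the columns into rows. Is there a way to achieve this?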

Best answer

df1 = df.filter(like='word').stack().reset_index()
df1.columns = ['filename','word_name','score']

Output:

   filename word_name  score
0         0     word1   0.01
1         0     word2   0.04
2         0     word3   0.05
3         1     word1   0.02
4         1     word2   0.99
5         1     word3   0.07
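
To try the answer without a real corpus, the following is a minimal sketch that builds a small wide-format frame by hand, using the same scores as the output above, and then applies the same filter/stack/reset_index steps. The toy frame is an assumption standing in for the TF-IDF DataFrame from the question.

```python
import pandas as pd

# Toy wide-format frame standing in for the TF-IDF DataFrame (assumed data,
# matching the scores shown in the answer's output).
df = pd.DataFrame(
    {
        "word1": [0.01, 0.02],
        "word2": [0.04, 0.99],
        "word3": [0.05, 0.07],
    }
)
df["filename"] = df.index

# Keep only the columns whose name contains "word", then stack them:
# each (row index, column name) pair becomes one row of the result.
df1 = df.filter(like="word").stack().reset_index()
df1.columns = ["filename", "word_name", "score"]

print(df1)
```

Note that the filename values in the result come from the row index, not from the filename column, which filter() leaves out.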

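Update for general column headers (iloc[:, 1:] skips the first column, so this assumes filename is the first column; with the question's code, which appends filename as the last column, use df.iloc[:, :-1] instead):

df1 = df.iloc[:,1:].stack().reset_index()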
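
With a real vocabulary the column names share no common substring, and the position of the filename column depends on how the frame was built. One possible variant, sketched here rather than taken from the answer, moves filename into the index before stacking so that its position no longer matters. The column names and file names below are hypothetical.

```python
import pandas as pd

# Hypothetical wide-format frame with arbitrary vocabulary terms as columns.
df = pd.DataFrame(
    {
        "apple": [0.01, 0.02],
        "banana pie": [0.04, 0.99],
        "cherry": [0.05, 0.07],
    }
)
df["filename"] = ["a.csv", "b.csv"]  # hypothetical file names

long_df = (
    df.set_index("filename")            # filename becomes the row index
      .rename_axis(columns="word_name")  # name the column axis
      .stack()                           # one row per (filename, word_name)
      .rename("score")                   # name the resulting values
      .reset_index()
)

print(long_df)
```

Because TF-IDF matrices are mostly zeros, it may be useful to keep only the non-zero rows afterwards, for example with long_df[long_df["score"] > 0].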

This post on converting sklearn tfidf vector pandas output to a meaningful format is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/57629697/
