I'm doing some natural language processing on Twitter data. I've successfully loaded and cleaned up some tweets and put them into the data frame below.
id text
1104159474368024599 repmiketurner the only time that michael cohen told the truth is when he pled that he is guilty also when he said no collusion and i did not tell him to lie
1104155456019357703 rt msnbc president trump and first lady melania trump view memorial crosses for the 23 people killed in the alabama tornadoes t
The problem is that I'm trying to build a term-frequency matrix where each row is a tweet and each column is how many times a given word occurs in that row. My only issue is that other posts only cover term-frequency distributions of a text file. Here is the code I used to generate the data frame above:
import pandas as pd
import nltk.classify
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

df_tweetText = df_tweet
# Makes a dataframe of just the text and ID to make it easier to tokenize
df_tweetText = pd.DataFrame(df_tweetText['text'].str.replace(r'[^\w\s]+', '', regex=True).str.lower())

# Removing stop words
# nltk.download('stopwords')
stop = stopwords.words('english')
#df_tweetText['text'] = df_tweetText.apply(lambda x: [item for item in x if item not in stop])

# Remove the https links (after the punctuation strip above, t.co links look like "httpstco...")
df_tweetText['text'] = df_tweetText['text'].replace(r'http\w+', '', regex=True, inplace=False)

# Tokenize the words
df_tweetText
At first I tried using word_dist = nltk.FreqDist(df_tweetText['text']), but it ended up counting whole sentences instead of each word in the row.
The other thing I tried was tokenizing each word with df_tweetText['text'] = df_tweetText['text'].apply(word_tokenize) and then calling FreqDist again, but that gave me an error saying unhashable type: 'list'.
1104159474368024599 [repmiketurner, the, only, time, that, michael, cohen, told, the, truth, is, when, he, pled, that, he, is, guilty, also, when, he, said, no, collusion, and, i, did, not, tell, him, to, lie]
1104155456019357703 [rt, msnbc, president, trump, and, first, lady, melania, trump, view, memorial, crosses, for, the, 23, people, killed, in, the, alabama, tornadoes, t]
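The unhashable type: 'list' error happens because FreqDist tries to hash each element of the Series, and after tokenization each element is a list. One way around it (a minimal sketch using collections.Counter on made-up token lists standing in for df_tweetText['text'], not the original tweets) is to count tokens row by row:

```python
from collections import Counter

import pandas as pd

# Hypothetical tokenized rows standing in for df_tweetText['text']
tokens = pd.Series([
    ["repmiketurner", "the", "only", "time"],
    ["rt", "msnbc", "president", "trump"],
])

# Counter hashes the individual tokens, not the list itself,
# so this works where FreqDist(series) fails
per_row_counts = tokens.apply(Counter)
print(per_row_counts[0]["the"])   # 1
```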
Is there some alternative way to build this term-frequency matrix? Ideally, I'd like my data to look like this:
id                  | collusion | president |
--------------------|-----------|-----------|
1104159474368024599 |     1     |     0     |
1104155456019357703 |     0     |     2     |
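A matrix of exactly that shape can be built in plain pandas (a sketch assuming pandas >= 0.25 for DataFrame.explode, with toy token lists in place of the real tweets) by exploding the token lists and cross-tabulating:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1104159474368024599, 1104155456019357703],
    "text": [["no", "collusion", "no"], ["president", "trump"]],
})

# One row per (id, token), then count how often each token occurs per id
exploded = df.explode("text")
term_matrix = pd.crosstab(exploded["id"], exploded["text"])
print(term_matrix)
```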
EDIT 1: So I decided to take a look at the textmining library and recreated one of their examples. The only problem is that it creates the term-document matrix with one row for every tweet.
import textmining

# Creates the term matrix
tweetDocumentmatrix = textmining.TermDocumentMatrix()
for column in df_tweetText:
    tweetDocumentmatrix.add_doc(df_tweetText['text'].to_string(index=False))
    # print(df_tweetText['text'].to_string(index=False))

for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)
EDIT 2: So I tried sklearn, and this approach worked, but the problem is that I'm finding Chinese/Japanese characters in my columns that should not exist. Also, for some reason my columns are showing up as numbers.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = df_tweetText['text'].tolist()
vec = CountVectorizer()
X = vec.fit_transform(corpus)
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df)
00 007cigarjoe 08 10 100 1000 10000 100000 1000000 10000000 \
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
Best Answer
Iterating over each row probably isn't optimal, but it works. Mileage may vary based on how long the tweets are and how many tweets are being processed.
import pandas as pd
from collections import Counter

# example df
df = pd.DataFrame()
df['tweets'] = [['test', 'xd'], ['hehe', 'xd'], ['sam', 'xd', 'xd']]

# result dataframe
df2 = pd.DataFrame()
for i, row in df.iterrows():
    df2 = df2.append(pd.DataFrame.from_dict(Counter(row.tweets), orient='index').transpose())
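Note that DataFrame.append was removed in pandas 2.0; the same loop can be written with pd.concat, plus fillna(0) so terms missing from a tweet show up as zeros rather than NaN (a sketch of the same approach on the answer's example data, not part of the original answer):

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame()
df['tweets'] = [['test', 'xd'], ['hehe', 'xd'], ['sam', 'xd', 'xd']]

# Build one single-row frame of term counts per tweet, then stack them
rows = [
    pd.DataFrame.from_dict(Counter(row.tweets), orient='index').transpose()
    for _, row in df.iterrows()
]
df2 = pd.concat(rows, ignore_index=True).fillna(0).astype(int)
print(df2)
```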
Regarding python - Creating a term frequency matrix from a Python Dataframe, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55113812/