python - Text preprocessing with NLTK

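I am practicing using NLTK to strip certain features from raw tweets, and afterwards I want to drop the tweets that are irrelevant to me (for example, empty or single-word tweets). However, some single-word tweets do not seem to be removed. I am also unable to remove stopwords that sit at the beginning or end of a sentence.

Any suggestions? For now, I would like the output to be a sentence rather than a list of tokenized words.

Any other comments on improving the code (processing time, elegance) are welcome.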

import string
import numpy as np
import nltk
from nltk.corpus import stopwords

cache_english_stopwords=stopwords.words('english')
cache_en_tweet_stopwords=stopwords.words('english_tweet')

# For clarity, df is a pandas dataframe with a column['text'] together with other headers.

def tweet_clean(df):
    temp_df = df.copy()
    # Remove hyperlinks
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('https?:\/\/.*\/\w*', '', regex=True)
    # Remove hashtags
    # temp_df.loc[:,"text"]=temp_df.loc[:,"text"].replace('#\w*', '', regex=True)
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('#', ' ', regex=True)
    # Remove citations
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\@\w*', '', regex=True)
    # Remove tickers
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\$\w*', '', regex=True)
    # Remove punctuation
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[' + string.punctuation + ']+', '', regex=True)
    # Remove stopwords
    for tweet in temp_df.loc[:,"text"]:
        tweet_tokenized=nltk.word_tokenize(tweet)
        for w in tweet_tokenized:
            if (w.lower() in cache_english_stopwords) | (w.lower() in cache_en_tweet_stopwords):
                temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[\W*\s?\n?]'+w+'[\W*\s?]', ' ', regex=True)
                #print("w in stopword")
    # Remove quotes
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\&*[amp]*\;|gt+', '', regex=True)
    # Remove RT
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\s+rt\s+', '', regex=True)
    # Remove linebreak, tab, return
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[\n\t\r]+', ' ', regex=True)
    # Remove via with blank
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('via+\s', '', regex=True)
    # Remove multiple whitespace
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\s+\s+', ' ', regex=True)
    # Remove single word sentence
    for tweet_sw in temp_df.loc[:, "text"]:
        tweet_sw_tokenized = nltk.word_tokenize(tweet_sw)
        if len(tweet_sw_tokenized) <= 1:
            temp_df.loc["text"] = np.nan
    # Remove empty rows
    temp_df.loc[(temp_df["text"] == '') | (temp_df['text'] == ' ')] = np.nan
    temp_df = temp_df.dropna()
    return temp_df

Best answer

What is df? A list of tweets? You should probably clean the tweets one at a time rather than as a whole list. It would be easier with a function tweet_cleaner(single_tweet).
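A minimal sketch of such a per-tweet function, using only the standard library. The pattern names and the exact regular expressions are my own assumptions, not something taken from the question, and they cover only links, mentions, tickers and hashtags:

```python
import re

# Compiled once at module level so every call reuses the same patterns.
# These names and patterns are illustrative assumptions.
URL_RE = re.compile(r"https?://\S+")
MENTION_OR_TICKER_RE = re.compile(r"[@$]\w+")
WHITESPACE_RE = re.compile(r"\s+")


def tweet_cleaner(single_tweet):
    """Return one cleaned tweet as a plain string."""
    text = URL_RE.sub(" ", single_tweet)        # drop hyperlinks
    text = MENTION_OR_TICKER_RE.sub(" ", text)  # drop @mentions and $tickers
    text = text.replace("#", " ")               # keep the hashtag word, drop '#'
    text = WHITESPACE_RE.sub(" ", text)         # collapse runs of whitespace
    return text.strip()


tweets = ["RT @user check $AAPL now https://t.co/abc #stocks", "hello   world"]
cleaned = [tweet_cleaner(t) for t in tweets]
```

With a pandas column, the same function could presumably be applied per row, for example with `df["text"].apply(tweet_cleaner)`.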

nltk provides a TweetTokenizer that helps with cleaning tweets.
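A short sketch of how TweetTokenizer might be used here. The sample tweet is made up for illustration; `preserve_case`, `strip_handles` and `reduce_len` are constructor options of `nltk.tokenize.TweetTokenizer`. Since the question asks for a sentence rather than a token list, the tokens are joined back together at the end:

```python
from nltk.tokenize import TweetTokenizer

# strip_handles removes @mentions, reduce_len shortens long character runs,
# preserve_case=False lowercases ordinary words.
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

tokens = tokenizer.tokenize("@remy This is waaaaayyyy too much for you!!!!!!")

# Join the tokens again to get a sentence instead of a list.
sentence = " ".join(tokens)
```

This tokenizer is rule-based, so as far as I know it does not need any downloaded corpus data.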

The "re" package offers good solutions based on regular expressions.

I suggest creating a variable for temp_df.loc[:, "text"] so that it is easier to work with.

Removing stopwords from a sentence is described [here] ( Stopword removal with NLTK ): clean_wordlist = [i for i in sentence.lower().split() if i not in stopwords]
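Since you want a sentence back rather than a word list, the list comprehension can be wrapped and joined. This is only a sketch: `remove_stopwords` is a hypothetical helper name, and `stop_set` is a tiny hand-written set so the example runs without the NLTK corpus. In practice it would be built from `stopwords.words('english')`.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
stop_set = {"the", "is", "a", "of", "at", "and"}


def remove_stopwords(sentence, stop_set):
    """Drop stopwords wherever they occur and return a sentence, not a list."""
    kept = [word for word in sentence.lower().split() if word not in stop_set]
    return " ".join(kept)


result = remove_stopwords("The cat is at the door", stop_set)
ends_with_stopword = remove_stopwords("Look at the", stop_set)
```

Because the test is made per word, stopwords at the start or end of the sentence are handled the same way as those in the middle. Using a set instead of a list also makes each membership check faster.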

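If you want to use regular expressions (with the re package), you can

create a regex pattern made up of all the stopwords (outside the tweet_clean function): stop_pattern = re.compile(r'(?siu)\b(?:' + '|'.join(stoplist) + r')\b')
(?siu) sets the DOTALL, IGNORECASE and UNICODE flags; inline flags have to come at the start of the pattern, and \b keeps a stopword from matching inside a longer word
and use this pattern to clean any string: clean_string = stop_pattern.sub('', input_string)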

(If you do not need separate stopword lists, you can concatenate the 2 stopword lists.)
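Put together, the regex approach could look roughly like this. The two short lists are hypothetical stand-ins for `stopwords.words('english')` and your custom tweet list, and `strip_stopwords` is a helper name I made up. I pass `re.IGNORECASE` as an argument instead of using inline flags and escape each word in case one contains regex metacharacters:

```python
import re

# Hypothetical stand-ins for the two stopword lists in the question.
english_stopwords = ["the", "is", "at", "a"]
tweet_stopwords = ["rt", "via"]
stoplist = english_stopwords + tweet_stopwords  # one combined list

# Built once, outside the cleaning function. \b anchors each alternative
# to word boundaries so "a" does not match inside "market".
stop_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in stoplist) + r")\b",
    flags=re.IGNORECASE,
)


def strip_stopwords(input_string):
    """Remove stopwords anywhere in the string and tidy the whitespace."""
    without_stopwords = stop_pattern.sub(" ", input_string)
    return " ".join(without_stopwords.split())


clean_string = strip_stopwords("RT The market is strong via Reuters at")
```

A single pass over each tweet should also be much cheaper than running one DataFrame-wide replace per stopword, as the loop in the question does.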

To remove 1-word tweets, keep only the tweets that are longer than 1 word:
if len(tweet_sw_tokenized) > 1: keep_ones.append(tweet_sw)
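As a sketch, the filter can be written as one list comprehension. I use `str.split()` here instead of `nltk.word_tokenize` to keep the example dependency-free, which is an assumption about what counts as a word; the sample tweets are made up:

```python
tweets = ["good morning everyone", "hello", "", "   ", "buy now"]

# Keep only tweets with more than one whitespace-separated word.
# Empty and whitespace-only tweets split into zero words, so they drop out too.
keep_ones = [tweet for tweet in tweets if len(tweet.split()) > 1]
```

This may also explain why single-word tweets survive in the question: as far as I can tell, `temp_df.loc["text"] = np.nan` addresses a row labelled "text" rather than the tweet currently being inspected, so the short tweets themselves are never set to NaN.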

Regarding python - Text preprocessing with NLTK, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39452842/
