python - Remove stopwords/punctuation, tokenize, and apply Counter()

Tags: python python-3.x string counter tokenize

I wrote a function that removes stopwords and tokenizes the text, as follows:

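from nltk.tokenize import TweetTokenizer

def process(text, tokenizer=TweetTokenizer(), stopwords=[]):
    text = text.lower()
    tokens = tokenizer.tokenize(text)
    # Keep tokens that are neither stopwords nor purely numeric
    return [tok for tok in tokens if tok not in stopwords and not tok.isdigit()]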

I apply it to the column tweet['cleaned_text'] like this:

import string
from collections import Counter
from nltk.corpus import stopwords

punct = list(string.punctuation)
stopword_list = stopwords.words('english') + punct + ['rt', 'via', '...', '“', '”', '’']

tf = Counter()
for text in tweet['cleaned_text']:
    tokens = process(text, tokenizer=TweetTokenizer(), stopwords=stopword_list)
    tf.update(tokens)
for tag, count in tf.most_common(20):
    print("{}: {}".format(tag, count))

The output should be the most common words. Here is what I get:

#blm: 12718
black: 2751
#blacklivesmatter: 2054
people: 1375
lives: 1255
matter: 1039
white: 914
like: 751
police: 676
get: 564
movement: 563
support: 534
one: 534
racist: 532
know: 520
us: 471
blm: 449
#antifa: 414
hate: 396
see: 382

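As you can see, I cannot get rid of the hashtag symbol #, even though it is included in the punctuation list (some stopwords are apparent too). #blm and blm should be the same token, but they are counted separately.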

I must be missing something in my code.

Best answer

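When you process the tokens you keep each whole word. The stopword check compares complete tokens, and the tokenizer keeps a hashtag such as #blm as a single token, so the '#' entry in your list only removes a standalone '#'. If you want to get rid of the leading #, you can use str.strip("#"):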

def process(text, tokenizer=TweetTokenizer(), stopwords=[]):
    text = text.lower()
    tokens = tokenizer.tokenize(text)
    # strip("#") removes '#' from both ends of each token that passed the filter
    return [tok.strip("#") for tok in tokens if tok not in stopwords and not tok.isdigit()]
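One thing to watch with the version above: the stopword and digit checks run before the # is stripped, so a token like #the or #2020 would still slip through. A minimal sketch of a variant that strips first and filters afterwards is shown below. To keep it runnable without NLTK, it uses a hypothetical regex stand-in, simple_tokenize, which is an assumption and not part of the original code. It only mimics the fact that hashtags stay in one token.

```python
import re
from collections import Counter

# Stand-in tokenizer (an assumption, not NLTK): keeps "#tag" as one token
# and splits other punctuation into separate tokens.
def simple_tokenize(text):
    return re.findall(r"#?\w+|[^\w\s]", text)

def process(text, tokenize=simple_tokenize, stopwords=frozenset()):
    """Lowercase, tokenize, strip the leading '#', then filter."""
    cleaned = []
    for tok in tokenize(text.lower()):
        tok = tok.lstrip("#")  # "#blm" -> "blm", a lone "#" -> ""
        if tok and tok not in stopwords and not tok.isdigit():
            cleaned.append(tok)
    return cleaned

stopword_list = {"the", "rt", "#", "!", "."}
tweets = ["RT #BLM matters!", "blm and #the #2020 march."]

# Membership is tested on whole tokens: "#" is listed, but "#blm" is not "#".
print("#blm" in stopword_list)  # False

tf = Counter()
for tweet in tweets:
    tf.update(process(tweet, stopwords=stopword_list))
print(tf.most_common(1))  # [('blm', 2)]
```

To use this with NLTK, pass tokenize=TweetTokenizer().tokenize. lstrip("#") is used here instead of strip("#") so that only leading # characters are removed.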

For more on python - Remove stopwords/punctuation, tokenize, and apply Counter(), see the similar question on Stack Overflow: https://stackoverflow.com/questions/62934652/
