python - Keep text clean of URLs

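As part of an information retrieval project in Python (building a mini search engine), I want to keep only the clean text from downloaded tweets (a .csv dataset of tweets, 27,000 tweets to be exact). A tweet looks like this: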

"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." ‚Äî@POTUS https://twitter.com/OZRd5o4wRL

or

"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" ‚Äî@POTUS in Greece https://twitter.com/PIO9dG2qjX

I want to use a regular expression to remove the unwanted parts of each tweet, such as URLs, punctuation, and so on.

So the results would be:

"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"

and

"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"

I tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it does not work perfectly. For example, the URL is still present in the result.
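One way to see why the URL survives: the pattern has no notion of a URL, so it just breaks the link into runs of letters. The sketch below applies the same pattern with re.findall from the standard library, on the assumption that this approximates what the tokenizer does with it. The shortened sample string is made up from the first tweet above.

```python
import re

# Shortened, hypothetical sample based on the first tweet above.
tweet = 'They burn in every human heart 1234." @POTUS https://twitter.com/OZRd5o4wRL'

# The pattern from the question: runs of ASCII letters, or a single digit
# at the very start of the string (that is all ^[0-9] can match).
tokens = re.findall(r"[A-Za-z]+|^[0-9]", tweet)
print(tokens)
# ['They', 'burn', 'in', 'every', 'human', 'heart', 'POTUS',
#  'https', 'twitter', 'com', 'OZRd', 'o', 'wRL']
```

The digits and punctuation are dropped, but the letter fragments of the URL ("https", "twitter", "com", ...) come through as ordinary tokens, so the URL has to be removed before splitting into letters.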

Please help me find a regex pattern that does what I need.

Best answer

This may help.

Demo:

import re

s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" ‚Äî@POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." ‚Äî@POTUS https://twitter.com/OZRd5o4wRL"""    

def cleanString(text):
    res = []
    for token in text.strip().split():
        # Skip URL tokens. Note: this only works if the token contains http or https.
        if re.search(r"https?", token):
            continue
        # Turn dots into spaces, then drop everything that is not an ASCII letter or a space.
        res.append(re.sub(r"[^A-Za-z ]", "", token.replace(".", " ")))
    # Split and re-join to collapse the runs of spaces left by removed characters.
    return " ".join(" ".join(res).split())

print(cleanString(s1))
print(cleanString(s2))
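As an alternative, here is a minimal sketch that works on the whole string instead of token by token: first remove anything that looks like an http(s) URL, then replace every run of non-letters with a single space. The names clean_tweet, URL_RE and NON_LETTER_RE are hypothetical, and the URL pattern assumes links always start with http:// or https://, so a bare "www.example.com" would not be recognized as a URL.

```python
import re

# Assumption: URLs always start with http:// or https:// and end at whitespace.
URL_RE = re.compile(r"https?://\S+")
# Any run of characters that are not ASCII letters.
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")

def clean_tweet(text):
    """Remove URLs, then keep only ASCII letters separated by single spaces."""
    without_urls = URL_RE.sub(" ", text)
    letters_only = NON_LETTER_RE.sub(" ", without_urls)
    return letters_only.strip()

sample = '"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" @POTUS in Greece https://twitter.com/PIO9dG2qjX'
print(clean_tweet(sample))
# Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece
```

Because each run of non-letters becomes one space, no extra whitespace cleanup is needed. One trade-off to be aware of: apostrophes are treated like any other punctuation, so a word such as "don't" would come out as "don t".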

A similar question to "python - Keep text clean of URLs" can be found on Stack Overflow: https://stackoverflow.com/questions/51985530/
