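As part of an information retrieval project in Python (building a mini search engine), I want to keep only the clean text from downloaded tweets (a .csv dataset of tweets, 27,000 tweets to be exact). A tweet looks like this: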
"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL
or
"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX
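I want to use a regular expression to remove the unwanted parts of the tweets, such as URLs, punctuation, and so on.
So the results would be: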
"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"
and
"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"
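I tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]')
but it does not work perfectly; for example, the URL is still present in the result.
Please help me find a regex pattern that does what I want.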
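To see why the URL survives, the pattern can be tried with plain `re.findall`. This is only a sketch: it assumes the tokenizer in its default mode behaves like `findall` on the pattern, and the variable names `tweet` and `tokens` are made up for illustration. The pattern matches every run of letters wherever it occurs, so the URL is split into letter fragments rather than dropped.

```python
import re

tweet = ('"Democracy...allows us to peacefully work through our differences, '
         'and move closer to our ideals" —@POTUS in Greece '
         'https://twitter.com/PIO9dG2qjX')

# Same pattern as in the attempt above, applied with re.findall.
tokens = re.findall(r"[A-Za-z]+|^[0-9]", tweet)

# The letter runs inside the URL ("https", "twitter", "com", ...) are
# returned as ordinary tokens, which is why the URL text is not removed.
print(tokens)
```

Removing the URL therefore has to happen as a separate step before (or instead of) the letter-only match.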
Best answer
This may help.
Demo:
import re

s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL"""

def cleanString(text):
    res = []
    for i in text.strip().split():
        # Skip URLs. Note: this only works if the token contains "http" or "https".
        if not re.search(r"https?", i):
            # Keep only letters and dots, then turn each dot into a space.
            res.append(re.sub(r"[^A-Za-z.]", "", i).replace(".", " "))
    # Split and re-join to collapse the runs of spaces left behind by "...".
    return " ".join(" ".join(res).split())

print(cleanString(s1))
print(cleanString(s2))
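As an alternative sketch (not part of the original answer), the same cleanup can be done with two substitutions and no token loop. The names `URL_RE`, `NON_LETTER_RE`, and `clean_tweet` are hypothetical, and the approach assumes that URLs always start with `http://` or `https://` and that only ASCII letters should be kept, so accented characters would be dropped as well.

```python
import re

# Assumption: every URL starts with http:// or https:// and ends at whitespace.
URL_RE = re.compile(r"https?://\S+")
# Any run of characters that are not ASCII letters.
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")

def clean_tweet(text):
    """Remove URLs, then replace every run of non-letters with one space."""
    without_urls = URL_RE.sub(" ", text)
    return NON_LETTER_RE.sub(" ", without_urls).strip()

tweet = ('"The basic longing to live with dignity...these yearnings are '
         'universal. They burn in every human heart 1234." —@POTUS '
         'https://twitter.com/OZRd5o4wRL')
print(clean_tweet(tweet))
```

Because each run of non-letters becomes a single space, no extra whitespace cleanup is needed afterwards.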
Regarding "python - keeping text clean of URLs", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51985530/