python - How to extract or scrape all shortened URLs from a tweet?

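I want to extract the shortened URLs from a tweet, if there are any. These URLs follow a standard format: http://t.co (details here)

To do this, I used the following regular expression, which works fine when I test it against tweet text stored as a string.

Note: I am using https://shortnedurl/string instead of a real shortened URL, because StackOverflow does not allow such URLs to be posted here.

Sample code: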

import re

tweet = "Grim discovery in the USS McCain collision probe https://shortnedurl.com @MattRiversCNN reports #TheLead"

# Generic URL pattern; the raw string keeps the backslash escapes intact.
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
                  tweet)
for url in urls:
    print("printing urls", url)

Output of this code:

printing urls https://shortnedurl.com

However, when I read the tweet through Twitter's API and run the same regular expression on it, I get the following unwanted output.

printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string</a></span>
printing urls https://twitter.com/MattRiversCNN
printing urls https://twitter.com/search?q=%23TheLead

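It seems to be picking up the URLs for the Twitter handle and the hashtag as well.

How can I handle this? I only want to read the http://t.co URLs.

Update 1: I tried https?://t.co/\S*, but I still get the following noisy URLs: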

printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>

I don't know why the same URL is found again with </a><span>.

With https?://t.co/\S+, I get an invalid URL, because it merges the two URLs above into one:

printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>

Update 2: The tweet text looks a bit different from what I expected:

    Grim discovery in the USS McCain collision probe 
<span class="link"><a href="https://shortenedurl">https://shortenedurl</a></span> <span class="username"><a 
href="https://twitter.com/MattRiversCNN">@MattRiversCNN</a></span>
     reports <span class="tag"><a href="https://twitter.com/search?
    q=%23TheLead">#TheLead</a></span>

Best answer

You can use the regular expression

https?://t\.co/\S+
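
As Update 1 shows, \S+ can run past the closing quote of an href when the tweet text is HTML, since there is no whitespace between the attribute value and the link text. The following is a minimal sketch of two possible workarounds, not part of the original answer: a tighter character class that stops at quotes and angle brackets, and reading the href attributes with the standard library's html.parser. The sample markup mirrors Update 2, but the t.co path and the names TCO_PATTERN, tco_urls_by_regex, HrefCollector and tco_urls_by_parser are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample modelled on the markup in Update 2; the t.co path is invented.
tweet_html = (
    'Grim discovery in the USS McCain collision probe '
    '<span class="link"><a href="https://t.co/AbC123">https://t.co/AbC123</a></span> '
    '<span class="username"><a href="https://twitter.com/MattRiversCNN">@MattRiversCNN</a></span> '
    'reports <span class="tag"><a href="https://twitter.com/search?q=%23TheLead">#TheLead</a></span>'
)

# Stop at whitespace, quotes and angle brackets so a match cannot run into the markup.
TCO_PATTERN = re.compile(r'https?://t\.co/[^\s"\'<>]+')


def tco_urls_by_regex(text):
    """Return the unique t.co URLs in text, in order of first appearance."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(TCO_PATTERN.findall(text)))


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def tco_urls_by_parser(html_text):
    """Return the hrefs that are t.co links, ignoring handle and hashtag links."""
    collector = HrefCollector()
    collector.feed(html_text)
    return [h for h in collector.hrefs if TCO_PATTERN.fullmatch(h)]


print(tco_urls_by_regex(tweet_html))   # ['https://t.co/AbC123']
print(tco_urls_by_parser(tweet_html))  # ['https://t.co/AbC123']
```

The regex variant also works on plain-text tweets, while the parser variant only makes sense when the text really is HTML; which one fits depends on what the API returns.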

For "python - How to extract or scrape all shortened URLs from a tweet?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45835591/
