python-2.7 - How to extract URLs that match a pattern

Tags: python-2.7 web-scraping beautifulsoup python-requests

I am trying to extract URLs from a web page that match the following pattern:

'http://www.realclearpolitics.com/epolls/????/governor/??/-.html'

My current code extracts all links. How can I change it to extract only the URLs that match the pattern? Thanks!

import requests
from bs4 import BeautifulSoup

def find_governor_races(url):
    # Download the page and parse it.
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    # Collect the href of every <a> tag that has one.
    links = []
    for a in soup.find_all('a', href=True):
        links.append(a['href'])
    return links

find_governor_races('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html')

Best answer

You can pass a compiled regular expression pattern as the value of the href argument to .find_all():

import re

# Escape the dots so they match a literal "." rather than any character.
pattern = re.compile(r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html")
links = soup.find_all("a", href=pattern)
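
If the hrefs have already been collected into a list, as in the question's code, the same kind of pattern can also be applied afterwards. The following is a minimal sketch under several assumptions: it targets Python 3 (for urllib.parse), it reads "????" as a four-digit year and "??" as a two-letter lowercase code, it resolves relative links against the site root in case the page uses them, and the sample hrefs and the helper name filter_governor_races are invented for illustration.

```python
import re
from urllib.parse import urljoin

# Assumed base URL for resolving relative links (taken from the question's code).
BASE_URL = "http://www.realclearpolitics.com/"

# Assumption: "????" is a four-digit year and "??" a two-letter lowercase code.
# Dots are escaped so they match a literal "." only.
GOVERNOR_RACE = re.compile(
    r"http://www\.realclearpolitics\.com/epolls/\d{4}/governor/[a-z]{2}/[^/]+\.html"
)

def filter_governor_races(hrefs, base_url=BASE_URL):
    """Return the absolute URLs from hrefs that match the whole pattern."""
    matches = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # absolute URLs pass through unchanged
        if GOVERNOR_RACE.fullmatch(absolute):
            matches.append(absolute)
    return matches

# Hypothetical sample hrefs, not taken from the real page.
sample = [
    "http://www.realclearpolitics.com/epolls/2010/governor/ca/example_race-1234.html",
    "/epolls/2010/governor/ny/another_race-5678.html",
    "http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html",
    "http://www.realclearpolitics.com/epolls/2010/senate/ca/some_race-1.html",
]
print(filter_governor_races(sample))
```

Using fullmatch() here rejects URLs that merely contain the pattern somewhere inside them, which is stricter than the search-style matching of the one-liner above; whether that strictness is wanted depends on the links actually present on the page.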

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/37285246/
