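I am trying to extract URLs from a web page that match the following pattern:
'http://www.realclearpolitics.com/epolls/????/governor/??/-.html'
My current code extracts every link on the page. How can I change it so that it extracts only the URLs matching the pattern? Thanks!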
import requests
from bs4 import BeautifulSoup

def find_governor_races(html):
    url = html
    base_url = 'http://www.realclearpolitics.com/'
    page = requests.get(html).text
    soup = BeautifulSoup(page, 'html.parser')
    links = []
    # Collects the href of every <a> tag, with no filtering yet
    for a in soup.findAll('a', href=True):
        links.append(a['href'])
    return links

find_governor_races('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html')
Best answer
You can pass a compiled regular expression pattern as the value of the href argument to .find_all():
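import re

# Dots are escaped so they match a literal "." rather than any character
pattern = re.compile(r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html")
# find_all returns the matching <a> tags; read a["href"] to get each URL
links = soup.find_all("a", href=pattern)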
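To check what such a pattern accepts and rejects without fetching the page, here is a minimal sketch that uses only the standard library. It assumes the `????` in the question means a four-digit year, `??` a two-letter state code, and `-` a page name containing a hyphen. The helper `filter_governor_urls`, the constant names, and the sample hrefs are all hypothetical and are not taken from the live site. Relative hrefs are resolved against the base URL first, since a page may mix relative and absolute links.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.realclearpolitics.com/"

# Assumed reading of the question's pattern: 4-digit year,
# 2-letter state code, page name containing a hyphen.
GOVERNOR_RACE = re.compile(
    r"http://www\.realclearpolitics\.com/epolls/\d{4}/governor/[a-z]{2}/[^/]+-[^/]+\.html"
)

def filter_governor_urls(hrefs, base_url=BASE_URL):
    """Return the absolute URLs from hrefs that match GOVERNOR_RACE, in order."""
    matches = []
    for href in hrefs:
        # Resolve relative hrefs; absolute ones pass through unchanged
        absolute = urljoin(base_url, href)
        # fullmatch requires the whole URL to fit the pattern
        if GOVERNOR_RACE.fullmatch(absolute):
            matches.append(absolute)
    return matches

# Made-up hrefs for illustration only
sample_hrefs = [
    "http://www.realclearpolitics.com/epolls/2010/governor/ca/sample_race-1234.html",
    "/epolls/2010/governor/ny/sample_race-5678.html",
    "http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html",
    "/epolls/2010/senate/ca/sample_race-9999.html",
]

for url in filter_governor_urls(sample_hrefs):
    print(url)
```

Only the first two sample hrefs pass: the map page has no state-code directory, and the last one is under senate rather than governor. Note that when a compiled pattern is passed as href to .find_all(), Beautiful Soup applies it with search() to the raw attribute value, so a relative href would not match a pattern that begins with the host name; in that case the pattern could start at /epolls/ instead.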
Regarding python-2.7 - How to extract URLs matching a pattern, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37285246/