python - Scraping href links

Tags: python xml web-scraping beautifulsoup

I'm trying to collect specific links on this page by matching them against a set of keywords. So far I have:

from bs4 import BeautifulSoup
import random
url = 'http://www.thenextdoor.fr/en/4_adidas-originals'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
raw = soup.findAll('a', {'class':'add_to_compare'})
links = raw['href']
keyword1 = 'adidas'
keyword2 = 'thenextdoor'
keyword3 = 'uncaged'
for link in links:
    text = link.text
    if keyword1 in text and keyword2 in text and keyword3 in text:

I am trying to extract this link

Best answer

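You can use all() to check whether every keyword is present in a link, and any() to check whether at least one of them is. Note that `in` on strings is case-sensitive, so 'uncaged' does not match a URL containing 'Uncaged'.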

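from bs4 import BeautifulSoup
import requests

# Download the page and parse it; naming the parser explicitly avoids bs4's "no parser specified" warning.
res = requests.get("http://www.thenextdoor.fr/en/4_adidas-originals").content
soup = BeautifulSoup(res, 'lxml')

# Collect the href of every <a class="add_to_compare"> tag.
atags = soup.find_all('a', {'class': 'add_to_compare'})
links = [atag['href'] for atag in atags]
keywords = ['adidas', 'thenextdoor', 'Uncaged']

# Print only the links that contain every keyword (case-sensitive match).
for link in links:
    if all(keyword in link for keyword in keywords):
        print(link)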

Output:

http://www.thenextdoor.fr/en/clothing/2042-adidas-originals-Ultraboost-Uncaged-2303002052017.html
http://www.thenextdoor.fr/en/clothing/2042-adidas-originals-Ultraboost-Uncaged-2303002052017.html
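
The same link is printed twice because it appears on two matching tags. As a minimal sketch of the all()/any() idea that also ignores case and drops duplicates, the filtering step could be pulled into a helper. The name filter_links, its require_all parameter, and every sample URL except the first are assumptions made for illustration; they are not part of the original answer.

```python
def filter_links(links, keywords, require_all=True):
    """Return unique links that contain the keywords, ignoring case.

    require_all=True uses all() (every keyword must appear);
    require_all=False uses any() (one keyword is enough).
    """
    match = all if require_all else any
    lowered = [keyword.lower() for keyword in keywords]
    seen = set()
    result = []
    for link in links:
        target = link.lower()
        if link not in seen and match(k in target for k in lowered):
            seen.add(link)
            result.append(link)
    return result


# Sample hrefs: the first is taken from the output above, the rest are made up.
links = [
    "http://www.thenextdoor.fr/en/clothing/2042-adidas-originals-Ultraboost-Uncaged-2303002052017.html",
    "http://www.thenextdoor.fr/en/clothing/2042-adidas-originals-Ultraboost-Uncaged-2303002052017.html",
    "http://www.thenextdoor.fr/en/clothing/0000-adidas-originals-example.html",
    "http://www.example.com/other.html",
]

# Lowercase 'uncaged' still matches because both sides are lowercased.
all_hits = filter_links(links, ["adidas", "thenextdoor", "uncaged"])
any_hits = filter_links(links, ["uncaged", "example"], require_all=False)
print(all_hits)
print(any_hits)
```

With require_all=True only the Ultraboost link is returned, once. With require_all=False any link containing at least one keyword is kept, in the order it first appeared.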

This question and answer on scraping href links with Python come from Stack Overflow: https://stackoverflow.com/questions/41645998/
