python - BeautifulSoup: select all hrefs in elements with a specific class

I am trying to scrape images from this website. I tried Scrapy (using Docker) and Scrapy with Selenium. Scrapy does not seem to work on Windows 10 Home, so I am now trying Selenium with BeautifulSoup. I am using Python 3.6 with Spyder in an Anaconda environment.

This is what the href elements I need look like:

<a class="emblem" href="detail/emblem/av1615001">

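I have two major issues:
- How should I select the href with BeautifulSoup? In my code below you can see what I tried (without success).
- As can be seen, the href is only a partial path of the URL. How should I deal with that?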

This is my code so far:

from bs4 import BeautifulSoup
from time import sleep
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import urllib 
import requests
from os.path  import basename


def start_requests(self):
        self.driver = webdriver.Firefox("C:/Anaconda3/envs/scrapy/selenium/webdriver")
        #programPause = input("Press the <ENTER> key to continue...")
        self.driver.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
        html = self.driver.page_source

        #html = requests.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
        soup = BeautifulSoup(html, "html.parser")        
        emblemshref = soup.select("a", {"class" : "emblem", "href" : True})

        for href in emblemshref:
            link = href["href"]
            with open(basename(link)," wb") as f:
                f.write(requests.get(link).content)

        #click on "next>>"         
        while True:
            try:
                next_page = self.driver.find_element_by_xpath("//a[@id='next']")
                sleep(3)
                self.logger.info('Sleeping for 3 seconds')
                next_page.click()

                #here again the same emblemshref loop 

            except NoSuchElementException:
                #execute next on the last page
                self.logger.info('No more pages to load') 
                self.driver.quit()
                break

Best answer

You can get the href by class name:

Question 1:

for link in soup.find_all('a', {'class': 'emblem'}):
    try:
        print(link['href'])
    except KeyError:
        # Skip anchors that have no href attribute
        pass
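
For the second question (the href being only a partial path), one common approach is to resolve each href against the URL of the page it was found on, using `urljoin` from the standard library. The sketch below is an illustration, not part of the original answer: `PAGE_URL` is copied from the question's code, the sample href is the one shown in the question, and `to_absolute` is a hypothetical helper name. It assumes the links are meant to be resolved relative to the listing page; if the site declares a `<base>` tag or expects root-relative paths, the base passed to `urljoin` would need adjusting.

```python
from urllib.parse import urljoin

# URL of the listing page, taken from the question's code.
PAGE_URL = ("http://emblematica.grainger.illinois.edu/browse/emblems"
            "?Filter.Collection=Utrecht&Skip=0&Take=18")


def to_absolute(page_url, hrefs):
    """Resolve each (possibly relative) href against the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# The href value shown in the question. In practice the list could come from
# [a["href"] for a in soup.find_all("a", class_="emblem", href=True)].
links = to_absolute(PAGE_URL, ["detail/emblem/av1615001"])
print(links[0])
```

`urljoin` drops the query string of the base and replaces its last path segment with the relative path. Hrefs that are already absolute are returned unchanged, so the helper can be applied to every link without checking first.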

Regarding python - BeautifulSoup: select all hrefs in elements with a specific class, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47653309/
