python - HTTP error raised by urllib's urlopen when trying to extract URLs

Tags: python selenium web-scraping

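I am trying to build a script that scrapes Craigslist for Mazda Miata listings. I get an error when the function "extract_post_urls" makes its request. This is the tutorial I am trying to follow: https://github.com/vprusso/youtube_tutorials/blob/master/web_scraping_and_automation/selenium/craigstlist_scraper.py

Here is the code so far: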

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

from bs4 import BeautifulSoup
import urllib.request

class CraigslistScaper(object):
    def __init__(self,query,location,max_price,transmission):
        self.query = query
#        self.sort=sort
        self.location = location
#        self.postal = postal
        self.max_price = max_price
        # Use the constructor argument, not the module-level auto_transmission
        self.transmission = transmission


#https://sfbay.craigslist.org/search/cta?query=mazda+miata&sort=rel&max_price=6000&auto_transmission=1
        self.url = f"https://{location}.craigslist.org/search/cta?query={query}&sort=rel&max_price={max_price}&auto_transmission={transmission}"

        self.driver = webdriver.Chrome('/Users/MyLaptop/Desktop/chromedriver')
        self.delay = 5

    def load_craigslist_url(self):
        self.driver.get(self.url)
        try:
            wait = WebDriverWait(self.driver, self.delay)
            wait.until(EC.presence_of_element_located((By.ID,"searchform")))
            print("page is ready")
        # WebDriverWait raises Selenium's TimeoutException, not the built-in TimeoutError
        except TimeoutException:
            print('Loading took too much time')

    def extract_post_titles(self):
        all_posts = self.driver.find_elements_by_class_name('result-row')
        post_titles_list=[]
        for post in all_posts:
            print(post.text)
            post_titles_list.append(post.text)
        # Return the collected titles so callers can use them
        return post_titles_list

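    def extract_post_urls(self):
        url_list = []
#        req = Request(self.url)
        html_page = urllib.request.urlopen(self.url)
        soup = BeautifulSoup(html_page,'lxml')
        # The tag name must be "a" (no trailing space); the link class is "hdrlnk"
        for link in soup.findAll("a", {"class": "result-title hdrlnk"}):
            print(link["href"])
            url_list.append(link["href"])
        # Return after the loop so every URL is collected, not just the first
        return url_list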

    def quit(self): 
        self.driver.close()

location = "sfbay" 
#postal = "94519" 
max_price = "5000"
#radius = "250"
auto_transmission = 1
query = "Mazda Miata"

scraper = CraigslistScaper(query,location,max_price,auto_transmission)        

scraper.load_craigslist_url()
scraper.extract_post_titles()
scraper.extract_post_urls()
scraper.quit()

This is the error I get:

File "<ipython-input-2-edb38e647dc0>", line 1, in <module>
    runfile('/Users/MyLaptop/.spyder-py3/CraigslistScraper', wdir='/Users/MohitAsthana/.spyder-py3')

  File "/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/Users/MyLaptop/.spyder-py3/CraigslistScraper", line 73, in <module>
    scraper.extract_post_urls()

  File "/Users/MyLaptop/.spyder-py3/CraigslistScraper", line 52, in extract_post_urls
    html_page = urllib.request.urlopen(req)

  File "/anaconda3/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)

  File "/anaconda3/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)

  File "/anaconda3/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)

  File "/anaconda3/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)

  File "/anaconda3/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)

  File "/anaconda3/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError: Bad Request

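Chrome opens the correct URL, but the error occurs when the script downloads the page at that URL.

Accepted answer

There are 2 problems with this line: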

self.url = f"https://{location}.craigslist.org/search/cta?query={query}&sort=rel&max_price={max_price}&auto_transmission={transmission}"
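  1. What kind of string is f"blah"? It may just be a typo in the post, but I thought I would point it out.

  2. "https://{location}.craigslist.org/search/cta?query={query}&sort=rel&max_price={max_price}&auto_transmission={transmission}" is not a valid URL. At what point do you substitute your values (e.g. self.transmission) into that string?

Replace that line with: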

self.url = "https://{}.craigslist.org/search/cta?query={}&sort=rel&max_price={}&auto_transmission={}".format(self.location, self.query, self.max_price, self.transmission)

See if that helps. If it does not, print the URL instead of requesting it.
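
A minimal sketch of the "print the URL" check, using the values from the question. Note that f"..." is a formatted string literal, available since Python 3.6 (the version shown in the traceback), so it should produce the same string as the .format() call. The printed URL contains a raw space from "Mazda Miata". A browser encodes that space automatically, while urllib.request.urlopen sends the URL as given, which is one plausible cause of the Bad Request. Building the query string with urllib.parse.urlencode is an assumption about a possible fix, not something confirmed in the original thread; the names base, params and url_encoded are illustrative.

```python
from urllib.parse import urlencode

# Values taken from the question's script.
location = "sfbay"
query = "Mazda Miata"
max_price = "5000"
transmission = 1

base = f"https://{location}.craigslist.org/search/cta"

# An f-string (Python 3.6+) and str.format() build the same string.
url_fstring = f"{base}?query={query}&sort=rel&max_price={max_price}&auto_transmission={transmission}"
url_format = "{}?query={}&sort=rel&max_price={}&auto_transmission={}".format(
    base, query, max_price, transmission
)
print(url_fstring)  # note the unencoded space inside "Mazda Miata"

# urlencode() escapes each value; with its defaults a space becomes "+".
params = {
    "query": query,
    "sort": "rel",
    "max_price": max_price,
    "auto_transmission": transmission,
}
url_encoded = f"{base}?{urlencode(params)}"
print(url_encoded)
```

If the encoded URL is the one that works, self.url in the class could be built the same way before it is passed to urllib.request.urlopen.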

Source: a similar question on Stack Overflow, "HTTP error raised by urllib's urlopen when trying to extract URLs": https://stackoverflow.com/questions/50205073/