python - JSONDecodeError when scraping a target site with Scrapy

Tags: python, json, web-scraping, scrapy

Until three days ago I was able to scrape the target site, but then it started throwing the error posted below. When I look at the site's source code, I can't see any changes, and Scrapy still gets a 200 response. I'm using proxies and user agents, and I've rotated both, but the result is the same: I keep getting a JSON decode error.

The error:

File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
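For reference, this exact traceback is what json.loads raises whenever the string it receives is not valid JSON. An empty string (or an HTML block page instead of JSON) produces precisely "Expecting value: line 1 column 1 (char 0)", which is a hint that the extraction step upstream found nothing, rather than that the site's JSON is malformed:

```python
import json

# An empty string is the classic trigger for this exact message.
try:
    json.loads('')
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```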

My code:

import scrapy
import json
import datetime
import bs4
import re
import time
from requests.models import PreparedRequest
import logging
from hepsibura_spider.items import HepsiburaSpiderItem
from scrapy.crawler import CrawlerProcess

class HepsiburaSpider(scrapy.Spider):
    name = 'hepsibura'
    # allowed_domains = ['www.hepsibura.com']
    handle_httpstatus_list = [301]
    def start_requests(self):
        urls = [
            'https://www.hepsiburada.com/monitor-bilgisayarlar-c-116465?filtreler=satici:Hepsiburada;?_random_number={rn}#tabIndex=0',
            
        ]
        for url in urls:
            params = []
            # added a meta to provide the used url here
            main_url, parameters = url.split('&') if '&' in url else url, None
            parameters = parameters.split(':') if parameters else []
            for parameter in parameters:
                key, value = parameter.split('=')
                params.append((key.strip(), value.strip()))

            # params.append(('main_url', main_url))

            if 'sayfa' not in dict(params):
                params.append(('sayfa', '1'))

            yield scrapy.Request(
                url=url.format(rn=time.time()),
                callback=self.parse_json,
                meta={
                    'main_url': main_url,
                    'params': dict(params),
                },
                headers={
                    'Cache-Control': 'store, no-cache, must-revalidate, post-check=0, pre-check=0',
                    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.5134.152 Safari/537.36',
                }
            )
    
    def parse_json(self, response):

        if response.status == 301:
            logging.log(logging.INFO, 'Finished scraping')
            return
        current_url = response.request.url.split('&')[0].strip()
        parameters = response.meta.get('params')

        soup = bs4.BeautifulSoup(response.text,'lxml')
        scripts = soup.select('script')
        data_script = ''
        for script in scripts:
            # print(script.text)
            if 'window.MORIA.PRODUCTLIST = {' in str(script):
                print('Found the data')
                data_script = str(script)
                break
        
        data_script = data_script.replace('<script type="text/javascript">','').replace('window.MORIA = window.MORIA || {};','').replace('window.MORIA.PRODUCTLIST = {','').replace('\'STATE\': ', '').replace('</script>','')[:-4]
        json_data = json.loads(data_script)
        products = json_data['data']['products']
        for product in products:
            item = HepsiburaSpiderItem()

            item['rowid'] = hash(str(datetime.datetime.now()) + str(product['productId']))
            item['date'] = str(datetime.datetime.now())
            item['listing_id'] = product['variantList'][0]["listing"]["listingId"]
            item['product_id'] = product['variantList'][0]["sku"].lower()
            item['product_name'] = product['variantList'][0]['name']
            item['price'] = float(product['variantList'][0]['listing']['priceInfo']['price'])
            item['url'] = 'https://www.hepsiburada.com' + product['variantList'][0]["url"]
            item['merchantName'] = product['variantList'][0]["listing"]["merchantName"].lower()

            yield item
        
        parameters['sayfa'] = int(parameters['sayfa']) + 1
        req = PreparedRequest()
        req.prepare_url(current_url, parameters)

        yield scrapy.Request(
            url=req.url,
            callback=self.parse_json,
            meta={
                'params': parameters,
            },
            headers={
                'Cache-Control': 'store, no-cache, must-revalidate, post-check=0, pre-check=0',
                'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.5134.152 Safari/537.36',
            }
        )


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(HepsiburaSpider)
    process.start()
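As an aside, the pagination above leans on requests' PreparedRequest.prepare_url to merge the parameter dict back into the URL as a query string. In isolation (with a made-up page number) it behaves like this:

```python
from requests.models import PreparedRequest

# prepare_url(url, params) appends the params dict as a query string
req = PreparedRequest()
req.prepare_url('https://www.hepsiburada.com/monitor-bilgisayarlar-c-116465',
                {'sayfa': '2'})
print(req.url)  # https://www.hepsiburada.com/monitor-bilgisayarlar-c-116465?sayfa=2
```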

I found something. The site changed the JSON format; each request now generates a unique id:

window.MORIA.PRODUCTLIST = Object.assign(window.MORIA.PRODUCTLIST || {}, {
'60cada8e-57dd-466e-f7af-62efca4fa8a8': {

How can I get around this?

Thanks.

Best answer

There is really no need to use BeautifulSoup together with Scrapy.

The problem is that data_script ends up empty.

Get rid of the loop: just use an XPath selector to pick the script tag that contains that text, then use the re_first() function to extract the JSON string.

Also, you may want to check that data is not empty before using it.

# soup = bs4.BeautifulSoup(response.text, 'lxml')
# scripts = soup.select('script')
# data_script = ''
# for script in scripts:
#     # print(script.text)
#     if 'window.MORIA.PRODUCTLIST = {' in str(script):
#         print('Found the data')
#         data_script = str(script)
#         break
data = response.xpath('//script[contains(text(), "window.MORIA.PRODUCTLIST")]/text()').re_first(r'\'STATE\': ({.+})')

json_data = json.loads(data)
products = json_data['data']['products']
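To see why this sidesteps the per-request UUID wrapper, here is the same regex applied to a hypothetical snippet of the new script format (the sample text and product id below are made up for illustration):

```python
import re
import json

# Hypothetical script body mimicking the site's new format, where the
# top-level key is a UUID that changes on every request.
script_text = """window.MORIA.PRODUCTLIST = Object.assign(window.MORIA.PRODUCTLIST || {}, {
'60cada8e-57dd-466e-f7af-62efca4fa8a8': {
'STATE': {"data": {"products": [{"productId": "HBCV000013TLAW"}]}}
}});"""

# The regex anchors on 'STATE': and captures the object that follows it,
# so the surrounding UUID key never matters.
match = re.search(r"'STATE': ({.+})", script_text)
data = match.group(1) if match else None
if data:  # guard against an empty match, as suggested above
    products = json.loads(data)['data']['products']
    print(products[0]['productId'])  # HBCV000013TLAW
```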

Output:

{'rowid': -1443611402678861624, 'date': '2022-09-18 16:25:58.168075', 'listing_id': 'fd2eb812-f483-4233-bfea-610490e16014', 'product_id': 'hbcv000013tlaw', 'product_name': 'MSI PRO\xa016T 10M-043TR Intel Celeron 5205U 4GB\xa0128GB SSD Windows 10 Pro 15.6" All In One Bilgisayar', 'price': 12491.21, 'url': 'https://www.hepsiburada.com/msi-pro-16t-10m-043tr-intel-celeron-5205u-4gb-128gb-ssd-windows-10-pro-15-6-all-in-one-bilgisayar-p-HBCV000013TLAW?magaza=Hepsiburada', 'merchantName': 'hepsiburada'}
DEBUG: Scraped from <200 https://www.hepsiburada.com/monitor-bilgisayarlar-c-116465?filtreler=satici:Hepsiburada;?_random_number=1663507557.5583432>
{'rowid': 557951834722614927, 'date': '2022-09-18 16:25:58.168075', 'listing_id': '76835cb4-74d3-498b-821a-58311752c934', 'product_id': 'hbcv0000065deb', 'product_name': 'Apple iMac M1 Çip 8GB 512GB SSD macOS Retina 24" FHD All In One Bilgisayar MGPJ3TU/A Yeşil', 'price': 34998.99, 'url': 'https://www.hepsiburada.com/apple-imac-m1-cip-8gb-512gb-ssd-macos-retina-24-fhd-all-in-one-bilgisayar-mgpj3tu-a-yesil-p-HBCV0000065DEB?magaza=Hepsiburada', 'merchantName': 'hepsiburada'}
DEBUG: Scraped from <200 https://www.hepsiburada.com/monitor-bilgisayarlar-c-116465?filtreler=satici:Hepsiburada;?_random_number=1663507557.5583432>
{'rowid': 2200588215358298971, 'date': '2022-09-18 16:25:58.168075', 'listing_id': '17fd6809-afa9-4e21-a772-509864d9bf28', 'product_id': 'hbcv000014z6fy', 'product_name': 'MSI MODERN AM241 11M-298TR Intel Pentium 7505 4GB 128GB SSD Windows 10 Pro 23.8" All In One Bilgisayar', 'price': 15335.09, 'url': 'https://www.hepsiburada.com/msi-modern-am241-11m-298tr-intel-pentium-7505-4gb-128gb-ssd-windows-10-pro-23-8-all-in-one-bilgisayar-p-HBCV000014Z6FY?magaza=Hepsiburada', 'merchantName': 'hepsiburada'}
DEBUG: Scraped from <200 https://www.hepsiburada.com/monitor-bilgisayarlar-c-116465?filtreler=satici:Hepsiburada;?_random_number=1663507557.5583432>
{'rowid': 2433557015268455354, 'date': '2022-09-18 16:25:58.170501', 'listing_id': '0f0e1577-f1c2-4df5-ae9c-6c754317e998', 'product_id': 'hbcv00001eo94e', 'product_name': 'MSI MODERN AM271P 11M-021XTR Intel Core i7 1165G7 16GB 512GB SSD Freedos 27" FHD All In One Bilgisayar', 'price': 26754.38, 'url': 'https://www.hepsiburada.com/msi-modern-am271p-11m-021xtr-intel-core-i7-1165g7-16gb-512gb-ssd-freedos-27-fhd-all-in-one-bilgisayar-p-HBCV00001EO94E?magaza=Hepsiburada', 'merchantName': 'hepsiburada'}
Scraped from <200 https://www.hepsiburada.com/monitor-bilgisayarlar-c-116465?filtreler=satici:Hepsiburada;?_random_number=1663507557.5583432>
...
...
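A further option, should the STATE regex ever break: since the outer object appears to have exactly one key (the per-request UUID), you could capture the whole assigned object and take its only value, never touching the key. This sketch assumes the outer object has already been normalized into valid JSON (the UUID and payload here are made up):

```python
import json

# Made-up payload: the top-level key is a per-request UUID we cannot predict.
payload = json.loads(
    '{"60cada8e-57dd-466e-f7af-62efca4fa8a8": {"STATE": {"data": {"products": []}}}}'
)

# next(iter(...)) grabs the single value without knowing its key.
state = next(iter(payload.values()))['STATE']
print(state['data']['products'])  # []
```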

Regarding python - JSONDecodeError when scraping a target site with Scrapy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/73762114/
