python - Scrapy output JSON or CSV

Tags: python json excel csv scrapy

I am trying to scrape a website with this code, in settings.py:

FEED_EXPORT_ENCODING = 'utf-8'

import datetime
now = datetime.datetime.now()
formatted = now.strftime("%Y%m%d_%H%M")
FEED_URI = f'\\C:\\Users\\Acer\\Desktop\\{formatted}.csv'
FEED_TYPE = 'csv'

And with this, in special_offers.py:
# -*- coding: utf-8 -*-
import scrapy
import datetime


class SpecialOffersSpider(scrapy.Spider):
    name = 'special_offers'
    allowed_domains = ['www.tinydeal.com']

    def start_requests(self):
        yield scrapy.Request(url='https://www.tinydeal.com/specials.html', callback=self.parse, headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
        })

    def parse(self, response):
        for product in response.xpath("//ul[@class='productlisting-ul']/div/li"):
            yield {
                'title': product.xpath(".//a[@class='p_box_title']/text()").get(),
                'url': response.urljoin(product.xpath(".//a[@class='p_box_title']/@href").get()),
                'discounted_price': product.xpath(".//div[@class='p_box_price']/span[1]/text()").get(),
                'original_price': product.xpath(".//div[@class='p_box_price']/span[2]/text()").get(),
                'User-Agent': response.request.headers['User-Agent'].decode('utf-8'),
                'datetime': datetime.datetime.now().strftime("%Y%m%d %H%M")

            }

        next_page = response.xpath("//a[@class='nextPage']/@href").get()

        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse, headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
            })

Then I open the terminal and run:
scrapy crawl special_offers

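The problem is that when I export JSON, there are no commas between the objects (between } and {). As a result, Power BI, for example, cannot read my file.

When I export CSV, the data does not look the way I expect when I open it in Excel.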

CSV data sample
{"title": "ABS Plastic Case for Raspberry Pi 3 Model B and Raspberry Pi 2 E-524988", "url": "https://www.tinydeal.com/abs-plastic-case-for-raspberry-pi-3-model-b-raspberry-pi-2-p-163950.html", "discounted_price": "R$12.74", "original_price": "R$13.66", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36", "datetime": "20200420 2330"}
{"title": "3M 9001 KN90 Dust Masks Respirator Anti-dust PM2.5 Industrial Construction Polle RTH-562440", "url": "https://www.tinydeal.com/3m-9001-kn90-dust-masks-respirator-anti-dust-pm25-industrial-construction-polle-p-179487.html", "discounted_price": "R$10.29", "original_price": "R$12.40", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36", "datetime": "20200420 2330"}
{"title": "2-in-1 Vintage Blue Rhinestone Necklace + Earring Jewelry Set DJA-562974", "url": "https://www.tinydeal.com/2-in-1-vintage-blue-rhinestone-necklace-earring-jewelry-set-p-180097.html", "discounted_price": "R$11.77", "original_price": "R$30.77", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36", "datetime": "20200420 2330"}
{"title": "64GB USB 2.0 Flash Drive USB Pen Drive U Disk EFM-561923", "url": "https://www.tinydeal.com/64gb-usb-20-flash-drive-usb-pen-drive-u-disk-p-178875.html", "discounted_price": "R$34.83", "original_price": "R$99.43", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36", "datetime": "20200420 2330"}


JSON data sample

{
"title": "ABS Plastic Case for Raspberry Pi 3 Model B and Raspberry Pi 2 E-524988",
"url": "https://www.tinydeal.com/abs-plastic-case-for-raspberry-pi-3-model-b-raspberry-pi-2-p-163950.html",
"discounted_price": "R$12.74",
"original_price": "R$13.66",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
"datetime": "20200420 2329"
}
{
"title": "3M 9001 KN90 Dust Masks Respirator Anti-dust PM2.5 Industrial Construction Polle RTH-562440",
"url": "https://www.tinydeal.com/3m-9001-kn90-dust-masks-respirator-anti-dust-pm25-industrial-construction-polle-p-179487.html",
"discounted_price": "R$10.29",
"original_price": "R$12.40",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
"datetime": "20200420 2329"
}
{
"title": "2-in-1 Vintage Blue Rhinestone Necklace + Earring Jewelry Set DJA-562974",
"url": "https://www.tinydeal.com/2-in-1-vintage-blue-rhinestone-necklace-earring-jewelry-set-p-180097.html",
"discounted_price": "R$11.77",
"original_price": "R$30.77",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
"datetime": "20200420 2329"
}

Can anyone tell me what is going wrong with these outputs?

Best answer

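How are you getting the scraped data? From what you have shown, I suspect you are copying it from the terminal. Is that the case? If so, you can save it directly to a file with the following command: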
scrapy crawl special_offers -o <where save the file>/special_offers.json
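With -o, Scrapy picks the export format from the file extension. If you would rather keep the export in settings.py, the following is a rough sketch of an equivalent configuration, under some assumptions: as far as I know, older Scrapy releases read the format from FEED_FORMAT, and FEED_TYPE is not a setting Scrapy recognizes, so the export may have fallen back to the default JSON Lines format (which would explain JSON objects inside a .csv file). The file:// form of the path is also an assumption about what works on Windows. Newer Scrapy releases replace both settings with a single FEEDS dictionary.

```python
# settings.py (sketch; assumes a Scrapy release that still reads FEED_URI / FEED_FORMAT)
import datetime

FEED_EXPORT_ENCODING = 'utf-8'

# Timestamp used in the output file name, e.g. 20200420_2330
formatted = datetime.datetime.now().strftime('%Y%m%d_%H%M')

# Assumption: on Windows the feed location is safer written as a file:// URI
# than as a bare path with backslashes.
FEED_URI = f'file:///C:/Users/Acer/Desktop/{formatted}.csv'

# FEED_FORMAT (not FEED_TYPE) selects the exporter: 'csv', 'json', 'jsonlines', ...
FEED_FORMAT = 'csv'
```

For a JSON array with commas between items, the same sketch would use 'json' as the format and a .json file name.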
I hope this solves your problem. Please let me know.
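If you already have files like the JSON sample in the question (objects written back to back with no commas between them), one option is to decode one object at a time and save the result again as a single JSON array. This is a minimal sketch using only the standard library. The function name parse_concatenated_json and the sample string are made up for illustration.

```python
import json


def parse_concatenated_json(text):
    """Parse JSON objects written back to back with no commas between them."""
    decoder = json.JSONDecoder()
    items = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it by hand
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # Decode exactly one object and get the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        items.append(obj)
    return items


# Hypothetical sample: one single-line object and one pretty-printed object
sample = (
    '{"title": "A", "discounted_price": "R$12.74"}\n'
    '{\n"title": "B",\n"discounted_price": "R$10.29"\n}\n'
)

items = parse_concatenated_json(sample)

# A single JSON array, with commas between the objects
as_array = json.dumps(items, ensure_ascii=False, indent=2)
```

Writing as_array to a .json file should give a document that tools expecting one top-level JSON value can load.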

Regarding python - Scrapy output JSON or CSV, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61335155/
