python - Scrapy python csv output has blank lines between each row

Tags: python, csv, web-scraping, scrapy

I am getting unwanted blank lines between each row of Scrapy output in the resulting csv output file.

I have moved from Python 2 to Python 3, and I am using Windows 10, so I am adapting my Scrapy project for Python 3.

My current (and, for now, only) problem is that when I write the Scrapy output to a CSV file, I get a blank line between each row. This has been highlighted in several posts here (it is related to Windows), but I have been unable to get a solution to work.
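For context on those Windows-related posts: with Python 3's csv module, the usual culprit is opening the output file in text mode without newline=''. csv.writer then emits '\r\n' itself, the text layer translates the '\n' again, and the file ends up with '\r\r\n' endings, which viewers show as a blank row between records. A minimal standalone demonstration of the documented fix (the file name demo.csv is arbitrary):

import csv

# Passing newline='' (as the csv docs recommend) stops the text layer
# from translating the '\r\n' that csv.writer writes into '\r\r\n'.
with open('demo.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['plotid', 'plotprice'])
    writer.writerow(['76', '272995'])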

As it happens, I have also added some code to my pipelines.py file to ensure the csv output is in a given column order rather than some random order, which is why I run this code with a plain scrapy crawl charleschurch rather than scrapy crawl charleschurch -o charleschurch2017xxxx.csv.

Does anyone know how to skip/omit this blank line in the CSV output?

My pipelines.py code is below (I probably don't need the import csv line, but I suspect the final answer might need it):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        # fix the column order of the exported csv
        self.exporter.fields_to_export = ["plotid", "plotprice", "plotname", "name", "address"]
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

I added this line to my settings.py file (I am not sure what the 300 signifies):

ITEM_PIPELINES = {'CharlesChurch.pipelines.CSVPipeline': 300 }
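For what it's worth, the 300 is the pipeline's order value: Scrapy runs every enabled item pipeline in ascending order of this integer (conventionally in the 0-1000 range), so the exact number only matters relative to other pipelines. A sketch with a second, purely hypothetical pipeline to illustrate:

ITEM_PIPELINES = {
    'CharlesChurch.pipelines.CSVPipeline': 300,      # lower value, runs first
    'CharlesChurch.pipelines.CleanupPipeline': 800,  # hypothetical later stage
}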

My Scrapy spider code is as follows:

import scrapy
from urllib.parse import urljoin

from CharlesChurch.items import CharleschurchItem

class charleschurchSpider(scrapy.Spider):
    name = "charleschurch"
    allowed_domains = ["charleschurch.com"]    
    start_urls = ["https://www.charleschurch.com/county-durham_willington/the-ridings-1111"]


    def parse(self, response):

        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
            item = CharleschurchItem()
            item['name'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/span[1]/b/text()').extract()
            item['address'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/div/*[@itemprop="postalCode"]/text()').extract()
            plotnames = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/text()').extract()
            plotnames = [plotname.strip() for plotname in plotnames]
            plotids = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/@href').extract()
            plotids = [plotid.strip() for plotid in plotids]
            plotprices = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__price"]/text()').extract()
            plotprices = [plotprice.strip() for plotprice in plotprices]
            # pair up the parallel lists so each unsold plot yields one item
            result = zip(plotnames, plotids, plotprices)
            for plotname, plotid, plotprice in result:
                item['plotname'] = plotname
                item['plotid'] = plotid
                item['plotprice'] = plotprice
                yield item

Best Answer

I suspect this is not ideal, but I have found a workaround for this problem. In the pipelines.py file I added some more code that essentially reads the csv file (with the blank lines) into a list, removes the blank lines, and then writes the cleaned list to a new file.

The code I added is:

with open('%s_items.csv' % spider.name, 'r') as f:
    reader = csv.reader(f)
    original_list = list(reader)
    # blank lines come back from csv.reader as empty lists, which are
    # falsy, so filter(None, ...) drops exactly those rows
    cleaned_list = list(filter(None, original_list))

with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
    wr = csv.writer(output_file, dialect='excel')
    for data in cleaned_list:
        wr.writerow(data)

So the whole pipelines.py file is:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        # fix the column order of the exported csv
        self.exporter.fields_to_export = ["plotid", "plotprice", "plotname", "name", "address"]
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

        # given I am using Windows, I need to eliminate the blank lines in the csv file
        print("Starting csv blank line cleaning")
        with open('%s_items.csv' % spider.name, 'r') as f:
            reader = csv.reader(f)
            original_list = list(reader)
            # blank lines come back from csv.reader as empty lists (falsy),
            # so filter(None, ...) drops exactly those rows
            cleaned_list = list(filter(None, original_list))

        with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
            wr = csv.writer(output_file, dialect='excel')
            for data in cleaned_list:
                wr.writerow(data)

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


class CharleschurchPipeline(object):
    def process_item(self, item, spider):
        return item

Not ideal, but it solves the problem for now.
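As a small refinement of the same idea, the cleanup can also rewrite the file in place instead of producing a second _cleaned file. A minimal sketch under the same assumption as above (the whole csv fits in memory); drop_blank_rows is a hypothetical helper name:

import csv

def drop_blank_rows(path):
    # csv.reader yields an empty list for a blank line, so keeping
    # only truthy rows drops exactly the blank ones
    with open(path, 'r', newline='') as f:
        rows = [row for row in csv.reader(f) if row]
    # newline='' on the way out stops Windows from re-introducing
    # the doubled '\r\r\n' line endings
    with open(path, 'w', newline='') as f:
        csv.writer(f, dialect='excel').writerows(rows)

drop_blank_rows('charleschurch_items.csv')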

This Q&A on "python - Scrapy python csv output has blank lines between each row" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/43472847/

Related articles:

python - Is there a better way to read elements from a file in Python?

javascript - Unable to apply the next-page scraping logic inside a function

Python + command-line string manipulation: what should I do differently?

Java Scanner Csv useDelimiter

python - Fastest way in Python to read a csv, process each row, and write a new csv

python - How can I scrape data from different sites with different source code using a single script?

vba - Looping through a set of pages with Selenium Basic (VBA)

python - Using cumcount on a pandas dataframe with a conditional increment

python - Allowing a custom Python class to support addition in both directions

python - A program that takes 5 students' scores in 4 subjects and outputs the highest average per student and per subject