python - Scrapy python CSV output has blank lines between each row

Tags: python, csv, web-scraping, scrapy

In the generated CSV output file there is an extra blank line between every row of Scrapy output.
I have migrated from Python 2 to Python 3 and I am using Windows 10, so I am updating my Scrapy project for Python 3.
My current (and, for now, only) problem is that when I write the Scrapy output to a CSV file, there is a blank line between every row. This has been highlighted in several posts here (it is Windows-related), but I have not been able to get a solution to work.
As it happens, I have also added some code to the pipelines.py file to ensure the CSV output is in a given column order rather than a random one. Because of that, I can run this code with the normal scrapy crawl charleschurch instead of scrapy crawl charleschurch -o charleschurch2017xxxx.csv.
Does anyone know how to skip/omit this blank line in the CSV output?
Below is my pipelines.py code (I probably don't need the import csv line, but I suspect I might need it for the final answer):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CSVPipeline(object):

  def __init__(self):
    self.files = {}

  @classmethod
  def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
    crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
    return pipeline

  def spider_opened(self, spider):
    file = open('%s_items.csv' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = CsvItemExporter(file)
    self.exporter.fields_to_export = ["plotid","plotprice","plotname","name","address"]
    self.exporter.start_exporting()

  def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

  def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item

I added this line to the settings.py file (I'm not sure of the relevance of the 300):
ITEM_PIPELINES = {'CharlesChurch.pipelines.CSVPipeline': 300 }
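For what it's worth, the 300 is a priority value in the range 0-1000: when several pipelines are enabled, Scrapy runs them in ascending order of this number. A minimal sketch of how the ordering works (SomeOtherPipeline is hypothetical, purely to illustrate):

```python
# Hypothetical settings.py fragment with two pipelines enabled.
ITEM_PIPELINES = {
    'CharlesChurch.pipelines.CSVPipeline': 300,        # runs second
    'CharlesChurch.pipelines.SomeOtherPipeline': 100,  # runs first
}

# Scrapy processes each item through the pipelines in ascending
# order of the integer value.
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

So with only one pipeline enabled, the exact number doesn't matter; it only becomes relevant when ordering multiple pipelines relative to each other.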

My spider code is as follows:
import scrapy
from urllib.parse import urljoin

from CharlesChurch.items import CharleschurchItem

class charleschurchSpider(scrapy.Spider):
    name = "charleschurch"
    allowed_domains = ["charleschurch.com"]    
    start_urls = ["https://www.charleschurch.com/county-durham_willington/the-ridings-1111"]


    def parse(self, response):

        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
           item = CharleschurchItem()
           item['name'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/span[1]/b/text()').extract()
           item['address'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/div/*[@itemprop="postalCode"]/text()').extract()
           plotnames = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/text()').extract()
           plotnames = [plotname.strip() for plotname in plotnames]
           plotids = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/@href').extract()
           plotids = [plotid.strip() for plotid in plotids]
           plotprices = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__price"]/text()').extract()
           plotprices = [plotprice.strip() for plotprice in plotprices]
           result = zip(plotnames, plotids, plotprices)
           for plotname, plotid, plotprice in result:
               item['plotname'] = plotname
               item['plotid'] = plotid
               item['plotprice'] = plotprice
               yield item

Best Answer

I suspect this is less than ideal, but I found a workaround for this problem. In the pipelines.py file I added some more code that essentially reads the CSV file into a list, removes the blank lines, and then writes the cleaned list to a new file.
The code I added is:

with open('%s_items.csv' % spider.name, 'r') as f:
  reader = csv.reader(f)
  original_list = list(reader)
  cleaned_list = list(filter(None, original_list))

with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
  wr = csv.writer(output_file, dialect='excel')
  for data in cleaned_list:
    wr.writerow(data)

So the whole pipelines.py file is:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CSVPipeline(object):

  def __init__(self):
    self.files = {}

  @classmethod
  def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
    crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
    return pipeline

  def spider_opened(self, spider):
    file = open('%s_items.csv' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = CsvItemExporter(file)
    self.exporter.fields_to_export = ["plotid","plotprice","plotname","name","address"]
    self.exporter.start_exporting()

  def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

    # given I am using Windows I need to eliminate the blank lines in the csv file
    print("Starting csv blank line cleaning")
    with open('%s_items.csv' % spider.name, 'r') as f:
      reader = csv.reader(f)
      original_list = list(reader)
      cleaned_list = list(filter(None,original_list))

    with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
      wr = csv.writer(output_file, dialect='excel')
      for data in cleaned_list:
        wr.writerow(data)

  def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item


class CharleschurchPipeline(object):
    def process_item(self, item, spider):
        return item

Not ideal, but it solves the problem for now.
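As a side note on why the blank lines appear at all: the Python csv docs require the target file to be opened with newline='' so that csv.writer controls the line endings itself; without it, Windows translates each written '\r\n' into '\r\r\n', which shows up as a blank row between records. The cleaned-file write above already passes newline=''; a minimal standalone sketch of the same rule (separate from the Scrapy pipeline, with a hypothetical filename):

```python
import csv

rows = [["plotid", "plotprice"], ["12", "100000"], ["13", "105000"]]

# newline='' is required by the csv docs: it stops the text layer from
# translating '\r\n' a second time on Windows, which would otherwise
# produce '\r\r\n' and hence a blank line after every record.
with open("charleschurch_items_demo.csv", "w", newline="") as f:
    csv.writer(f, dialect="excel").writerows(rows)
```

Fixing the file mode at write time like this would make the read-filter-rewrite step unnecessary for plain csv.writer output.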
