python - Scrapy spider results not being piped into database when run from a script

Tags: python postgresql sqlalchemy web-scraping scrapy


I have written a Scrapy spider that I am trying to run from a Python script located in another directory. The code I am using from the docs seems to run the spider, but when I check my PostgreSQL table, it has not been created. The spider only pipes the scraped data correctly when I use the scrapy crawl command. I have tried placing the script both in the directory just above the Scrapy project and in the same directory as the config file, but neither seems to create the table.

Below is the code for the script, followed by the spider's code. I think the problem involves the directory the script should be placed in and/or the code I am using in the spider file to make it runnable from a script, but I'm not sure. Is there something wrong with the function called in the script, or does something need to change in the settings file? I can provide the pipeline file's code if needed. Thanks.

Script file (only 3 lines):

from ticket_city_scraper import *
from ticket_city_scraper.spiders import tc_spider 

tc_spider.spiderCrawl()

Spider file:

import scrapy
import re
import json
from scrapy.crawler import CrawlerProcess
from scrapy import Request
from scrapy.contrib.spiders import CrawlSpider , Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from ticket_city_scraper.items import ComparatorItem
from urlparse import urljoin

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor, defer
from scrapy.utils.log import configure_logging



bandname = raw_input("Enter bandname\n")
tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"  

class MySpider3(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.ticketcity.com"]

    start_urls = [tc_url]
    tickets_list_xpath = './/div[@class = "vevent"]'
    def create_link(self, bandname):
        tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"  
        self.start_urls = [tc_url]
        #return tc_url      


    def parse_json(self, response):
        loader = response.meta['loader']
        jsonresponse = json.loads(response.body_as_unicode())
        ticket_info = jsonresponse.get('B')
        price_list = [i.get('P') for i in ticket_info]
        if len(price_list) > 0:
            str_Price = str(price_list[0])
            ticketPrice = unicode(str_Price, "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        else:
            ticketPrice = unicode("sold out", "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        return loader.load_item()

    def parse_price(self, response):
        print "parse price function entered \n"
        loader = response.meta['loader']
        event_City = response.xpath('.//span[@itemprop="addressLocality"]/text()').extract() 
        eventCity = ''.join(event_City) 
        loader.add_value('eventCity' , eventCity)
        event_State = response.xpath('.//span[@itemprop="addressRegion"]/text()').extract() 
        eventState = ''.join(event_State) 
        loader.add_value('eventState' , eventState) 
        event_Date = response.xpath('.//span[@class="event_datetime"]/text()').extract() 
        eventDate = ''.join(event_Date)  
        loader.add_value('eventDate' , eventDate)    
        ticketsLink = loader.get_output_value("ticketsLink")
        json_id_list= re.findall(r"(\d+)[^-]*$", ticketsLink)
        json_id=  "".join(json_id_list)
        json_url = "https://www.ticketcity.com/Catalog/public/v1/events/" + json_id + "/ticketblocks?P=0,99999999&q=0&per_page=250&page=1&sort=p.asc&f.t=s&_=1436642392938"
        yield scrapy.Request(json_url, meta={'loader': loader}, callback = self.parse_json, dont_filter = True) 

    def parse(self, response):
        """
        # """
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):
            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            loader.add_xpath('eventName' , './/span[@class="summary listingEventName"]/text()')
            loader.add_xpath('eventLocation' , './/div[@class="divVenue location"]/text()')
            loader.add_xpath('ticketsLink' , './/a[@class="divEventDetails url"]/@href')
            #loader.add_xpath('eventDateTime' , '//div[@id="divEventDate"]/@title') #datetime type
            #loader.add_xpath('eventTime' , './/*[@class = "productionsTime"]/text()')

            print "Here is ticket link \n" + loader.get_output_value("ticketsLink")
            #sel.xpath("//span[@id='PractitionerDetails1_Label4']/text()").extract()
            ticketsURL = "https://www.ticketcity.com/" + loader.get_output_value("ticketsLink")
            ticketsURL = urljoin(response.url, ticketsURL)
            yield scrapy.Request(ticketsURL, meta={'loader': loader}, callback = self.parse_price, dont_filter = True)

def spiderCrawl():
   process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
   })
   process.crawl(MySpider3)
   process.start()

Accepted answer

This happens because your settings object contains only a user agent. Your project settings determine which pipelines run. From the Scrapy docs:

You can automatically import your spiders passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.

More information here: http://doc.scrapy.org/en/latest/topics/practices.html
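Concretely, the pipeline wiring lives in the project's settings.py, which a CrawlerProcess built from a bare dict never reads. A typical entry looks like the sketch below (the pipeline class name is an assumption for illustration — use whatever your pipelines.py actually defines):

```python
# ticket_city_scraper/settings.py
# The pipeline is enabled only here. Passing CrawlerProcess a plain dict
# containing just a USER_AGENT bypasses this file entirely, so items are
# scraped but never written to PostgreSQL.
ITEM_PIPELINES = {
    # hypothetical class name -- match it to your pipelines.py
    'ticket_city_scraper.pipelines.ComparatorPipeline': 300,
}
```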

Read a bit further than the first example on that page.

Regarding "python - Scrapy spider results not being piped into database when run from a script", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31618473/
