python - Scrapy item extraction scope issue

Tags: python scope scrapy pipeline

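I'm having a scope problem when returning a Scrapy item (players) to my pipeline. I'm fairly sure I know what the problem is, but I'm not sure how to integrate the fix into my code. I'm also confident that the code the pipeline needs to process is now written correctly. The catch is that I declare the players item inside the parseRoster() function, so I know its scope is limited to that function.

Now my question is: where in my code do I need to declare the players item so that it is visible to my pipeline? My goal is to get this data into my database. I assume it belongs in the main loop of my code, and if so, how do I return both item and my newly declared players item?

My code is below: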

class NbastatsSpider(scrapy.Spider):
    name = "nbaStats"

    start_urls = [
        # only start_urls is set; allowed_domains is left out because it caused issues when navigating to team roster pages
        "http://espn.go.com/nba/teams"
        ]

    def parse(self, response):
        items = []  # list that stores TeamStats items
        i = 0  # counter needed for older code

        for division in response.xpath('//div[@id="content"]//div[contains(@class, "mod-teams-list-medium")]'):
            for team in division.xpath('.//div[contains(@class, "mod-content")]//li'):
                item = TeamStats()

                item['division'] = division.xpath('.//div[contains(@class, "mod-header")]/h4/text()').extract()[0]
                item['team'] = team.xpath('.//h5/a/text()').extract()[0]
                item['rosterurl'] = "http://espn.go.com" + team.xpath('.//div/span[2]/a[3]/@href').extract()[0]
                items.append(item)
                request = scrapy.Request(item['rosterurl'], callback=self.parseWPNow)
                request.meta['play'] = item

                yield request

        print(item)

    def parseWPNow(self, response):
        item = response.meta['play']
        item = self.parseRoster(item, response)

        return item

    def parseRoster(self, item, response):
        for player in response.xpath("//td[@class='sortcell']"):
            # create a new Player per table row so earlier rows are not overwritten
            players = Player()
            players['name'] = player.xpath("a/text()").extract()[0]
            players['position'] = player.xpath("following-sibling::td[1]").extract()[0]
            players['age'] = player.xpath("following-sibling::td[2]").extract()[0]
            players['height'] = player.xpath("following-sibling::td[3]").extract()[0]
            players['weight'] = player.xpath("following-sibling::td[4]").extract()[0]
            players['college'] = player.xpath("following-sibling::td[5]").extract()[0]
            players['salary'] = player.xpath("following-sibling::td[6]").extract()[0]
            yield players
        item['playerurl'] = response.xpath("//td[@class='sortcell']/a").extract()
        yield item

Best answer

According to the relevant part of Scrapy's data flow:

The Engine sends scraped Items (returned by the Spider) to the Item Pipeline and Requests (returned by spider) to the Scheduler

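In other words, return/yield your item instances from the spider, then handle them in the process_item() method of your pipeline. Scope inside the spider does not matter here, because the engine passes every yielded item to the pipeline. Since you have multiple item classes, use the isinstance() built-in function to tell them apart: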

def process_item(self, item, spider):
    if isinstance(item, TeamStats):
        # process team stats
        pass

    if isinstance(item, Player):
        # process player
        pass

    # return the item so any later pipeline still receives it
    return item
Regarding python - Scrapy item extraction scope issue, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27832368/
