python - Regex that works with BaseSpider causes an error with CrawlSpider

Tags: python regex json scrapy

I am using the Python.org build of Python 2.7 (64-bit) on Windows Vista 64-bit. The code below applies a regex to a JavaScript call named DataStore.prime. I know the regex works on a static page when I run it with BaseSpider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json


class ExampleSpider(CrawlSpider):
    name = "goal4"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=('/Teams',)), follow=True, callback='parse_item')]

    def parse_item(self, response):


        playerdata = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
                     + '(\[.*\])' + re.escape(");"), response.body).group(1)

        for player in json.loads(playerdata):
            print player['FirstName'], player['LastName'], player['TeamName'], player['PositionText'], player['PositionLong'] \
            , player['Age'] \
            , player['Height'], player['Weight'], player['GameStarted'], player['SubOn'], player['SubOff'] \
            , player['Goals'], player['OwnGoals'], player['Assists'], player['Yellow'], player['SecondYellow'], player['Red'] \
            , player['TotalShots'] \
            , player['ShotsOnTarget'], player['ShotsBlocked'], player['TotalPasses'], player['AccuratePasses'], player['KeyPasses'] \
            , player['TotalLongBalls'], player['AccurateLongBalls'], player['TotalThroughBalls'], player['AccurateThroughBalls'] \
            , player['AerialWon'], player['AerialLost'], player['TotalTackles'], player['Interceptions'], player['Fouls'] \
            , player['Offsides'], player['OffsidesWon'], player['TotalClearances'], player['WasDribbled'], player['Dribbles'] \
            , player['WasFouled'] \
            , player['Dispossesed'], player['Turnovers'], player['TotalCrosses'], player['AccurateCrosses'] \

execute(['scrapy','crawl','goal4'])

When this regex is used as part of a CrawlSpider (as in the example above), the code throws the following error:

 Traceback (most recent call last):
   File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
     self.runUntilCurrent()
   File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
     call.func(*call.args, **call.kw)
   File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
     self._startRunCallbacks(result)
   File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "c:\Python27\missing\missing\spiders\mrcrawl2.py", line 26, in parse
     + '(\[.*\])' + re.escape(");"), response.body).group(1)
 exceptions.AttributeError: 'NoneType' object has no attribute 'group'

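The static page on which I know this example works is here:

http://www.whoscored.com/Teams/705/Archive/Israel-Maccabi-Haifa

I assume the error above occurs when Scrapy tries to parse a page that contains no instance of DataStore.prime. Can anyone tell me:

1) whether this assumption is correct
2) how I can work around the problem. I have tried using "try:" and "except:", but I am not sure how to code "if there is an error, crawl the next page".

Thanks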

Best answer

The problem comes from chaining the two method calls, search and group, together. If search finds no match it returns None, and None.group raises an AttributeError.

Instead, separate the two method calls and guard the second one with if match is not None. For example:

def parse_item(self, response):

    match = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
                 + '(\[.*\])' + re.escape(");"), response.body)
    if match is not None:
        playerdata = match.group(1)

        for player in json.loads(playerdata):
            ...
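
The same guard can be tried outside Scrapy. Below is a minimal sketch, assuming the page body is available as a text string; the helper name extract_players and the two sample page bodies are made up for illustration and are not part of the original code. Only the DataStore.prime prefix is taken from the question.

```python
import json
import re

# Literal prefix copied from the question; everything else here is illustrative.
PREFIX = ("DataStore.prime('stage-player-stat', "
          "defaultTeamPlayerStatsConfigParams.defaultParams , ")
PLAYER_DATA_RE = re.compile(re.escape(PREFIX) + r"(\[.*\])" + re.escape(");"))


def extract_players(body):
    """Return the decoded player list, or None when the page has no match."""
    match = PLAYER_DATA_RE.search(body)
    if match is None:
        # No DataStore.prime call on this page: report that instead of raising.
        return None
    return json.loads(match.group(1))


# Hypothetical page bodies, one with the call and one without.
page_with_data = PREFIX + '[{"FirstName": "A", "LastName": "B"}]);'
page_without_data = "<html><body>No player stats here</body></html>"

players = extract_players(page_with_data)
missing = extract_players(page_without_data)
```

Inside parse_item, a helper like this would let the callback return early when the result is None, so the crawl simply carries on with the next response rather than raising.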

For python - Regex that works with BaseSpider causes an error with CrawlSpider, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25087072/
