I am using Python.org version 2.7 64-bit on Windows Vista 64-bit. I have the following code, which runs a regular expression against a JavaScript item called DataStore.prime that I know definitely exists on the static page I am targeting when I use a BaseSpider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
class ExampleSpider(CrawlSpider):
    name = "goal4"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=('/Teams',)), follow=True, callback='parse_item')]

    def parse_item(self, response):
        playerdata = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
            + '(\[.*\])' + re.escape(");"), response.body).group(1)
        for player in json.loads(playerdata):
            print player['FirstName'], player['LastName'], player['TeamName'], player['PositionText'], player['PositionLong'], \
                player['Age'], player['Height'], player['Weight'], player['GameStarted'], player['SubOn'], player['SubOff'], \
                player['Goals'], player['OwnGoals'], player['Assists'], player['Yellow'], player['SecondYellow'], player['Red'], \
                player['TotalShots'], player['ShotsOnTarget'], player['ShotsBlocked'], player['TotalPasses'], player['AccuratePasses'], \
                player['KeyPasses'], player['TotalLongBalls'], player['AccurateLongBalls'], player['TotalThroughBalls'], \
                player['AccurateThroughBalls'], player['AerialWon'], player['AerialLost'], player['TotalTackles'], \
                player['Interceptions'], player['Fouls'], player['Offsides'], player['OffsidesWon'], player['TotalClearances'], \
                player['WasDribbled'], player['Dribbles'], player['WasFouled'], player['Dispossesed'], player['Turnovers'], \
                player['TotalCrosses'], player['AccurateCrosses']

execute(['scrapy','crawl','goal4'])
When this regex is used as part of a CrawlSpider, as in the example above, the code throws the following error:
Traceback (most recent call last):
File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
self.runUntilCurrent()
File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
self._startRunCallbacks(result)
File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\Python27\missing\missing\spiders\mrcrawl2.py", line 26, in parse
+ '(\[.*\])' + re.escape(");"), response.body).group(1)
exceptions.AttributeError: 'NoneType' object has no attribute 'group'
A static page on which I know this example works can be found here:
http://www.whoscored.com/Teams/705/Archive/Israel-Maccabi-Haifa

I assume that if Scrapy tries to parse a page on which it does not encounter an instance of DataStore.prime, this causes the error above. Can anyone tell me whether:
1) this assumption is correct, and
2) how I can get around the problem? I have tried using try:/except: blocks, but I am not sure how to write the "on error, scrape the next page" logic.
Thanks
Best Answer
The problem comes from chaining the search and group method calls together. If search finds no match it returns None, and None.group raises an AttributeError.
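That failure mode can be reproduced in isolation. This is a minimal sketch (written here in Python-3-compatible syntax; the page body and pattern are made up for illustration):

```python
import re

# A stand-in page body that does NOT contain the stats script block
body = "<html>no stats script here</html>"

# re.search returns None when the pattern is absent from the input
match = re.search(r"DataStore\.prime\('stage-player-stat'.*?(\[.*\]);", body)
print(match)  # None

try:
    match.group(1)  # chaining .group() onto the None result
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'group'
```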
Instead, split the two method calls apart and test the result with if match is not None. For example:
def parse_item(self, response):
    match = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
        + '(\[.*\])' + re.escape(");"), response.body)
    if match is not None:
        playerdata = match.group(1)
        for player in json.loads(playerdata):
            ...
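The guarded pattern can be exercised outside Scrapy as well. The sketch below (Python-3-compatible syntax; the page body and the player record are made-up sample data, not real whoscored.com content) shows that when the match succeeds, the captured group is a JSON array ready for json.loads, and when it fails, the page is simply skipped instead of crashing:

```python
import json
import re

# A made-up page body standing in for response.body; the real pages
# embed a JSON array inside a DataStore.prime(...) call.
body = ("DataStore.prime('stage-player-stat', "
        "defaultTeamPlayerStatsConfigParams.defaultParams , "
        '[{"FirstName": "Sample", "LastName": "Player"}]);')

# Same pattern construction as in the spider
pattern = (re.escape("DataStore.prime('stage-player-stat', "
                     "defaultTeamPlayerStatsConfigParams.defaultParams , ")
           + r'(\[.*\])' + re.escape(");"))

match = re.search(pattern, body)
if match is not None:
    # Only reached when the stats block was present on the page
    for player in json.loads(match.group(1)):
        print(player['FirstName'], player['LastName'])  # Sample Player
else:
    # Pages without the stats block are skipped, not crashed on
    print("no player data on this page")
```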
On the question python - Regex that works with a BaseSpider causes an error with a CrawlSpider, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25087072/