I've been stuck on this bug for a while. The error message is as follows:
  File "C:\Python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\http\request\__init__.py", line 61, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: h
The spider code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request
from spyder.items import SypderItem
import sys
import MySQLdb
import hashlib
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
# _*_ coding: utf-8 _*_

class some_Spyder(CrawlSpider):
    name = "spyder"

    def __init__(self, *a, **kw):
        # catch the spider stopping
        # dispatcher.connect(self.spider_closed, signals.spider_closed)
        # dispatcher.connect(self.on_engine_stopped, signals.engine_stopped)
        self.allowed_domains = "domainname.com"
        self.start_urls = "http://www.domainname.com/"
        self.xpaths = '''//td[@class="CatBg" and @width="25%"
                         and @valign="top" and @align="center"]
                         /table[@cellspacing="0"]//tr/td/a/@href'''
        self.rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths=(self.xpaths))),
            Rule(SgmlLinkExtractor(allow=('cart.php?')), callback='parse_items'),
        )
        super(spyder, self).__init__(*a, **kw)

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        listings = sel.xpath('//*[@id="tabContent"]/table/tr')
        item = IgeItem()
        item["header"] = sel.xpath('//td[@valign="center"]/h1/text()')
        items.append(item)
        return items
I'm fairly sure this has something to do with the URLs I'm asking Scrapy to follow in the LinkExtractor. When I extract them in the shell, they look like this:
data=u'cart.php?target=category&category_id=826'
compared with another URL extracted from a working spider:
data=u'/path/someotherpath/category.php?query=someval'
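I've looked at a few questions on Stack Overflow, such as "Downloading pictures with scrapy", but from reading them I think I may have a slightly different problem.
I also looked at this - http://static.scrapy.org/coverage-report/scrapy_http_request___init__.html
which explains that the error is raised if self.URLs is missing a ":". Looking at the start_urls I've defined, I can't quite see why this error would appear, since the scheme is clearly defined.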
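As a side check on the relative links shown above: link extractors normally resolve an extracted href against the URL of the page it came from, so both forms should end up as absolute URLs with a scheme. The sketch below only illustrates that resolution step with the standard library's urljoin; the base URL is the placeholder domain from the question, and this is not Scrapy's own code.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the question

# Placeholder base URL taken from the question's start_urls.
base = "http://www.domainname.com/"

# The two relative hrefs shown in the shell output above.
relative_links = [
    "cart.php?target=category&category_id=826",
    "/path/someotherpath/category.php?query=someval",
]

# Resolve each relative link against the page URL.
absolute_links = [urljoin(base, link) for link in relative_links]
for url in absolute_links:
    print(url)
```

Both results start with http://, which suggests the relative hrefs are probably not what triggers the "Missing scheme" error.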
Best answer
Change start_urls to a list:
self.start_urls = ["http://www.bankofwow.com/"]
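Why this helps: Scrapy iterates over start_urls to build its initial requests, and iterating over a plain string yields one character at a time, so the first "URL" it sees is just "h". That matches the "Missing scheme in request url: h" message in the traceback. The sketch below reproduces the effect without Scrapy. check_scheme and first_error are hypothetical helpers written for illustration, and the ":" test is a simplified stand-in for the check described in the question, not Scrapy's actual implementation.

```python
def check_scheme(url):
    """Simplified stand-in for the scheme check (hypothetical helper)."""
    if ":" not in url:
        raise ValueError("Missing scheme in request url: %s" % url)
    return url


def first_error(start_urls):
    """Return the first error message hit while iterating start_urls, or None."""
    for url in start_urls:
        try:
            check_scheme(url)
        except ValueError as exc:
            return str(exc)
    return None


# Placeholder domain from the question.
as_string = "http://www.domainname.com/"    # iterates as 'h', 't', 't', 'p', ...
as_list = ["http://www.domainname.com/"]    # iterates as one complete URL

print(first_error(as_string))
print(first_error(as_list))
```

The same reasoning likely applies to allowed_domains, which Scrapy's documentation describes as a list of strings, so writing it as ["domainname.com"] would be the safer form.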
Source: the Stack Overflow question "python - Missing scheme in request URL": https://stackoverflow.com/questions/21103533/