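I tried the example Scrapy usage shown on the documentation page (the example under the heading "Returning multiple Requests and items from a single callback").
I only changed the domain to point to a real website: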
import scrapy

class MySpider(scrapy.Spider):
    name = 'huffingtonpost'
    allowed_domains = ['huffingtonpost.com/']
    start_urls = [
        'http://www.huffingtonpost.com/politics/',
        'http://www.huffingtonpost.com/entertainment/',
        'http://www.huffingtonpost.com/media/',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
But I got a ValueError, as described in this gist.
Any ideas?
Best answer
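Some of the extracted links are relative (for example, /news/hillary-clinton/). scrapy.Request rejects a URL that has no scheme and raises ValueError.
You should convert such links to absolute URLs (http://www.huffingtonpost.com/news/hillary-clinton/):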
import scrapy

class MySpider(scrapy.Spider):
    name = 'huffingtonpost'
    # allowed_domains takes bare domain names, so the trailing slash is dropped
    allowed_domains = ['huffingtonpost.com']
    start_urls = [
        'http://www.huffingtonpost.com/politics/',
        'http://www.huffingtonpost.com/entertainment/',
        'http://www.huffingtonpost.com/media/',
    ]

    def parse(self, response):
        # yield one item per <h3> heading on the page
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}
        # follow every link found on the page
        for url in response.xpath('//a/@href').extract():
            if url.startswith('/'):
                # transform a site-relative url into an absolute one
                url = 'http://www.huffingtonpost.com' + url
            if url.startswith('#'):
                # ignore hrefs that start with # (in-page anchors)
                continue
            yield scrapy.Request(url, callback=self.parse)
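The string concatenation above only handles hrefs that begin with "/". As a rough sketch of a more general approach, the standard library's urllib.parse.urljoin can resolve any relative href against the page URL. The helper name to_absolute is hypothetical and is not part of Scrapy. Inside a spider, response.urljoin(url) should serve the same purpose as the urljoin call here.

```python
from urllib.parse import urljoin, urlparse


def to_absolute(base_url, href):
    """Hypothetical helper: resolve href against base_url.

    Returns None for links that are not worth following.
    """
    href = href.strip()
    # Skip empty links and in-page fragments such as "#comments".
    if not href or href.startswith('#'):
        return None
    absolute = urljoin(base_url, href)
    # Keep only http(s) links; this drops mailto:, javascript:, and similar.
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None
    return absolute


if __name__ == '__main__':
    base = 'http://www.huffingtonpost.com/politics/'
    for href in ['/news/hillary-clinton/', 'archive/', '#top']:
        print(href, '->', to_absolute(base, href))
```

In the parse callback, this would mean calling the helper with response.url and each extracted href, and yielding a request only when the result is not None.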
Regarding "python - Error in the official Scrapy example?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/32906873/