For three days now I have been trying to save the corresponding start_url in the meta attribute so it gets passed along to subsequent requests in scrapy; that way I can use the start_url as the key into a dict and enrich my output with additional data. It really ought to be simple, since it is explained in the documentation ...
There is a discussion in the scrapy Google group and also a question here, but I cannot get it to run :(
I am new to scrapy and I think it is a great framework, but for my project I need to know the start_url of every request, and that turns out to be surprisingly complicated.
I would really appreciate some help!
At the moment my code looks like this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class example(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/blablabla/', )), callback='parse_item'),
    )

    def parse(self, response):
        # forward start_url from the current response onto every new request
        for request_or_item in super(example, self).parse(response):
            if isinstance(request_or_item, Request):
                request_or_item = request_or_item.replace(meta={'start_url': response.meta['start_url']})
            yield request_or_item

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = testItem()  # testItem is the project's item class (definition not shown)
        print response.request.meta, response.url
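For reference, the mechanism the question relies on: whatever meta a Request carries is handed back on the matching Response, but it is not propagated automatically, so it has to be re-attached to every follow-up request. A minimal sketch of that round trip (the callback name and the hrefs are hypothetical):

import urlparse
from scrapy.http import Request

def parse_page(self, response):
    # whatever the originating Request stored in meta is available here
    start_url = response.meta['start_url']
    for href in ['/page1', '/page2']:  # hypothetical follow-up links
        url = urlparse.urljoin(response.url, href)
        # re-attach the value so the next response carries it as well
        yield Request(url, meta={'start_url': start_url})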
Best answer
I wanted to delete this answer, since it does not solve the OP's problem, but I am leaving it up as a scrapy example.
Warning:
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
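The reason, in rough outline: CrawlSpider reserves parse() as its own dispatch point. A simplified paraphrase of its internals (approximate, for illustration only; the real code lives in scrapy/contrib/spiders/crawl.py):

class CrawlSpider(BaseSpider):
    def parse(self, response):
        # every response is routed through here so the spider can invoke
        # parse_start_url and the Rule callbacks; overriding parse()
        # therefore hijacks this dispatch and breaks rule handling
        return self._parse_response(response, self.parse_start_url,
                                    cb_kwargs={}, follow=True)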
Use BaseSpider instead:
import urlparse
from datetime import datetime

from scrapy.spider import BaseSpider

# settings and items are the answerer's own project modules
# (settings.db is assumed to be an open database connection)
import items
import settings

class Spider(BaseSpider):
    name = "domain_spider"

    def start_requests(self):
        last_domain_id = 0
        chunk_size = 10
        cursor = settings.db.cursor()

        while True:
            # fetch the next chunk of domains that have not been started yet
            cursor.execute("""
                    SELECT domain_id, domain_url
                    FROM domains
                    WHERE domain_id > %s AND scraping_started IS NULL
                    LIMIT %s
                """, (last_domain_id, chunk_size))
            self.log('Requesting %s domains after %s' % (chunk_size, last_domain_id))
            rows = cursor.fetchall()
            if not rows:
                self.log('No more domains to scrape.')
                break

            for domain_id, domain_url in rows:
                last_domain_id = domain_id
                request = self.make_requests_from_url(domain_url)
                item = items.Item()
                item['start_url'] = domain_url
                item['domain_id'] = domain_id
                item['domain'] = urlparse.urlparse(domain_url).hostname
                # stash the pre-filled item on the request so the
                # callback can pick it up from response.meta
                request.meta['item'] = item

                cursor.execute("""
                        UPDATE domains
                        SET scraping_started = %s
                        WHERE domain_id = %s
                    """, (datetime.now(), domain_id))

                yield request
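The snippet stops before the callback, but a default parse() on the spider would presumably pick up the stashed item along these lines (a sketch; the extra field is made up):

def parse(self, response):
    # BaseSpider delivers responses here by default; the item built in
    # start_requests() rides along in response.meta
    item = response.meta['item']
    item['html'] = response.body  # hypothetical extra field
    yield item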
...
On the question of "python - scrapy: passing start_url on to subsequent requests", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11786259/