python - scrapy: passing start_url to subsequent requests

Tags: python web-crawler scrapy

For three days now I have been trying to save the respective start_url in a meta attribute, so it can be passed along with the item to subsequent requests in scrapy; that way I can use the start_url to look up a dictionary and populate my output with additional data. It should actually be straightforward, because it is explained in the documentation. ...

There is a discussion in the scrapy google group and also a question here, but I cannot get it up and running :(

I'm new to scrapy and I think it's a great framework, but for my project I have to know the start_url of every request, and that is looking rather complicated.

I would really appreciate some help!

At the moment my code looks like this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector


class example(CrawlSpider):

    name = 'example'
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/blablabla/',)), callback='parse_item'),
    )

    def parse(self, response):
        # Forward start_url from the current response onto every
        # request that the CrawlSpider machinery produces.
        for request_or_item in super(example, self).parse(response):
            if isinstance(request_or_item, Request):
                request_or_item = request_or_item.replace(
                    meta={'start_url': response.meta['start_url']})
            yield request_or_item

    def make_requests_from_url(self, url):
        # Seed the initial requests with their own URL as start_url.
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = testItem()  # testItem is the project's Item subclass
        print response.request.meta, response.url
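
For reference, the documentation pattern the question alludes to is simply that meta is a dict attached to a Request which scrapy hands back on the matching Response. A minimal sketch of that hand-off in a modern scrapy, independent of the CrawlSpider problem above (the spider name and URL are illustrative only):

import scrapy


class MetaDemoSpider(scrapy.Spider):

    name = 'meta_demo'  # hypothetical name, for illustration

    def start_requests(self):
        url = 'http://www.example.com'
        # Whatever is put into meta when the request is built...
        yield scrapy.Request(url, meta={'start_url': url})

    def parse(self, response):
        # ...comes back on the response of that same request.
        self.logger.info('start_url was %s', response.meta['start_url'])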

Best Answer

I wanted to delete this answer because it doesn't solve the OP's problem, but I'd like to keep it as a scrapy example.


Warning:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
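
(Side note, not part of the original answer: with a modern scrapy the warning can be respected without giving up CrawlSpider, because a Rule accepts a process_request hook that can copy meta onto every extracted request. A minimal sketch, assuming Scrapy >= 2.0, where the hook receives both the request and the originating response; the spider and hook names are hypothetical:)

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StartUrlSpider(CrawlSpider):

    name = 'start_url_example'  # hypothetical name, for illustration
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=('/blablabla/',)),
             callback='parse_item',
             process_request='tag_with_start_url'),
    )

    def start_requests(self):
        # Seed each initial request with its own URL; the default
        # callback (CrawlSpider's parse) is left untouched.
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'start_url': url}, dont_filter=True)

    def tag_with_start_url(self, request, response):
        # Copy start_url forward onto every request the rule extracts.
        request.meta['start_url'] = response.meta['start_url']
        return request

    def parse_item(self, response):
        self.logger.info('%s came from %s', response.url,
                         response.meta['start_url'])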

Use BaseSpider instead:

import urlparse
from datetime import datetime

from scrapy.spider import BaseSpider

import items     # the answerer's own item definitions
import settings  # the answerer's own module exposing a DB connection


class Spider(BaseSpider):

    name = "domain_spider"

    def start_requests(self):
        last_domain_id = 0
        chunk_size = 10
        cursor = settings.db.cursor()

        while True:
            # Fetch the next batch of domains that haven't been scraped yet.
            cursor.execute("""
                    SELECT domain_id, domain_url
                    FROM domains
                    WHERE domain_id > %s AND scraping_started IS NULL
                    LIMIT %s
                """, (last_domain_id, chunk_size))
            self.log('Requesting %s domains after %s' % (chunk_size, last_domain_id))
            rows = cursor.fetchall()
            if not rows:
                self.log('No more domains to scrape.')
                break

            for domain_id, domain_url in rows:
                last_domain_id = domain_id
                request = self.make_requests_from_url(domain_url)
                # Attach the start_url (and related data) to the request
                # as an item carried in meta.
                item = items.Item()
                item['start_url'] = domain_url
                item['domain_id'] = domain_id
                item['domain'] = urlparse.urlparse(domain_url).hostname
                request.meta['item'] = item

                # Mark the domain as in progress so it isn't selected again.
                cursor.execute("""
                        UPDATE domains
                        SET scraping_started = %s
                        WHERE domain_id = %s
                    """, (datetime.now(), domain_id))

                yield request

    ...
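
The elided remainder would typically include the callback that reads the item back out of response.meta; a minimal sketch of that consumer side, continuing the class above (hypothetical, not from the original answer; the fields match the item populated in start_requests):

    def parse(self, response):
        # Recover the item that start_requests attached to this request.
        item = response.meta['item']
        self.log('Scraped %s (start_url: %s)' % (response.url, item['start_url']))
        yield item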

On python - scrapy: passing start_url to subsequent requests, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/11786259/
