python - Creating a simple Python crawler with Scrapy

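I am currently trying to build a simple crawler in Python using Scrapy. What I want it to do is read a list of links and save the HTML of the pages they point to. Right now I can get all of the URLs, but I cannot figure out how to download the pages. Here is the code for my spider so far: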

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BookItem

# Book Scrapy spider

class DmozSpider(BaseSpider):
    name = "book"
    allowed_domains = ["learnpythonthehardway.org"]
    start_urls = [
        "http://www.learnpythonthehardway.org/book/",
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        file = open(filename,'wb')
        file.write(response.body)
        file.close()

        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = BookItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            items.append(item)
        return items

Best answer

In your parse method, include Request objects in the list of items you return to trigger the downloads:

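for site in sites:
    ...
    items.append(item)
    # callback is an argument of Request, not of append(); Request comes from scrapy.http.
    # extract() returns a list of hrefs, so issue one Request per entry.
    for href in item['link']:
        items.append(Request(href, callback=self.parse))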

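This causes the crawler to produce a BookItem for each link and also to recurse into and download each book page. Of course, if you want to parse the subpages differently, you can specify a different callback (for example, self.parsebook).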
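
Two details are worth checking before relying on this. Scrapy's Request expects an absolute URL, while hrefs extracted from a page are often relative. Also, the question's filename logic (response.url.split("/")[-2]) gives "book" for every URL of the form .../book/something.html, so each subpage would overwrite the same file. The following is a minimal standard-library sketch of one way to handle both, not part of the original answer. The helper names absolute_links and page_filename and the sample hrefs are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(base_url, hrefs):
    """Resolve each extracted href against the URL of the page it came from."""
    return [urljoin(base_url, href) for href in hrefs]


def page_filename(url):
    """Use the last non-empty path segment as the filename for a page."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Fall back to a fixed name when the URL has no path at all.
    return segments[-1] if segments else "index.html"


# The start URL is from the question; the hrefs below are made-up examples.
base = "http://www.learnpythonthehardway.org/book/"
links = absolute_links(base, ["intro.html", "/book/ex1.html"])

print(links)
print([page_filename(u) for u in links])
```

Inside the spider, the resolved URLs would be the ones passed to Request, and page_filename(response.url) could replace the split-based filename in the callback that writes response.body to disk.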

For python - Creating a simple Python crawler with Scrapy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/12153671/
