python - Scrapy only crawls the first level of a website

Tags: python web-crawler scrapy

I am using Scrapy to crawl all the web pages under a domain.

I have looked at this question, but it has no solution. My problem seems to be similar. The output of my crawl command looks like this:

scrapy crawl sjsu

2012-02-22 19:41:35-0800 [scrapy] INFO: Scrapy 0.14.1 started (bot: sjsucrawler)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled item pipelines: 
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider opened
2012-02-22 19:41:35-0800 [sjsu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-02-22 19:41:35-0800 [sjsu] DEBUG: Crawled (200) <GET http://cs.sjsu.edu/> (referer: None)
2012-02-22 19:41:35-0800 [sjsu] INFO: Closing spider (finished)
2012-02-22 19:41:35-0800 [sjsu] INFO: Dumping spider stats:
    {'downloader/request_bytes': 198,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 11000,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 788155),
     'scheduler/memory_enqueued': 1,
     'start_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 379951)}
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider closed (finished)
2012-02-22 19:41:35-0800 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 29663232, 'memusage/startup': 29663232}

The problem here is that the crawl finds the links on the first page but does not visit them. What use is a crawler like that?

Edit:

My crawler code is:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class SjsuSpider(BaseSpider):
    name = "sjsu"
    allowed_domains = ["sjsu.edu"]
    start_urls = [
        "http://cs.sjsu.edu/"
    ]

    def parse(self, response):
        filename = "sjsupages"
        open(filename, 'wb').write(response.body)

All my other settings are the defaults.
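For context: a BaseSpider only requests the URLs in start_urls, and nothing else is fetched unless parse yields further Request objects. A minimal sketch of following the links by hand, assuming the Scrapy 0.14 / Python 2 API shown in the log above (the spider name is made up for illustration), would look like this:

from urlparse import urljoin  # Python 2 standard library

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class SjsuFollowSpider(BaseSpider):
    # hypothetical spider name, for illustration only
    name = "sjsu_follow"
    allowed_domains = ["sjsu.edu"]
    start_urls = ["http://cs.sjsu.edu/"]

    def parse(self, response):
        # append the body of every crawled page to one file
        open("sjsupages", 'ab').write(response.body)

        # extract every href and schedule it with the same callback;
        # OffsiteMiddleware drops links outside allowed_domains and the
        # default dupefilter skips URLs that were already requested
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            yield Request(urljoin(response.url, href), callback=self.parse)

The accepted answer below reaches the same goal with less code by using CrawlSpider.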

Best Answer

I think the best way is to use a CrawlSpider. You have to modify your code to something like the following in order to find all the links on the first page and visit them:

# imports for Scrapy 0.14, as used in the question
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class SjsuSpider(CrawlSpider):

    name = 'sjsu'
    allowed_domains = ['sjsu.edu']
    start_urls = ['http://cs.sjsu.edu/']
    # allow=() matches every link on the page
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        # selector available here if you want to extract data from the page
        x = HtmlXPathSelector(response)

        filename = "sjsupages"
        # open the file in append mode and write the binary response body
        open(filename, 'ab').write(response.body)

If you want to crawl every link on the site (and not only the links on the first level), the rule also has to follow the links it extracts. follow defaults to False whenever a callback is given, which is why the example above stops at the first level; and because CrawlSpider applies only the first rule that matches a link, adding a second catch-all rule would not help. So change the rules variable to this:

rules = [
    Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)
]

Finally, note that I changed your "parse" callback to "parse_item" because of this:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
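To make that warning concrete, here is a minimal sketch; the class and spider names are made up, and the imports assume the same Scrapy 0.14 modules as above:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class BrokenSpider(CrawlSpider):
    name = 'broken'
    start_urls = ['http://cs.sjsu.edu/']
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse')]

    # Defining parse() here overrides CrawlSpider.parse, the method that
    # applies the rules, so no links are ever extracted or followed.
    def parse(self, response):
        pass

class WorkingSpider(CrawlSpider):
    name = 'working'
    start_urls = ['http://cs.sjsu.edu/']
    # Use a differently named callback and let CrawlSpider.parse do its work.
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass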

For more information, see: http://doc.scrapy.org/en/0.14/topics/spiders.html#crawlspider

Regarding "python - Scrapy only crawls the first level of a website", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/9406895/

Related articles:

xpath syntax in Scrapy

python - How to recursively crawl every link on a website using Scrapy?

python - Converting the dtypes of a Pandas dataframe to match another

python - How to create and serialize an unmanaged model in Django

python - Sending a text + HTML email with a calendar ICS attachment in Django or Python

python - Scrapy request not passed to the callback on a 301?

python - Clicking an href button with Selenium

python - What is the output of tf.split?

python - How to dynamically create a csv file named after the spider in scrapy python

python - Scrapy: Crawled 0 pages (works in scrapy shell, but not with the scrapy crawl spider command)