python - Scrapy only crawls the first level of a website

Tags: python web-crawler scrapy

I am using Scrapy to crawl all the web pages under a domain.

I have looked at this question, but it has no solution. My problem seems to be similar. The output of my crawl command looks like this:

scrapy crawl sjsu

2012-02-22 19:41:35-0800 [scrapy] INFO: Scrapy 0.14.1 started (bot: sjsucrawler)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled item pipelines: 
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider opened
2012-02-22 19:41:35-0800 [sjsu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-02-22 19:41:35-0800 [sjsu] DEBUG: Crawled (200) <GET http://cs.sjsu.edu/> (referer: None)
2012-02-22 19:41:35-0800 [sjsu] INFO: Closing spider (finished)
2012-02-22 19:41:35-0800 [sjsu] INFO: Dumping spider stats:
    {'downloader/request_bytes': 198,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 11000,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 788155),
     'scheduler/memory_enqueued': 1,
     'start_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 379951)}
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider closed (finished)
2012-02-22 19:41:35-0800 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 29663232, 'memusage/startup': 29663232}

The problem here is that the crawl finds the links on the first page but does not visit them. What use is a crawler like that?

Edit:

My crawler code is:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class SjsuSpider(BaseSpider):
    name = "sjsu"
    allowed_domains = ["sjsu.edu"]
    start_urls = [
        "http://cs.sjsu.edu/"
    ]

    def parse(self, response):
        filename = "sjsupages"
        open(filename, 'wb').write(response.body)

All my other settings are the defaults.
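For context: a BaseSpider only requests the URLs in start_urls, and nothing else is fetched unless parse yields further Request objects. A minimal sketch of following the links by hand, assuming the Scrapy 0.14 / Python 2 API shown in the log above (the spider name is made up for illustration), would look like this:

from urlparse import urljoin  # Python 2 standard library

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class SjsuFollowSpider(BaseSpider):
    # hypothetical spider name, for illustration only
    name = "sjsu_follow"
    allowed_domains = ["sjsu.edu"]
    start_urls = ["http://cs.sjsu.edu/"]

    def parse(self, response):
        # append the body of every crawled page to one file
        open("sjsupages", 'ab').write(response.body)

        # extract every href and schedule it with the same callback;
        # OffsiteMiddleware drops links outside allowed_domains and the
        # default dupefilter skips URLs that were already requested
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            yield Request(urljoin(response.url, href), callback=self.parse)

The accepted answer below reaches the same goal with less code by using CrawlSpider.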

Best Answer

I think the best way is to use a CrawlSpider. You have to modify your code to something like the following in order to find all the links on the first page and visit them:

# imports for Scrapy 0.14, as used in the question
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class SjsuSpider(CrawlSpider):

    name = 'sjsu'
    allowed_domains = ['sjsu.edu']
    start_urls = ['http://cs.sjsu.edu/']
    # allow=() matches every link on the page
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        # selector available here if you want to extract data from the page
        x = HtmlXPathSelector(response)

        filename = "sjsupages"
        # open the file in append mode and write the binary response body
        open(filename, 'ab').write(response.body)

If you want to crawl every link on the site (and not only the links on the first level), the rule also has to follow the links it extracts. follow defaults to False whenever a callback is given, which is why the example above stops at the first level; and because CrawlSpider applies only the first rule that matches a link, adding a second catch-all rule would not help. So change the rules variable to this:

rules = [
    Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)
]

Finally, note that I changed your "parse" callback to "parse_item" because of this:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
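To make that warning concrete, here is a minimal sketch; the class and spider names are made up, and the imports assume the same Scrapy 0.14 modules as above:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class BrokenSpider(CrawlSpider):
    name = 'broken'
    start_urls = ['http://cs.sjsu.edu/']
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse')]

    # Defining parse() here overrides CrawlSpider.parse, the method that
    # applies the rules, so no links are ever extracted or followed.
    def parse(self, response):
        pass

class WorkingSpider(CrawlSpider):
    name = 'working'
    start_urls = ['http://cs.sjsu.edu/']
    # Use a differently named callback and let CrawlSpider.parse do its work.
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass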

For more information, see: http://doc.scrapy.org/en/0.14/topics/spiders.html#crawlspider

Regarding "python - Scrapy only crawls the first level of a website", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/9406895/

Related articles:

xpath syntax in Scrapy

python - How to recursively crawl every link on a website using Scrapy?

python - Converting the dtypes of a Pandas dataframe to match another

python - How to create and serialize an unmanaged model in Django

python - Sending a text + HTML email with a calendar ICS attachment in Django or Python

python - Scrapy request not passed to the callback on a 301?

python - Clicking an href button with Selenium

python - What is the output of tf.split?

python - How to dynamically create a csv file named after the spider in scrapy python

python - Scrapy: Crawled 0 pages (works in scrapy shell, but not with the scrapy crawl spider command)