python - Scrapy crawler not able to crawl data from multiple pages

I am trying to scrape the results from the following page:

http://www.peekyou.com/work/autodesk/page=1

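page = 1, 2, 3, 4 ... and so on, depending on the number of results. I have a PHP file that runs the crawler once for each page number. The code (for a single page) is as follows: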

`import sys
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item
from scrapy.http import Request
#from scrapy.crawler import CrawlerProcess

class DmozSpider(BaseSpider):
    name = "peekyou_crawler"

    start_urls = ["http://www.peekyou.com/work/autodesk/page=1"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # Locate the "Next" pagination link (it is only counted here, never followed)
        discovery = hxs.select('//div[@class="nextPage"]/table/tr[2]/td/a[contains(@title,"Next")]')
        print len(discovery)

        print "Starting the actual file"
        # Each result on the page lives in its own "resultCell" div
        items = hxs.select('//div[@class="resultCell"]')
        count = 0
        for newsItem in items:
            print newsItem

            url = newsItem.select('h2/a/@href').extract()
            name = newsItem.select('h2/a/span/text()').extract()
            count = count + 1
            print count
            print url[0]
            print name[0]

            print "\n"

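`

The Autodesk results span 18 pages. When I run the code to crawl all of them, the crawler only gets the data from page 2 instead of from every page. The same happens when I change the company name to something else: it scrapes some pages and skips the others, even though I get an HTTP 200 response for every page. When I run it again, it usually scrapes the same pages as before, but not always. Any idea what might be wrong with my approach, or what I am missing?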

Thanks in advance.

Best answer

You can add more addresses:

start_urls = [
    "http://www.peekyou.com/work/autodesk/page=1",
    "http://www.peekyou.com/work/autodesk/page=2",
    "http://www.peekyou.com/work/autodesk/page=3"
];

You can generate more addresses:

# Pages are numbered from 1, so iterate over 1..18 rather than 0..17
start_urls = [
    "http://www.peekyou.com/work/autodesk/page=%d" % i for i in xrange(1, 19)
]
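In Python 3, where `xrange` no longer exists, the same list can be built with `range`. The sketch below is only an illustration: the helper name `page_urls` and the `BASE_URL` template are hypothetical, and the page count of 18 is the figure quoted in the question for Autodesk.

```python
# URL template assumed from the address shown in the question.
BASE_URL = "http://www.peekyou.com/work/{company}/page={page}"


def page_urls(company, last_page):
    """Return the result-page URLs for pages 1..last_page (inclusive)."""
    return [
        BASE_URL.format(company=company, page=page)
        for page in range(1, last_page + 1)
    ]


# 18 is the number of Autodesk result pages mentioned in the question.
start_urls = page_urls("autodesk", 18)
print(start_urls[0])
print(start_urls[-1])
```

Passing the company name as a parameter makes it easy to reuse the helper for other searches, although the number of pages then has to be known in advance.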

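I think you should read about start_requests() and how to generate the next URL. I can't help you with that, though, because I don't use Scrapy. I still use pure Python (and pyQuery) to build simple crawlers ;)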

PS. Sometimes the server checks your User-Agent, your IP, and how quickly you request the next page, and then stops sending you pages.
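If throttling of this kind is the cause, slowing the crawler down and identifying it clearly may help. The fragment below is a sketch of a Scrapy `settings.py`; the setting names are standard Scrapy settings, but every value (including the placeholder contact URL in the User-Agent string) is an illustrative assumption rather than something tuned for this site.

```python
# settings.py (fragment); all values are illustrative assumptions.

# Identify the crawler explicitly; the contact URL is a placeholder.
USER_AGENT = "peekyou-research-bot (+http://example.com/contact)"

# Wait between consecutive requests to the same site (seconds).
DOWNLOAD_DELAY = 2.0

# Send at most one request at a time to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adapt the delay to the server's observed response times.
AUTOTHROTTLE_ENABLED = True

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True
```

Whether these settings are enough depends on how the server decides to stop responding, which the question does not reveal.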

Regarding python - Scrapy crawler not able to crawl data from multiple pages, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16862711/
