python - Scrapy crawling a local website by IP address

Tags: python web-crawler scrapy

I'm still experimenting with Scrapy, and I'm trying to crawl a website on my local network. The site's IP address is 192.168.0.185. Here is my spider:

 from scrapy.spider import BaseSpider

 class 192.168.0.185_Spider(BaseSpider):
     name = "192.168.0.185"
     allowed_domains = ["192.168.0.185"]
     start_urls = ["http://192.168.0.185/"]

     def parse(self, response):
         print "Test:", response.headers

Then, from the same directory as my spider, I run this shell command to start it:

scrapy crawl 192.168.0.185

I get a very ugly, unreadable error message:

 2012-02-10 20:55:18-0600 [scrapy] INFO: Scrapy 0.14.0 started (bot: tutorial)
 2012-02-10 20:55:18-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
 2012-02-10 20:55:18-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
 2012-02-10 20:55:18-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
 2012-02-10 20:55:18-0600 [scrapy] DEBUG: Enabled item pipelines:
 Traceback (most recent call last):
   File "/usr/bin/scrapy", line 5, in <module>
     pkg_resources.run_script('Scrapy==0.14.0', 'scrapy')
   File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 467, in run_script
     self.require(requires)[0].run_script(script_name, ns)
   File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 1200, in run_script
     execfile(script_filename, namespace, namespace)
   File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
     execute()
   File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/cmdline.py", line 132, in execute
     _run_print_help(parser, _run_command, cmd, args, opts)
   File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/cmdline.py", line 97, in _run_print_help
     func(*a, **kw)
   File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/cmdline.py", line 139, in _run_command
     cmd.run(args, opts)
   File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/commands/crawl.py", line 43, in run
     spider = self.crawler.spiders.create(spname, **opts.spargs)
   File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/spidermanager.py", line 43, in create
     raise KeyError("Spider not found: %s" % spider_name)
 KeyError: 'Spider not found: 192.168.0.185'

I then made another spider, identical to the first except that it uses a domain name instead of an IP address. This one works fine. Does anyone know what the deal is? How can I get Scrapy to crawl a website by IP address rather than by domain name?

from scrapy.spider import BaseSpider

class facebook_Spider(BaseSpider):
    name = "facebook"
    allowed_domains = ["facebook.com"]
    start_urls = ["http://www.facebook.com/"]

    def parse(self, response):
        print "Test:", response.headers

Best Answer

class 192.168.0.185_Spider(BaseSpider):
    ...

In Python, a class name cannot start with a digit or contain dots; it must be a valid identifier. See the documentation on Identifiers and keywords.
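As a quick sanity check, modern Python (3.x, newer than the Python 2.6 in the traceback above) can test identifier validity directly; a minimal sketch:

```python
import keyword

# The original class name is not a legal identifier: it starts with a
# digit and contains dots.
print("192.168.0.185_Spider".isidentifier())   # False

# An underscore-separated, letter-prefixed variant is fine.
name = "Ip192_168_0_185Spider"
print(name.isidentifier() and not keyword.iskeyword(name))   # True
```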

You can create this spider with a proper name:

$ scrapy startproject testproj
$ cd testproj
$ scrapy genspider testspider 192.168.0.185
  Created spider 'testspider' using template 'crawl' in module:
    testproj.spiders.testspider

The spider definition will look like this:

class TestspiderSpider(CrawlSpider):
    name = 'testspider'
    allowed_domains = ['192.168.0.185']
    start_urls = ['http://www.192.168.0.185/']
    ...

You should probably remove www from start_urls. To start crawling, use the spider name rather than the host:

$ scrapy crawl testspider
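Note that only the class name has to be a valid identifier; the spider's `name` attribute is an ordinary string and can stay as the IP address. If you generate such spiders for several hosts, a tiny helper (hypothetical, names are my own) can derive a legal class name from an IP:

```python
def ip_to_class_name(ip):
    """Derive a legal Python class name from an IP address.

    Identifiers cannot contain dots or start with a digit, so the
    dots become underscores and a letter prefix is added.
    """
    return "Ip" + ip.replace(".", "_") + "Spider"

print(ip_to_class_name("192.168.0.185"))  # Ip192_168_0_185Spider
```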

On "python - Scrapy crawling a local website by IP address", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/9237545/
