python - Scrapy: 'exceptions.KeyError' in a CrawlSpider

I'm trying to scrape all the relevant fields from the following website so that I can load all the data into a spreadsheet:

http://yellowpages.com.gh/Home.aspx?

I guessed that a CrawlSpider is what I want, so this is what I've been trying to build:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
class YellowGH2Spider(CrawlSpider):
    name = "yellowGH2"
    allowed_domains = ["yellowpages.com.gh"]
    start_urls = ["http://yellowpages.com.gh/Home.aspx"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'http://yellowpages.com.gh/Home.aspx?mcaid=\d+#tabs-2', ))),
        Rule(SgmlLinkExtractor(allow=(r'http://yellowpages.com.gh/(Home|Search-Results).aspx?mcaid=[0-9&eca1id=]+(&lcaid=)?\d+#tabs-2', )), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=(r'http://yellowpages.com.gh/Company-Details/[a-zA-Z0-9-]+.aspx?returnurl=/Search-Results.aspx', )), callback='parse_item'),
        )
    def parse(self, response):
        #hxs = HtmlXPathSelector(response)
        #filename = response.url.split("/")[-2]
        #open(filename, 'wb').write(response.body)

        sel = Selector(response)
        item = Item()
        #item['catName']=sel.xpath('//div[@class="oneDirCat"]/h3/a/text()').extract()
        item['catLink']=sel.xpath('//div[@class="oneDirCat"]/h3/a/@href').extract()
        item['subcatText']=sel.xpath('//ul/li/a/@href').extract()
        item['subcatLink']=sel.xpath('//div[@class="oneDirCat"]/h3/a/text()').extract()
        item['company']=sel.xpath('//label/text()').extract()
        item['more']=sel.xpath('//td[@valign="bottom"]/a/@href').extract()
        item['address']=sel.xpath('//td[2]/text()').extract()
        item['postAddress']=sel.xpath('//td[4]/text()').extract()
        item['city']=sel.xpath('//td[6]/text()').extract()
        item['region']=sel.xpath('//td[8]/text()').extract()
        item['mobile']=sel.xpath('//td[12]/text()').extract()
        item['emailtext']=sel.xpath('//td[16]/a/text()').extract()
        item['emailLink']=sel.xpath('//td[16]/a/@href').extract()
        item['webtext']=sel.xpath('//td[18]/a/text()').extract()
        item['webLink']=sel.xpath('//td[18]/a/@href').extract()
        return item


            #print catName, catLink, subcatText, subcatLink, company, more,
            #address, postAddress, city, region, mobile, emailtext, emailLink,
            #webtext, webLink

However, when I run it from the command prompt, I get the following error:

exceptions.KeyError: 'Item does not support field: catLink'

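What is the most likely cause of this kind of error? Could it be related to the format of my XPaths? Or could it be because this spider shares the same items.py file as the original spider in the project?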

My items.py code is as follows:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class YellowghItem(Item):
    # define the fields for your item here like:
    # name = Field()
    catName = Field()
    catLink = Field()
    subcatText = Field()
    subcatLink = Field()
    company = Field()
    more = Field()
    address = Field()
    postAddress = Field()
    city = Field()
    region = Field()
    mobile = Field()
    emailtext = Field()
    emailLink = Field()
    webtext = Field()
    webLink = Field()

    #pass

Best answer

Here is why you see the error. Your items.py file defines the class YellowghItem, and that class declares the field catLink.

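In your spider, however, you never instantiate that class. Instead you instantiate Item(), which is Scrapy's base class, brought in by the line from scrapy.item import Item. The base class declares no fields at all, so the first assignment, item['catLink'] = ..., raises the KeyError. Your XPaths are not the cause, and neither is sharing items.py with another spider.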
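To make the mechanism concrete, here is a minimal sketch in plain Python 3 that imitates the behaviour. It is a simplified stand-in and not Scrapy's actual implementation: the Field, ItemMeta and Item classes below, the try_assign helper, and the sample link value are all assumptions written for illustration. The only point it demonstrates is that an item accepts a key only if its class declares a matching field.

```python
class Field(dict):
    """Stand-in for a field declaration (hypothetical, for illustration)."""


class ItemMeta(type):
    """Collects the Field attributes declared on a class into `fields`."""

    def __new__(mcs, name, bases, attrs):
        fields = {}
        # Inherit fields declared on parent classes, if any.
        for base in bases:
            fields.update(getattr(base, "fields", {}))
        # Add fields declared directly on this class.
        fields.update({k: v for k, v in attrs.items() if isinstance(v, Field)})
        cls = super().__new__(mcs, name, bases, attrs)
        cls.fields = fields
        return cls


class Item(metaclass=ItemMeta):
    """Simplified base item: declares no fields, so it accepts no keys."""

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        # Reject any key that was not declared as a Field on the class.
        if key not in self.fields:
            raise KeyError(
                "%s does not support field: %s" % (type(self).__name__, key)
            )
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


class YellowghItem(Item):
    """Mirrors the question's items.py, reduced to a single field."""

    catLink = Field()


def try_assign(item):
    """Return 'ok' if item['catLink'] can be set, else the error message."""
    try:
        item["catLink"] = ["/Home.aspx?mcaid=1"]  # made-up sample value
        return "ok"
    except KeyError as exc:
        return exc.args[0]


print(try_assign(Item()))          # base class: no declared fields
print(try_assign(YellowghItem()))  # subclass: catLink is declared
```

The first call reports the same kind of message as in the question, because the base class has an empty set of fields, while the second succeeds because the subclass declares catLink.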

Make these changes in your spider:

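Change the import: instead of importing Item from scrapy.item, import YellowghItem from your project's items module, for example from yourproject.items import YellowghItem, where yourproject is a placeholder for your project's package name (it is not shown in the question).
In your parse method, instantiate an object of this class: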
```
item = YellowghItem()
```

Try making these changes; I think they will resolve the error.

Hope this helps.

This question and answer, python - Scrapy: 'exceptions.KeyError' in a CrawlSpider, come from Stack Overflow: https://stackoverflow.com/questions/23184688/
