python - Error when using Scrapy

Tags: python web-scraping scrapy

I have this code in Python:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from site_auto_1.items import AutoItem


class AutoSpider(CrawlSpider):
    name = "auto"

    allowed_host = ["autowereld.nl"]

    url = "http://www.autowereld.nl/"

    start_urls = [
            "http://www.autowereld.nl/zoeken.html?mrk=187&mdl%5B%5D=463&prvan=500&prtot=3000&brstf%5B%5D=2&bjvan=2000&bjtot=2004&geoloc=&strl=&trns%5B%5D=&kmvan=&kmtot=&klr%5B%5D=&q=",
            ]

    path = '//*[@id="content-inhoud"]/div/div/table/tbody/tr/td/h3/a/@href'

    rules = (
            Rule(
                LinkExtractor(restrict_xpaths='//*[@id="content-inhoud"]/div/div/table/tbody/tr/td/h3/a/@href'),
                callback='parse_item',
            ),  
        )   

    def parse_item(self, response):
        print "found item :", response.url

It gives me this error:

Traceback (most recent call last):
      File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 638, in _tick
        taskObj._oneWorkUnit()
      File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 96, in iter_errback
        yield next(it)
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
        for x in result:
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/spiders/crawl.py", line 73, in _parse_response
        for request_or_item in self._requests_to_follow(response):
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/spiders/crawl.py", line 52, in _requests_to_follow
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/linkextractors/lxmlhtml.py", line 107, in extract_links
        links = self._extract_links(doc, response.url, response.encoding, base_url)
      File "/usr/lib/pymodules/python2.7/scrapy/linkextractor.py", line 94, in _extract_links
        return self.link_extractor._extract_links(*args, **kwargs)
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/linkextractors/lxmlhtml.py", line 50, in _extract_links
        for el, attr, attr_val in self._iter_links(selector._root):
      File "/usr/lib/pymodules/python2.7/scrapy/contrib/linkextractors/lxmlhtml.py", line 38, in _iter_links
        for el in document.iter(etree.Element):
    exceptions.AttributeError: 'str' object has no attribute 'iter'

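I didn't know what I was doing wrong, so I started commenting out code to see which part triggered the error, and I found that it is this part: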

rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="content-inhoud"]/div/div/table/tbody/tr/td/h3/a/@href'),
            callback='parse_item',
        ),  
    )   

But I don't know what I'm doing wrong. I tried making restrict_xpaths a list, a tuple... I'm new to Scrapy and I can't figure it out...

Best answer

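The XPath configured in restrict_xpaths should point to an element, not an attribute. The link extractor walks the nodes that the XPath selects and looks for links inside them. An XPath ending in /@href selects plain strings rather than elements, which is why the traceback ends with 'str' object has no attribute 'iter'.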

Replace:

//*[@id="content-inhoud"]/div/div/table/tbody/tr/td/h3/a/@href

with:

//*[@id="content-inhoud"]/div/div/table/tbody/tr/td/h3/a

Regarding "python - Error when using Scrapy", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28917235/
