python - How do I select a link whose text is a right arrow with XPath?

Tags: python xpath unicode scrapy lxml

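I am trying to select the "next" button on a website, which uses a right arrow as its link text. When I look at the source in the Scrapy shell, the character is shown as its unicode escape "\u2192". Based on that, I wrote the following Scrapy CrawlSpider: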

# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.loader.processor import MapCompose
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy import log, Request
from yelpscraper.items import YelpscraperItem
import re, urlparse


class YelpSpider(CrawlSpider):
    name = 'yelp'
    allowed_domains = ['yelp.com']
    start_urls = ['http://www.yelp.com/search?find_desc=attorney&find_loc=Austin%2C+TX&start=0']

    rules = (
        Rule(LinkExtractor(allow=r'biz', restrict_xpaths='//*[contains(@class, "natural-search-result")]//a[@class="biz-name"]'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'start', restrict_xpaths=u'//a[contains(@class, "prev-next")]/text()[contains(., "\u2192")]'), follow=True)
    )

    def parse_item(self, response):
        i = YelpscraperItem()
        i['phone'] = self.beautify(response.xpath('//*[@class="biz-phone"]/text()').extract())
        i['state'] = self.beautify(response.xpath('//span[@itemprop="addressRegion"]/text()').extract())
        i['company'] = self.beautify(response.xpath('//h1[contains(@class, "biz-page-title")]/text()').extract())

        website = i['website'] = self.beautify(response.xpath('//div[@class="biz-website"]/a/text()').extract())
        yield i

Note the second Rule in the rules tuple, which contains the problematic unicode character:

Rule(LinkExtractor(allow=r'start', restrict_xpaths=u'//a[contains(@class, "prev-next")]/text()[contains(., "\u2192")]'), follow=True)

When I try to run this spider, I get the following traceback:

Traceback (most recent call last):
    File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
     call.func(*call.args, **call.kw)
    File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 607, in _tick
     taskObj._oneWorkUnit()
    File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
     result = next(self._iterator)
    File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\utils\defer.py", line 57, in <genexpr>
     work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
    File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\utils\defer.py", line 96, in iter_errback
     yield next(it)
    File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\spidermiddleware\offsite.py", line 26, in process_spider_output
     for x in result:
    File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\spidermiddleware\referer.py", line 22, in <genexpr>
     return (_set_referer(r) for r in result or ())
    File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\spidermiddleware\urllength.py", line 33, in <genexpr>
     return (r for r in result or () if _filter(r))
    File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\spidermiddleware\depth.py", line 50, in <genexpr>
     return (r for r in result or () if _filter(r))
    File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\spiders\crawl.py", line 73, in _parse_response
     for request_or_item in self._requests_to_follow(response):
    File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\spiders\crawl.py", line 52, in _requests_to_follow
     links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
    File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\linkextractors\lxmlhtml.py", line 107, in extract_links
     links = self._extract_links(doc, response.url, response.encoding, base_url)
    File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\linkextractor.py", line 94, in _extract_links
     return self.link_extractor._extract_links(*args, **kwargs)
    File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\linkextractors\lxmlhtml.py", line 50, in _extract_links
     for el, attr, attr_val in self._iter_links(selector._root):
    File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\linkextractors\lxmlhtml.py", line 38, in _iter_links
     for el in document.iter(etree.Element):
    exceptions.AttributeError: 'unicode' object has no attribute 'iter'

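All I want is to select this link, but I can't think of a way to select it without using this character (its position changes depending on the page). Is there a way to select it with an ASCII code, or with something other than the unicode escape, since that seems to be what causes the problem?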
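As background for the ASCII question, here is a minimal sketch of what "\u2192" stands for. It assumes Python 3 (the spider above is Python 2), and the helper name describe is made up for illustration. XPath 1.0 string literals have no escape syntax of their own, so the arrow has to be supplied by the host language, either as the literal character or through an escape such as "\u2192".

# Inspect the character behind the "\u2192" escape (Python 3 assumed).
import unicodedata

ARROW = "\u2192"  # the same one-character string as the literal "→"


def describe(ch):
    """Return basic facts about a single character (hypothetical helper)."""
    return {
        "name": unicodedata.name(ch),
        "code_point": "U+%04X" % ord(ch),
        "utf8_bytes": ch.encode("utf-8"),
        "is_ascii": ord(ch) < 128,
    }


info = describe(ARROW)
print(info)

The character lies outside the ASCII range, so there is no ASCII code for it. The escape is only an ASCII-safe way of writing the same character in source code.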

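Best answer

According to the documentation, restrict_xpaths should be a str or a list.

You are passing a unicode string. That is why you get the error.

Also, you don't need to check text(); checking for the prev-next class is enough: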

rules = (
    Rule(LinkExtractor(allow=r'biz', restrict_xpaths='//*[contains(@class, "natural-search-result")]//a[@class="biz-name"]'),
         callback='parse_item', follow=True),
    Rule(LinkExtractor(allow=r'start', restrict_xpaths='//a[contains(@class, "prev-next")]'),
         follow=True)
)

Tested (it crawls without errors and follows the pagination).
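The last frames of the traceback show the link extractor calling .iter on whatever the XPath returned, which suggests it expects elements rather than text. The sketch below illustrates that distinction using only the standard library. It is not the Scrapy code path, the markup is a made-up stand-in for the real page, and find_next_links is a hypothetical helper. It also matches prev-next as a whole class token, whereas the XPath in the rule does a substring match with contains().

# Stdlib-only illustration: select the <a> element, not its text node.
import xml.etree.ElementTree as ET

ARROW = "\u2192"

# Hypothetical pagination markup; the real page may differ.
SAMPLE = (
    '<div class="pagination">'
    '<a class="page-option prev-next" href="/search?start=0">\u2190 Previous</a>'
    '<a class="page-option prev-next" href="/search?start=20">Next \u2192</a>'
    '</div>'
)


def find_next_links(root, marker=ARROW):
    """Return <a> elements with class prev-next whose text contains marker."""
    matches = []
    for a in root.iter("a"):
        classes = a.get("class", "").split()
        text = "".join(a.itertext())
        if "prev-next" in classes and marker in text:
            matches.append(a)
    return matches


root = ET.fromstring(SAMPLE)
next_links = find_next_links(root)
hrefs = [a.get("href") for a in next_links]
print(hrefs)

# An element can be iterated further; a plain string cannot.
print(hasattr(next_links[0], "iter"), hasattr(next_links[0].text, "iter"))

If the arrow test is still wanted in XPath, one option is to move it into a predicate on the a element so that the expression still ends on an element, for example //a[contains(@class, "prev-next")][contains(., "\u2192")]. That expression has not been tested against the site.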

For python - How do I select a link whose text is a right arrow with XPath?, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27596647/
