python - lxml.etree.XPathEvalError : Invalid expression

Tags: python xpath lxml

I'm getting an error in Python that I can't make sense of. I've reduced my code to the bare minimum:

response = requests.get('http://pycoders.com/archive')
tree = html.fromstring(response.text)
r = tree.xpath('//divass="campaign"]/a/@href')
print(r)

and I still get the error:

Traceback (most recent call last):
File "ultimate-1.py", line 17, in <module>
r = tree.xpath('//divass="campaign"]/a/@href')
File "lxml.etree.pyx", line 1509, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:50702)
File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:145954)
File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962)
File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817)
lxml.etree.XPathEvalError: Invalid expression

Does anyone know where the problem is? Could it be a dependency issue? Thanks.

Best answer

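The expression '//divass="campaign"]/a/@href' is not syntactically valid XPath: the closing "]" has no matching "[", and "divass" is not an attribute test. You most likely meant to check the class attribute: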

//div[@class="campaign"]/a/@href
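As a quick offline check of the difference between the two expressions, here is a minimal sketch that runs both against a small hand-written HTML snippet. The markup and the example.com URLs are made up for illustration and are not taken from the real page.

```python
from lxml import etree, html

# Hypothetical sample markup, used only for illustration.
sample = (
    '<div>'
    '<div class="campaign"><a href="http://example.com/a">A</a></div>'
    '<div class="campaign"><a href="http://example.com/b">B</a></div>'
    '</div>'
)
tree = html.fromstring(sample)

# The malformed expression from the question is rejected by the XPath parser.
try:
    tree.xpath('//divass="campaign"]/a/@href')
    broken_error = None
except etree.XPathEvalError as exc:
    broken_error = str(exc)

# The corrected expression filters div elements by their class attribute.
links = tree.xpath('//div[@class="campaign"]/a/@href')

print(broken_error)
print(links)
```

Catching etree.XPathEvalError like this is only useful for experimenting; in real code the expression should simply be written correctly.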

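That fixes the "Invalid expression" error, but the expression still won't find anything, because the data is not present in the response that requests receives. You need to mimic what the browser does to get the desired data: make an additional request for the JavaScript file that contains the campaigns.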
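To see why the corrected expression can still come back empty, consider this small sketch. It assumes the static HTML looks roughly like the hand-written sample below (a script tag whose src contains 'generate-js', and no campaign div); the sample markup and its URL are hypothetical.

```python
from lxml import html

# Hypothetical stand-in for the static HTML returned by requests:
# the campaign links are not in the markup, only a script that writes them.
static_html = """
<html><body>
  <h1>Archive</h1>
  <script src="http://example.com/generate-js/?u=abc"></script>
</body></html>
"""
tree = html.fromstring(static_html)

# A valid expression, but nothing matches in the static markup.
campaign_links = tree.xpath('//div[@class="campaign"]/a/@href')

# The script URL that would have to be fetched separately.
script_urls = tree.xpath("//script[contains(@src, 'generate-js')]/@src")

print(campaign_links)
print(script_urls)
```

An empty result from a valid expression is a hint to inspect the raw response rather than the DOM shown in the browser's developer tools.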

Here is what worked for me:

import ast
import re

import requests
from lxml import html

with requests.Session() as session:
    # extract script url
    response = session.get('http://pycoders.com/archive')
    tree = html.fromstring(response.text)
    script_url = tree.xpath("//script[contains(@src, 'generate-js')]/@src")[0]

    # get the script
    response = session.get(script_url)
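    # use response.text (str): on Python 3, response.content is bytes and cannot be matched with a str pattern
    data = ast.literal_eval(re.match(r'document\.write\((.*?)\);$', response.text).group(1))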

    # extract the desired data
    tree = html.fromstring(data)
    campaigns = [item.attrib["href"].replace("\\", "") for item in tree.xpath('//div[@class="campaign"]/a')]
    print(campaigns)

Prints:

['http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=3384ab2140', 
 ...
 'http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=8b91cb0481'
]
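The extraction step can also be tried without any network access. The sketch below is a variation, not the code from the answer: it uses json.loads instead of ast.literal_eval, on the assumption that the argument to document.write is a double-quoted, JSON-compatible string literal. The helper name extract_campaign_links and the sample payload are hypothetical.

```python
import json
import re

from lxml import html


def extract_campaign_links(js_source):
    """Pull campaign hrefs out of a document.write("...") call.

    Assumes the argument is a JSON-compatible string literal.
    """
    match = re.match(r'document\.write\((.*?)\);$', js_source)
    if match is None:
        raise ValueError("no document.write(...) call found")
    # json.loads decodes \" and \/ escapes, so no backslash cleanup is needed.
    markup = json.loads(match.group(1))
    tree = html.fromstring(markup)
    return [str(href) for href in tree.xpath('//div[@class="campaign"]/a/@href')]


# Hypothetical payload shaped like the script response described above.
sample_js = (
    r'document.write("<div class=\"campaign\">'
    r'<a href=\"http:\/\/example.com\/issue-1\">Issue 1<\/a><\/div>");'
)
print(extract_campaign_links(sample_js))
```

If the real payload is not valid JSON (for example, a single-quoted string), this variation will fail and the ast.literal_eval approach shown earlier is the one to keep.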

Regarding python - lxml.etree.XPathEvalError : Invalid expression, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37255555/
