python - Invalid selector error: Webscraping different kinds of text from multiple spans using xpath and Selenium

I am trying to scrape a list of authors, separated by commas and including the asterisks [important], in the following format:

First Last, First Last, First Last*, First Last

The part of the HTML I am scraping is quite complex, but I have successfully tested an XPath that yields the text and symbols I want.

//span[@class="hlFld-ContribAuthor"]/span[@class="hlFld-ContribAuthor"]/a/text() | //span[@class="NLM_x"]/x/text() | //a[@class="ref"]/sup/text()

The result is as follows:

However, when I use that expression in my Python code, I get an error.

My code:

# get authors
xpath = "//span[@class=\"hlFld-ContribAuthor\"]/span[@class=\"hlFld-ContribAuthor\"]/a/text() | //span[@class=\"NLM_x\"]/x/text() | //a[@class=\"ref\"]/sup/text()"
# this call raises InvalidSelectorException: the XPath selects text nodes, not elements
authors = driver.find_element_by_xpath(xpath)
print(str(authors))

Error:

InvalidSelectorException: Message: The given selector //span[@class="hlFld-ContribAuthor"]/span[@class="hlFld-ContribAuthor"]/a/text() | //span[@class="NLM_x"]/x/text() | //a[@class="ref"]/sup/text() is either invalid or does not result in a WebElement. The following error occurred: InvalidSelectorError: The result of the xpath expression "//span[@class="hlFld-ContribAuthor"]/span[@class="hlFld-ContribAuthor"]/a/text() | //span[@class="NLM_x"]/x/text() | //a[@class="ref"]/sup/text()" is: [object Text]. It should be an element.

How can I get Selenium to fetch the right text and symbols in the right order? I also cannot print the XPath results without line breaks between them.

Edit: the XPath error was solved by removing /text() from the XPath.

Best answer

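The function driver.find_element_by_xpath(my_xpath) expects the node identified by my_xpath to be a DOM element, and it raises an error if it is not. Every branch of your XPath expression returns a text node, which is what causes the error.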

To return DOM elements, change the XPath expression to:

"//span[@class=\"hlFld-ContribAuthor\"]/span[@class=\"hlFld-ContribAuthor\"]/a |//span[@class=\"NLM_x\"]/x |//a[@class=\"ref\"]/sup"

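The change amounts to dropping the trailing /text() step from each branch of the union. A minimal sketch of that transformation is shown here; strip_text_steps is a hypothetical helper name, and it assumes that the "|" character only appears as the union operator (not inside a quoted predicate value).

```python
def strip_text_steps(xpath):
    """Remove a trailing /text() step from every branch of a union XPath.

    Assumption: "|" is only used as the union operator, so a plain split
    is safe. This is an illustration, not a general XPath parser.
    """
    suffix = "/text()"
    branches = [branch.strip() for branch in xpath.split("|")]
    cleaned = [
        branch[: -len(suffix)] if branch.endswith(suffix) else branch
        for branch in branches
    ]
    return " | ".join(cleaned)


original = (
    '//span[@class="hlFld-ContribAuthor"]/span[@class="hlFld-ContribAuthor"]/a/text()'
    ' | //span[@class="NLM_x"]/x/text()'
    ' | //a[@class="ref"]/sup/text()'
)
element_xpath = strip_text_steps(original)
```
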
Also, since you want to return multiple elements, you should use driver.find_elements_by_xpath (note the plural) instead of driver.find_element_by_xpath.
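As a side note for newer installations: the find_element_by_xpath and find_elements_by_xpath shortcuts were deprecated in Selenium 4 and later removed, while the generic find_elements(by, value) method is available in both old and new versions. The sketch below wraps that call; find_all_by_xpath is a hypothetical helper name, and driver is assumed to be an already created WebDriver instance.

```python
def find_all_by_xpath(driver, xpath):
    """Return a list of all elements matching xpath.

    The plural call returns an empty list when nothing matches, whereas
    the singular find_element returns only the first match and raises
    NoSuchElementException when there is none.
    """
    # "xpath" is the string value of selenium's By.XPATH constant, so this
    # is equivalent to driver.find_elements(By.XPATH, xpath).
    return driver.find_elements("xpath", xpath)
```

With a real driver this would be called as authors = find_all_by_xpath(driver, xpath), using the element-returning XPath without /text().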

You will then be able to get the text you need from each author element by looping over authors:

for author in authors:
    print(author.text)
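The loop above prints each piece on its own line, while the question asks for a single line such as "First Last, First Last*, First Last". One way to get there is to join the pieces before printing. This is a minimal sketch rather than part of the original answer: join_author_parts and author_line are hypothetical names, and it assumes that the elements are returned in document order, that each separator element contains just a comma, and that each element exposes its visible text through a .text attribute as Selenium's WebElement does.

```python
def join_author_parts(parts):
    """Concatenate text pieces, in the order given, into a single line.

    Assumption: a piece that is just "," is a separator and should be
    followed by one space; every other piece (a name, a "*" marker) is
    appended directly to whatever came before it.
    """
    line = ""
    for part in parts:
        part = part.strip()
        if not part:
            continue  # skip pieces that contain only whitespace
        line += ", " if part == "," else part
    return line.strip()


def author_line(elements):
    """Build the one-line author list from objects with a .text attribute."""
    return join_author_parts(element.text for element in elements)
```

With the authors list from above, print(author_line(authors)) would then emit the whole list on one line. If the page uses other separators (for example "and"), the comma rule would need to be adapted.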

This question and its answer come from Stack Overflow: https://stackoverflow.com/questions/35491159/
