I am trying to use XPath to get elements on a page that sit inside a <script> tag. For example:
<div id="foo">
<script>
<p>You can't get me.</p>
</script>
</div>
If I try
response.xpath('//div[@id="foo"]//p')
or
response.xpath('//div[@id="foo"]/script/p')
both return an empty list. How can I use XPath to get elements inside a <script> tag?
Best answer
eLRuLL gave a better and more elegant answer to my question. His solution is as follows:
from scrapy import Selector
# First, retrieve the content within the <script> tag:
text = response.xpath('//script/text()').extract_first()
# Then, create a Selector from that text
sel = Selector(text=text)
# Now we can use XPath normally, as if the text were an ordinary HTML response
sel.xpath('//p/text()').extract_first()
Old answer:
A <script> node has only text children, which is why XPath does not descend into the <script> tag. However, I found a workaround.
from scrapy.http import HtmlResponse
# First, retrieve the content within the <script> tag:
text = response.xpath('//script/text()').extract_first()
# Then, encode it
text_encoded = text.encode('utf-8')
# Now, convert it to an HtmlResponse object ('some url' is a placeholder)
text_in_html = HtmlResponse(url='some url', body=text_encoded, encoding='utf-8')
# Now we can use XPath normally, as if the text were an ordinary HTML response
text_in_html.xpath('//p/text()').extract_first()
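The point that a <script> node holds only text can be checked without Scrapy. The sketch below is an illustration under a stated assumption: it uses Python's standard `html.parser` instead of the lxml-based parser behind Scrapy's selectors, so it only shows analogous behavior. `TagRecorder` and `parse` are hypothetical names made up for this example.

```python
from html.parser import HTMLParser

PAGE = """
<div id="foo">
<script>
<p>You can't get me.</p>
</script>
</div>
"""


class TagRecorder(HTMLParser):
    """Hypothetical helper: records start tags and the raw text inside <script>."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Inside <script>, markup such as <p> arrives here as plain text.
        if self._in_script:
            self.script_text.append(data)


def parse(html):
    recorder = TagRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder


# First pass: the parser never reports a <p> element.
outer = parse(PAGE)

# Second pass over the script text: now <p> is parsed as an element.
inner = parse("".join(outer.script_text))
```

After the first pass `outer.start_tags` holds only `div` and `script`; the `p` tag shows up only once the script text is parsed again, which is the same two-step approach both answers take.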
Regarding xpath - retrieving elements inside a <script> tag with XPath, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53178974/