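This is related to, but different from, an earlier question of mine, "Extracting p within h1 with Python/Scrapy": I have run into a case where Scrapy (for Python) does not extract a span tag inside an h4 tag.
The sample HTML is: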
<div class="event-specifics">
  <div class="event-location">
    <h3> Gourmet Matinee </h3>
    <h4>
      <span id="spanEventDetailPerformanceLocation">Knight Grove</span>
    </h4>
  </div>
</div>
I am trying to get the text "Knight Grove" from the span tag. In scrapy shell on the command line,
response.xpath('.//div[@class="event-location"]//span//text()').extract()
returns:
['Knight Grove']
and
response.xpath('.//div[@class="event-location"]/node()').extract()
returns the whole nodes, i.e.:
['\n ', '<h3>\n Gourmet Matinee</h3>', '\n ', '<h4><span id="spanEventDetailPerformanceLocation"><p>Knight Grove</p></span></h4>', '\n ']
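However, when the same XPath is run inside a spider, nothing is returned. Take the spider code below, written to scrape the page the sample HTML above comes from, https://www.clevelandorchestra.com/17-blossom--summer/1718-gourmet-matinees/2017-07-11-gourmet-matinee/ (some code unrelated to the question has been removed):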
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # required by the Rule below
from scrapy.loader import ItemLoader
from concertscraper.items import Concert
from scrapy import Selector
from scrapy.http import XmlResponse


class ClevelandOrchestra(CrawlSpider):
    name = 'clev2'
    allowed_domains = ['clevelandorchestra.com']
    start_urls = ['https://www.clevelandorchestra.com/']

    rules = (
        # Follow every link on the site and pass each page to parse_item
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        thisconcert = ItemLoader(item=Concert(), response=response)
        for concert in response.xpath('.//div[@class="event-wrap"]'):
            thisconcert.add_xpath('location', './/div[@class="event-location"]//span//text()')
        return thisconcert.load_item()
This returns nothing for item['location']. I also tried:
thisconcert.add_xpath('location','.//div[@class="event-location"]/node()')
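Unlike the earlier question about a p inside an h tag, span tags are allowed inside h tags in HTML, unless I am mistaken?
For clarity, the 'location' field is defined in the Concert() object, and I have disabled all pipelines for troubleshooting.
Could a span inside an h4 somehow be invalid HTML? If not, what might be causing this?
Interestingly, doing the same task with add_css(), like this: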
thisconcert.add_css('location','.event-location')
produces a node that contains the span tag but is missing its inner text:
['<div class="event-location">\r\n'
' <h3>\r\n'
' BLOSSOM MUSIC FESTIVAL </h3>\r\n'
' <h4><span '
'id="spanEventDetailPerformanceLocation"></span></h4>\r\n'
' </div>']
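The empty span in the add_css() output above suggests the text is simply not present in the HTML the spider downloads. A minimal sketch of that check, using only the standard library's html.parser rather than Scrapy itself: the helper names span_text and SpanTextParser are hypothetical, the "downloaded" string is the add_css() output above, and the "rendered" string is the h4 node from the node() output earlier in the question.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text found inside <span id="target_id">...</span>."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target span
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1  # a nested span inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1   # entering the target span

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def span_text(html, target_id):
    """Return the stripped text inside the span with the given id."""
    parser = SpanTextParser(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


SPAN_ID = "spanEventDetailPerformanceLocation"

# Markup as the spider received it (from the add_css() output)
downloaded = (
    '<div class="event-location">\r\n'
    '  <h3>\r\n'
    '  BLOSSOM MUSIC FESTIVAL </h3>\r\n'
    '  <h4><span id="spanEventDetailPerformanceLocation"></span></h4>\r\n'
    '</div>'
)

# Markup with the text filled in (from the node() output)
rendered = (
    '<h4><span id="spanEventDetailPerformanceLocation">'
    '<p>Knight Grove</p></span></h4>'
)

print(repr(span_text(downloaded, SPAN_ID)))  # ''
print(repr(span_text(rendered, SPAN_ID)))    # 'Knight Grove'
```

If the span is empty in the downloaded markup, no XPath or CSS expression can recover the text from that response; the text has to come from wherever the page fetches it.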
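To confirm that this is not a duplicate: in this particular example there is indeed a p tag inside the span tag, which is inside the h4 tag; however, the same behavior occurs when no p tag is involved, for example: https://www.clevelandorchestra.com/1718-concerts-pdps/1718-rental-concerts/1718-rentals-other/2017-07-21-cooper-competition/?performanceNumber=16195 .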
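Best answer
This content is loaded through an Ajax call. Scrapy does not execute JavaScript, so the spider only sees the HTML as the server sends it. To get the data, you need to make a similar POST
request yourself, and do not forget to set the content type: headers = {'content-type': "application/json"}
You then get a JSON document as the response.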
import requests

# Endpoint the page calls via Ajax to load performance data
url = "https://www.clevelandorchestra.com/Services/PerformanceService.asmx/GetToolTipPerformancesForCalendar"
payload = {"startDate": "2017-06-30T21:00:00.000Z", "endDate": "2017-12-31T21:00:00.000Z"}
headers = {'content-type': "application/json"}

json_response = requests.post(url, json=payload, headers=headers).json()

# The performance records are in the list under the "d" key
for performance in json_response['d']:
    print(performance["performanceName"], performance["dateString"])
# Star-Spangled Spectacular Friday, June 30, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Saturday, July 1, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Sunday, July 2, 2017
# Blossom: A Salute to America Monday, July 3, 2017
# Blossom: A Salute to America Tuesday, July 4, 2017
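The same request can be sketched with only the standard library, with the parsing step split out so it can be reused elsewhere (for example, a Scrapy callback could pass it response.text). This is a sketch under assumptions, not the site's documented API: the function names build_request, parse_performances and fetch_performances are hypothetical, the "d", "performanceName" and "dateString" keys are taken from the answer's code above, and the endpoint may have changed since the answer was written.

```python
import json
from urllib.request import Request, urlopen

URL = ("https://www.clevelandorchestra.com/Services/"
       "PerformanceService.asmx/GetToolTipPerformancesForCalendar")


def build_request(start_date, end_date):
    """Build (but do not send) the JSON POST described in the answer."""
    payload = {"startDate": start_date, "endDate": end_date}
    return Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def parse_performances(json_text):
    """Return (name, date) pairs from the JSON reply's "d" list."""
    records = json.loads(json_text).get("d") or []
    return [(r["performanceName"], r["dateString"]) for r in records]


def fetch_performances(start_date, end_date, timeout=30):
    """Send the request and parse the reply; requires network access."""
    request = build_request(start_date, end_date)
    with urlopen(request, timeout=timeout) as reply:
        return parse_performances(reply.read().decode("utf-8"))
```

Calling fetch_performances("2017-06-30T21:00:00.000Z", "2017-12-31T21:00:00.000Z") performs the same POST as the requests version. Keeping parse_performances free of network code means it can be tested against a saved reply.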
Source: the Stack Overflow question "python - Extracting HTML results with XPath in Scrapy fails because the content is loaded dynamically": https://stackoverflow.com/questions/44856285/