python - Scrapy crawler not handling XHR requests

Tags: python web-scraping xmlhttprequest scrapy scrape

My spider only crawls the first 10 pages, so I assume it is not triggering the request behind the "Load more" button.

I am scraping this site: http://www.t3.com/reviews.

My spider code:

import scrapy
from scrapy.conf import settings
from scrapy.http import Request
from scrapy.selector import Selector
from reviews.items import ReviewItem


class T3Spider(scrapy.Spider):
    name = "t3" #spider name to call in terminal
    allowed_domains = ['t3.com'] #the domain where the spider is allowed to crawl
    start_urls = ['http://www.t3.com/reviews'] #url from which the spider will start crawling

    def parse(self, response):
        sel = Selector(response)
        review_links = sel.xpath('//div[@id="content"]//div/div/a/@href').extract()
        for link in review_links:
            yield Request(url="http://www.t3.com"+link, callback=self.parse_review)
        # if there is a load-more button:
        if sel.xpath('//*[@class="load-more"]'):
            req = Request(url=r'http://www\.t3\.com/more/reviews/latest/\d+',
                          headers={"Referer": "http://www.t3.com/reviews",
                                   "X-Requested-With": "XMLHttpRequest"},
                          callback=self.parse)
            yield req
        else:
            return

    def parse_review(self, response):
        pass  # all my scraped item fields

What am I doing wrong? Sorry, I'm very new to Scrapy. Thanks for your time, patience, and help.

Best answer

If you inspect the "Load more" button, you will not find any indication of how the link for loading more reviews is constructed. The idea behind it is fairly simple: the number after http://www.t3.com/more/reviews/latest/ looks suspiciously like the timestamp of the last loaded article. Here is how to get it:

import calendar

from dateutil.parser import parse
import scrapy
from scrapy.http import Request


class T3Spider(scrapy.Spider):
    name = "t3"
    allowed_domains = ['t3.com']
    start_urls = ['http://www.t3.com/reviews']

    def parse(self, response):
        reviews = response.css('div.listingResult')
        for review in reviews:
            link = review.xpath("a/@href").extract()[0]
            yield Request(url="http://www.t3.com" + link, callback=self.parse_review)

        # TODO: handle exceptions here

        # extract the review date
        time = reviews[-1].xpath(".//time/@datetime").extract()[0]

        # convert a date into a timestamp
        timestamp = calendar.timegm(parse(time).timetuple())

        url = 'http://www.t3.com/more/reviews/latest/%d' % timestamp
        req = Request(url=url,
                      headers={"Referer": "http://www.t3.com/reviews", "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse)
        yield req

    def parse_review(self, response):
        print(response.url)
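The timestamp conversion at the heart of this answer can be tried in isolation. Below is a stdlib-only sketch that swaps `dateutil.parser.parse` for `datetime.strptime`; the ISO-like format string and the sample datetime value are assumptions about what the site's `<time datetime="...">` attribute contains:

```python
import calendar
from datetime import datetime


def datetime_to_timestamp(value):
    """Convert an ISO-like datetime string to a UTC Unix timestamp."""
    # dateutil.parser.parse accepts many formats; strptime needs an exact
    # one, so the format below is an assumption about the page's markup
    dt = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    # timegm (unlike time.mktime) interprets the struct_time as UTC
    return calendar.timegm(dt.timetuple())


# example: the kind of value found in a <time datetime="..."> attribute
print(datetime_to_timestamp("2015-05-20T12:00:00"))  # → 1432123200
```

The resulting integer is what gets appended to `http://www.t3.com/more/reviews/latest/` when building the next request.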

Notes:

  • This requires the dateutil module to be installed
  • You should re-check the code and make sure you are getting all of the reviews without skipping any
  • You should find some way to terminate this "load more" loop
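One way to address the last note: stop issuing "load more" requests when a page comes back empty or the oldest timestamp stops decreasing. A minimal sketch of such a stopping rule, independent of Scrapy (the function name and the sample timestamp lists are made up for illustration):

```python
def next_timestamp(page_timestamps, last_timestamp=None):
    """Return the timestamp to request the next batch with,
    or None when pagination should stop."""
    if not page_timestamps:
        # empty page: nothing more to load
        return None
    oldest = min(page_timestamps)
    if last_timestamp is not None and oldest >= last_timestamp:
        # no progress since the previous request: same page repeated
        return None
    return oldest


print(next_timestamp([1432123200, 1432000000]))                 # → 1432000000
print(next_timestamp([1432000000], last_timestamp=1432000000))  # → None
print(next_timestamp([]))                                       # → None
```

In the spider, `parse` would only yield the next `Request` when this returns a value, passing the previous timestamp along via `Request.meta` or a keyword argument.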

Regarding "python - Scrapy crawler not handling XHR requests", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30348948/
