python - Scrapy — scrape a page, then scrape the next page

Tags: python web-scraping scrapy

I am trying to scrape professor statistics from RateMyProfessors, using the fields defined in my items.py file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class ScraperItem(Item):
    # define the fields for your item here like:
    numOfPages = Field() # number of pages of professors (usually 476)

    firstMiddleName = Field() # first (and middle) name
    lastName = Field() # last name
    numOfRatings = Field() # number of ratings
    overallQuality = Field() # numerical rating
    averageGrade = Field() # letter grade
    profile = Field() # url of professor profile

    pass

Here is my scraper_spider.py file:

import scrapy

from scraper.items import ScraperItem
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor


class scraperSpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["www.ratemyprofessors.com"]
    start_urls = [
    "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName&schoolName=pennsylvania+state+university"
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="nextLink"]'),
             callback='parse', follow=True),
    )

    def parse(self, response):
        # professors = []
        numOfPages = int(response.xpath('((//a[@class="step"])[last()])/text()').extract()[0])

        # create array of profile links
        profiles = response.xpath('//li[@class="listing PROFESSOR"]/a/@href').extract()

        # for each of those links
        for profile in profiles:
            # define item
            professor = ScraperItem()

            # add profile to professor
            professor["profile"] = profile

            # pass each page to the parse_profile() method
            request = scrapy.Request("http://www.ratemyprofessors.com"+profile,
                 callback=self.parse_profile)
            request.meta["professor"] = professor

            # add professor to array of professors
            yield request


    def parse_profile(self, response):
        professor = response.meta["professor"]

        if response.xpath('//*[@class="pfname"]'):
            # scrape each item from the link that was passed as an argument and add to current professor
            professor["firstMiddleName"] = response.xpath('//h1[@class="profname"]/span[@class="pfname"][1]/text()').extract() 

        if response.xpath('//*[@class="plname"]'):
            professor["lastName"] = response.xpath('//h1[@class="profname"]/span[@class="plname"]/text()').extract()

        if response.xpath('//*[@class="table-toggle rating-count active"]'):
            professor["numOfRatings"] = response.xpath('//div[@class="table-toggle rating-count active"]/text()').extract()

        if response.xpath('//*[@class="grade"]'):
            professor["overallQuality"] = response.xpath('//div[@class="breakdown-wrapper"]/div[@class="breakdown-header"][1]/div[@class="grade"]/text()').extract()

        if response.xpath('//*[@class="grade"]'):
            professor["averageGrade"] = response.xpath('//div[@class="breakdown-wrapper"]/div[@class="breakdown-header"][2]/div[@class="grade"]/text()').extract()

        return professor

# add string to rule.  linkextractor only gets "/showratings.." not "ratemyprofessors.com/showratings"
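The trailing comment above touches on a real pitfall: the profile hrefs are site-relative ("/showratings..."), so they must be joined to the domain before requesting them. Rather than string concatenation, the standard library's urljoin (or `response.urljoin` in newer Scrapy versions) handles this correctly; the `tid` value below is made up for illustration:

```python
from urllib.parse import urljoin

# the directory page we scraped the relative href from
base = "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName"
href = "/ShowRatings.jsp?tid=12345"  # hypothetical site-relative profile link

# urljoin replaces the path and query of the base with the relative href
print(urljoin(base, href))
# http://www.ratemyprofessors.com/ShowRatings.jsp?tid=12345
```

Unlike naive concatenation, this also stays correct if the site ever emits absolute URLs or links relative to the current path.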

My problem is in the scraper_spider.py file above. The spider should go to this RateMyProfessors page, visit each professor's profile to collect their information, then return to the directory and do the same for the next professor. Once there are no more professors left to scrape on the page, it should find the next button's href value, go to that page, and follow the same procedure.

My scraper is able to scrape all of the professors on page 1 of the directory, but then it stops, because it never moves on to the next page.

Can you help my scraper find and follow the next page?

I tried to follow this StackOverflow question, but it was too specific to be of use here.

Best Answer

Your scraperSpider should inherit from CrawlSpider if you want to use the rules attribute; see the documentation here. Also note this warning from the documentation:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

Regarding python - Scrapy — scrape a page, then scrape the next page, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35587062/
