I am trying to scrape professor statistics from RateMyProfessors into the item defined in my items.py file:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class ScraperItem(Item):
    # define the fields for your item here like:
    numOfPages = Field()        # number of pages of professors (usually 476)
    firstMiddleName = Field()   # first (and middle) name
    lastName = Field()          # last name
    numOfRatings = Field()      # number of ratings
    overallQuality = Field()    # numerical rating
    averageGrade = Field()      # letter grade
    profile = Field()           # url of professor profile
    pass
Here is my scraper_spider.py file:
import scrapy
from scraper.items import ScraperItem
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor


class scraperSpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["www.ratemyprofessors.com"]
    start_urls = [
        "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName&schoolName=pennsylvania+state+university"
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="nextLink"]')),
             callback='parse', follow=True),
    )

    def parse(self, response):
        # professors = []
        numOfPages = int(response.xpath('((//a[@class="step"])[last()])/text()').extract()[0])

        # create array of profile links
        profiles = response.xpath('//li[@class="listing PROFESSOR"]/a/@href').extract()

        # for each of those links
        for profile in profiles:
            # define item
            professor = ScraperItem()

            # add profile to professor
            professor["profile"] = profile

            # pass each page to the parse_profile() method
            request = scrapy.Request("http://www.ratemyprofessors.com" + profile,
                                     callback=self.parse_profile)
            request.meta["professor"] = professor

            # add professor to array of professors
            yield request

    def parse_profile(self, response):
        professor = response.meta["professor"]

        if response.xpath('//*[@class="pfname"]'):
            # scrape each item from the link that was passed as an argument
            # and add to current professor
            professor["firstMiddleName"] = response.xpath('//h1[@class="profname"]/span[@class="pfname"][1]/text()').extract()

        if response.xpath('//*[@class="plname"]'):
            professor["lastName"] = response.xpath('//h1[@class="profname"]/span[@class="plname"]/text()').extract()

        if response.xpath('//*[@class="table-toggle rating-count active"]'):
            professor["numOfRatings"] = response.xpath('//div[@class="table-toggle rating-count active"]/text()').extract()

        if response.xpath('//*[@class="grade"]'):
            professor["overallQuality"] = response.xpath('//div[@class="breakdown-wrapper"]/div[@class="breakdown-header"][1]/div[@class="grade"]/text()').extract()

        if response.xpath('//*[@class="grade"]'):
            professor["averageGrade"] = response.xpath('//div[@class="breakdown-wrapper"]/div[@class="breakdown-header"][2]/div[@class="grade"]/text()').extract()

        return professor

    # add string to rule. linkextractor only gets "/showratings..",
    # not "ratemyprofessors.com/showratings"
My problem is in the scraper_spider.py file above. The spider is supposed to go to this RateMyProfessors page, visit each professor and collect their information, then return to the directory and move on to the next professor. Once there are no more professors left to scrape on the page, it should find the href of the "next" button, go to that page, and repeat the same process.
My scraper is able to scrape all of the professors on page 1 of the directory, but it stops there because it never moves on to the next page.
Can you help my scraper find and follow the next page?
I tried to follow this StackOverflow question, but it was too specific to be usable here.
Best answer
If you want to use the rules attribute, your scraperSpider should inherit from CrawlSpider. See the documentation here. Also note this warning from the docs:
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
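A minimal sketch of what that change could look like, assuming the same XPaths and ScraperItem from the question; the listing callback is renamed to a hypothetical parse_listing so it no longer collides with CrawlSpider's built-in parse, and parse_profile can stay exactly as in the question:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from scraper.items import ScraperItem


class ScraperSpider(CrawlSpider):  # inherit from CrawlSpider so that `rules` is applied
    name = "scraper"
    allowed_domains = ["www.ratemyprofessors.com"]
    start_urls = [
        "http://www.ratemyprofessors.com/search.jsp?queryBy=teacherName&schoolName=pennsylvania+state+university"
    ]

    rules = (
        # follow the "next" pagination link and hand every directory page
        # to parse_listing -- not parse, which CrawlSpider reserves for itself
        Rule(LinkExtractor(restrict_xpaths='//a[@class="nextLink"]'),
             callback='parse_listing', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider does not run rule callbacks on the start URL itself,
        # so route the first directory page through the same listing parser
        return self.parse_listing(response)

    def parse_listing(self, response):
        # collect every professor profile link on the current directory page
        profiles = response.xpath('//li[@class="listing PROFESSOR"]/a/@href').extract()
        for profile in profiles:
            professor = ScraperItem()
            professor["profile"] = profile
            request = scrapy.Request("http://www.ratemyprofessors.com" + profile,
                                     callback=self.parse_profile)
            request.meta["professor"] = professor
            yield request

    # parse_profile(self, response) unchanged from the question

The scrapy.contrib imports mirror the older Scrapy version used in the question; in current Scrapy releases the same classes are imported from scrapy.spiders and scrapy.linkextractors instead.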
Regarding "python - Scrapy - scrape a page and scrape the next page", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35587062/