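I'm trying to scrape all departures and arrivals for one day from all airports in all countries, using Python and Scrapy.
The JSON database used by this famous site (Flightradar) has to be queried page by page when an airport has more than 100 departures or arrivals. I also compute a timestamp based on the actual UTC day for the query.
I'm trying to create a database with this hierarchy: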
country 1
  - airport 1
    - departures
      - page 1
      - page ...
    - arrivals
      - page 1
      - page ...
  - airport 2
    - departures
      - page 1
      - page ...
    - arrivals
      - page 1
      - page ...
...
I use two methods to compute the timestamp and the query URL for each page:
def compute_timestamp(self):
    from datetime import datetime, date
    import calendar
    # +/- 24 hours
    d = date(2017, 4, 27)
    timestamp = calendar.timegm(d.timetuple())
    return timestamp

def build_api_call(self, code, page, timestamp):
    return 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin\[\]=&plugin-setting\[schedule\]\[mode\]=&plugin-setting\[schedule\]\[timestamp\]={timestamp}&page={page}&limit=100&token='.format(
        code=code, page=page, timestamp=timestamp)
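As a minimal standalone sketch of the same two steps, the snippet below can be run outside the spider to check the values. The names `utc_midnight_timestamp` and `build_airport_url` are illustrative and not part of the project, `TLV` is only an example airport code, and building the query with `urllib.parse.urlencode` (which percent-encodes the square brackets) is an assumption about what the API accepts, not something confirmed in the question.

```python
import calendar
from datetime import date
from urllib.parse import urlencode

API_BASE = "https://api.flightradar24.com/common/v1/airport.json"


def utc_midnight_timestamp(day):
    """Return the Unix timestamp of 00:00 UTC on the given date."""
    # calendar.timegm() treats the time tuple as UTC (time.mktime() would use local time).
    return calendar.timegm(day.timetuple())


def build_airport_url(code, page, timestamp, limit=100):
    """Build the schedule query URL for one airport and one page (hypothetical helper)."""
    params = [
        ("code", code),
        ("plugin[]", ""),
        ("plugin-setting[schedule][mode]", ""),
        ("plugin-setting[schedule][timestamp]", timestamp),
        ("page", page),
        ("limit", limit),
        ("token", ""),
    ]
    # urlencode() percent-encodes "[" and "]" as %5B and %5D.
    return API_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    ts = utc_midnight_timestamp(date(2017, 4, 27))
    print(ts)
    print(build_airport_url("TLV", 1, ts))
```

Keeping the parameters in a list of pairs makes it easier to see which values change per request (`code`, `page`, `timestamp`) and which stay fixed.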
I store the results in a CountryItem, which contains many AirportItem objects, one per airport. My item.py is:
import scrapy


class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    num_airports = scrapy.Field()
    airports = scrapy.Field()
    other_url = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)


class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    departures = scrapy.Field()
    arrivals = scrapy.Field()
My main parse method builds a CountryItem for every country (here I limit it to Israel, as an example). Next, I yield a scrapy.Request for each country to scrape its airports.
###################################
# MAIN PARSE
####################################
def parse(self, response):
    count_country = 0
    countries = []
    for country in response.xpath('//a[@data-country]'):
        item = CountryItem()
        url = country.xpath('./@href').extract()
        name = country.xpath('./@title').extract()
        item['link'] = url[0]
        item['name'] = name[0]
        item['airports'] = []
        count_country += 1
        if name[0] == "Israel":
            countries.append(item)
            self.logger.info("Country name : %s with link %s", item['name'], item['link'])
            yield scrapy.Request(url[0], meta={'my_country_item': item}, callback=self.parse_airports)
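This method scrapes the information for each airport, and also issues a scrapy.Request for each airport, using the airport URL, to scrape its departures and arrivals: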
###################################
# PARSE EACH AIRPORT
####################################
def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []
    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()
        iAirport = AirportItem()
        iAirport['name'] = self.clean_html(name)
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]
        item['airports'].append(iAirport)
    urls = []
    for airport in item['airports']:
        json_url = self.build_api_call(airport['code_little'], 1, self.compute_timestamp())
        urls.append(json_url)
    if not urls:
        return item
    # start with first url
    next_url = urls.pop()
    return scrapy.Request(next_url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': 0})
With the recursive method parse_schedule, I add each airport to the country item. SO members have already helped me on this point.
###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(self, response):
    """we want to loop this continuously to build every departure and arrivals requests"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']
    urls_departures, urls_arrivals = self.compute_urls_by_page(response, item['airports'][i]['name'], item['airports'][i]['code_little'])
    print("urls_departures = ", len(urls_departures))
    print("urls_arrivals = ", len(urls_arrivals))
    ## YIELD NOT CALLED
    yield scrapy.Request(response.url, self.parse_departures_page, meta={'airport_item': item, 'page_urls': urls_departures, 'i': 0, 'p': 0}, dont_filter=True)
    # now do next schedule items
    if not urls:
        yield item
        return
    url = urls.pop()
    yield scrapy.Request(url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
The self.compute_urls_by_page method computes the correct URLs to retrieve all departures and arrivals for one airport.
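The body of compute_urls_by_page is not shown in the question. As a rough sketch of the paging arithmetic only, assuming the first response tells you how many departures (or arrivals) exist in total, the page numbers to request could be derived as below. `page_numbers` and its `total_items` argument are hypothetical names introduced for illustration, not project code.

```python
import math


def page_numbers(total_items, limit=100):
    """Return the 1-based page numbers needed to cover total_items rows.

    Hypothetical helper: assumes the API serves at most `limit` rows per page.
    """
    if total_items <= 0:
        return []
    return list(range(1, math.ceil(total_items / limit) + 1))


if __name__ == "__main__":
    for total in (0, 100, 101, 250):
        print(total, "->", page_numbers(total))
```

Each page number can then be passed to build_api_call together with the airport code and the timestamp.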
###################################
# PARSE EACH DEPARTURES / ARRIVALS
###################################
def parse_departures_page(self, response):
    item = response.meta['airport_item']
    p = response.meta['p']
    i = response.meta['i']
    page_urls = response.meta['page_urls']
    print("PAGE URL = ", page_urls)
    if not page_urls:
        yield item
        return
    page_url = page_urls.pop()
    print("GET PAGE FOR ", item['airports'][i]['name'], ">> ", p)
    jsonload = json.loads(response.body_as_unicode())
    json_expression = jmespath.compile("result.response.airport.pluginData.schedule.departures.data")
    item['airports'][i]['departures'] = json_expression.search(jsonload)
    yield scrapy.Request(page_url, self.parse_departures_page, meta={'airport_item': item, 'page_urls': page_urls, 'i': i, 'p': p + 1})
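Next, the first yield in parse_schedule, which is supposed to call the recursive method self.parse_departures_page, produces strange results. Scrapy does call this method, but I only collect the departures page for one airport, and I don't understand why... I probably have an ordering error in my requests or in my yield code, so maybe you can help me find it.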
The complete code is on GitHub: https://github.com/IDEES-Rouen/Flight-Scrapping/tree/master/flight/flight_project
You can run it with the scrapy crawl airports command.
Update 1:
I tried to answer the question on my own using yield from, without success, as you can see in the answer at the bottom... any ideas?
Best answer
Yes, I finally found the answer here on SO...
When you use a recursive yield, you need to use yield from. Here is a simplified example:
airport_list = ["airport1", "airport2", "airport3", "airport4"]

def parse_page_departure(airport, next_url, page_urls):
    print(airport, " / ", next_url)
    if not page_urls:
        return
    next_url = page_urls.pop()
    yield from parse_page_departure(airport, next_url, page_urls)

###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(next_airport, airport_list):
    ## GET EACH DEPARTURE PAGE
    departures_list = ["p1", "p2", "p3", "p4"]
    next_departure_url = departures_list.pop()
    yield parse_page_departure(next_airport, next_departure_url, departures_list)
    if not airport_list:
        print("no new airport")
        return
    next_airport_url = airport_list.pop()
    yield from parse_schedule(next_airport_url, airport_list)

next_airport_url = airport_list.pop()
result = parse_schedule(next_airport_url, airport_list)
for i in result:
    print(i)
    for d in i:
        print(d)
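To make the difference between the two forms explicit, here is a small pure-Python sketch that does not depend on Scrapy. The function names `pages`, `nested` and `flattened` are made up for illustration. A plain `yield` of a generator call hands the caller the generator object itself, which is why the example above needs the inner `for d in i` loop, whereas `yield from` delegates to the inner generator and passes its values through one by one.

```python
def pages(airport, page_ids):
    """Yield one label per page of an airport."""
    for page_id in page_ids:
        yield "{} / {}".format(airport, page_id)


def nested(airports):
    # Plain yield: the caller receives each inner generator object, unconsumed.
    for airport in airports:
        yield pages(airport, ["p1", "p2"])


def flattened(airports):
    # yield from: the inner generator's values are passed straight to the caller.
    for airport in airports:
        yield from pages(airport, ["p1", "p2"])


if __name__ == "__main__":
    print([type(x).__name__ for x in nested(["airport1", "airport2"])])
    print(list(flattened(["airport1", "airport2"])))
```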
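UPDATE, this does not work with the real program:
I tried to reproduce the same yield from pattern with the real program here, but I get an error when using it on scrapy.Request, and I don't understand why...
Here is the Python traceback: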
Traceback (most recent call last):
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/reyman/Projets/Flight-Scrapping/flight/flight_project/spiders/AirportsSpider.py", line 209, in parse_schedule
    yield from scrapy.Request(url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
TypeError: 'Request' object is not iterable
2017-06-27 17:40:50 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-27 17:40:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
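The TypeError in the traceback follows from how `yield from` works: it expects an iterable (typically another generator), and a single request object is not one. The sketch below reproduces this without Scrapy. `FakeRequest` is a stand-in class invented for the example, not the real scrapy.Request, so it only illustrates the Python behaviour and is not a fix verified against the actual spider.

```python
class FakeRequest:
    """Stand-in for a request object: a plain, non-iterable value."""

    def __init__(self, url):
        self.url = url


def broken(urls):
    for url in urls:
        # yield from needs an iterable, so this raises TypeError when reached.
        yield from FakeRequest(url)


def fixed(urls):
    for url in urls:
        # A single object is handed to the caller with a plain yield.
        yield FakeRequest(url)


def delegating(urls):
    # yield from is appropriate when delegating to another generator.
    yield from fixed(urls)


if __name__ == "__main__":
    try:
        list(broken(["u1"]))
    except TypeError as exc:
        print("TypeError:", exc)
    print([r.url for r in delegating(["u1", "u2"])])
```

In other words, `yield from` fits a direct call to another generator function, while a request whose callback is invoked later is a single value and is yielded on its own.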
Regarding "python - Yield Request call produces weird results in a recursive method with Scrapy", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43667622/