python - Yield Request call gives strange results in a recursive scrapy method

Tags: python recursion web-scraping scrapy yield

I am trying to scrape all departures and arrivals for a single day, from every airport in every country, using Python and Scrapy.

The JSON feed used by this well-known site (Flightradar24) has to be queried page by page whenever an airport has more than 100 departures or arrivals. I also compute a timestamp based on the actual UTC date of the query.

I am trying to build a dataset with this hierarchy:

country 1
  - airport 1
    - departures
      - page 1
      - page ...
    - arrivals
      - page 1
      - page ...
  - airport 2
    - departures
      - page 1
      - page ...
    - arrivals
      - page 1
      - page ...
...

I use two methods to compute the timestamp and the paged query URLs:

def compute_timestamp(self):
    import calendar
    from datetime import date
    # midnight UTC of the chosen day (+/- 24 hours window)
    d = date(2017, 4, 27)
    timestamp = calendar.timegm(d.timetuple())
    return timestamp

def build_api_call(self, code, page, timestamp):
    # note: the brackets must not be backslash-escaped inside a Python string
    return ('https://api.flightradar24.com/common/v1/airport.json'
            '?code={code}&plugin[]=&plugin-setting[schedule][mode]='
            '&plugin-setting[schedule][timestamp]={timestamp}'
            '&page={page}&limit=100&token=').format(
        code=code, page=page, timestamp=timestamp)

I store the results in a CountryItem, which itself contains many AirportItem. My item.py is:

class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    num_airports = scrapy.Field()
    airports = scrapy.Field()
    other_url= scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    departures = scrapy.Field()
    arrivals = scrapy.Field()

My main parse builds a Country item for every country (here I limit it to Israel as an example). Next, I yield a scrapy.Request for each country to scrape its airports.

###################################
# MAIN PARSE
####################################
def parse(self, response):
    count_country = 0
    countries = []
    for country in response.xpath('//a[@data-country]'):
        item = CountryItem()
        url =  country.xpath('./@href').extract()
        name = country.xpath('./@title').extract()
        item['link'] = url[0]
        item['name'] = name[0]
        item['airports'] = []
        count_country += 1
        if name[0] == "Israel":
            countries.append(item)
            self.logger.info("Country name : %s with link %s" , item['name'] , item['link'])
            yield scrapy.Request(url[0],meta={'my_country_item':item}, callback=self.parse_airports)

This method scrapes the information for every airport, then issues one scrapy.Request per airport, with the airport URL, to scrape departures and arrivals:

###################################
# PARSE EACH AIRPORT
####################################
def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []

    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()
        iAirport = AirportItem()
        iAirport['name'] = self.clean_html(name)
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]

        item['airports'].append(iAirport)

    urls = []
    for airport in item['airports']:
        json_url = self.build_api_call(airport['code_little'], 1, self.compute_timestamp())
        urls.append(json_url)
    if not urls:
        return item

    # start with first url
    next_url = urls.pop()
    return scrapy.Request(next_url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': 0})
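The control flow here (pop one URL, carry the remaining list through `meta`, recurse in the callback) can be simulated without Scrapy; the function below is a hypothetical stand-in, not part of the spider:

```python
def crawl_chain(urls, item):
    # Each call plays the role of one callback invocation:
    # consume one URL, then "request" the next one with the remaining list.
    if not urls:
        return item  # no URL left: the chain ends with the finished item
    next_url = urls.pop()
    item.setdefault('visited', []).append(next_url)  # stand-in for parsing a response
    return crawl_chain(urls, item)  # stand-in for scrapy.Request(next_url, callback=...)

result = crawl_chain(['u1', 'u2', 'u3'], {})
print(result['visited'])  # pop() takes from the end, so: ['u3', 'u2', 'u1']
```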

使用递归方法 parse_schedule 我将每个机场添加到国家项目。 SO成员已经help me关于这一点。

###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(self, response):
    """we want to loop this continuously to build every departure and arrivals requests"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']

    urls_departures, urls_arrivals = self.compute_urls_by_page(response, item['airports'][i]['name'], item['airports'][i]['code_little'])

    print("urls_departures = ", len(urls_departures))
    print("urls_arrivals = ", len(urls_arrivals))

    ## YIELD NOT CALLED
    yield scrapy.Request(response.url, self.parse_departures_page, meta={'airport_item': item, 'page_urls': urls_departures, 'i': 0, 'p': 0}, dont_filter=True)

    # now do next schedule items
    if not urls:
        yield item
        return
    url = urls.pop()

    yield scrapy.Request(url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})

The self.compute_urls_by_page method computes the correct URLs to retrieve all departures and arrivals for one airport.
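compute_urls_by_page itself is not shown in the question; a plausible sketch (the signature, the page arithmetic, and the 100-row limit are my assumptions, with a fake URL builder standing in for the real helper) is:

```python
import math

def compute_urls_by_page(total_departures, total_arrivals, build_url, code, timestamp, limit=100):
    # One URL per page of at most `limit` rows, on each side of the schedule
    dep_pages = max(1, math.ceil(total_departures / limit))
    arr_pages = max(1, math.ceil(total_arrivals / limit))
    urls_departures = [build_url(code, p, timestamp) for p in range(1, dep_pages + 1)]
    urls_arrivals = [build_url(code, p, timestamp) for p in range(1, arr_pages + 1)]
    return urls_departures, urls_arrivals

fake_build = lambda code, page, ts: 'page-{}-{}'.format(code, page)
deps, arrs = compute_urls_by_page(250, 80, fake_build, 'TLV', 0)
print(len(deps), len(arrs))  # 3 1
```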

###################################
# PARSE EACH DEPARTURES / ARRIVALS
###################################
def parse_departures_page(self, response):
    item = response.meta['airport_item']
    p = response.meta['p']
    i = response.meta['i']
    page_urls = response.meta['page_urls']

    print("PAGE URL = ", page_urls)

    if not page_urls:
        yield item
        return
    page_url = page_urls.pop()

    print("GET PAGE FOR  ", item['airports'][i]['name'], ">> ", p)

    jsonload = json.loads(response.body_as_unicode())
    json_expression = jmespath.compile("result.response.airport.pluginData.schedule.departures.data")
    item['airports'][i]['departures'] = json_expression.search(jsonload)

    yield scrapy.Request(page_url, self.parse_departures_page, meta={'airport_item': item, 'page_urls': page_urls, 'i': i, 'p': p + 1})

Next, the first yield in parse_schedule, which normally calls the recursive method self.parse_departures_page, gives strange results. Scrapy does call the method, but I only collect the departure pages of a single airport, and I don't understand why... There is probably an ordering mistake somewhere in my requests or my yields, so maybe you can help me find it.

The full code is on GitHub: https://github.com/IDEES-Rouen/Flight-Scrapping/tree/master/flight/flight_project

You can run it with the scrapy crawl airports command.

Update 1:

I tried to answer the question on my own using yield from, without success, as you can see in the answer at the bottom... so if you have an idea?

Best Answer

Yes, I finally found the answer here on SO...

When you use a recursive yield, you need to use yield from. Here is a simplified example:

airport_list = ["airport1", "airport2", "airport3", "airport4"]

def parse_page_departure(airport, next_url, page_urls):
    print(airport, " / ", next_url)

    if not page_urls:
        return

    next_url = page_urls.pop()
    yield from parse_page_departure(airport, next_url, page_urls)

###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(next_airport, airport_list):

    ## GET EACH DEPARTURE PAGE

    departures_list = ["p1", "p2", "p3", "p4"]

    next_departure_url = departures_list.pop()
    yield parse_page_departure(next_airport,next_departure_url, departures_list)

    if not airport_list:
        print("no new airport")
        return

    next_airport_url = airport_list.pop()

    yield from parse_schedule(next_airport_url, airport_list)

next_airport_url = airport_list.pop()
result = parse_schedule(next_airport_url, airport_list)

for i in result:
    print(i)
    for d in i:
        print(d)

Update, it doesn't work with the real program:

I tried to reproduce the same yield from pattern with the real program here, but using it on scrapy.Request gives me an error, and I don't understand why...

Here is the Python traceback:

Traceback (most recent call last):
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/reyman/Projets/Flight-Scrapping/flight/flight_project/spiders/AirportsSpider.py", line 209, in parse_schedule
    yield from scrapy.Request(url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
TypeError: 'Request' object is not iterable
2017-06-27 17:40:50 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-27 17:40:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
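The traceback pins down the distinction: `yield from` delegates to an iterable (typically another generator), while a `scrapy.Request` is a single object and has to be `yield`ed on its own. A minimal stub (the `Request` class below is a stand-in, not Scrapy's) reproduces both sides:

```python
class Request:
    # Minimal stand-in for scrapy.Request: a plain, non-iterable object
    def __init__(self, url):
        self.url = url

def bad():
    # 'yield from' tries to iterate the Request -> TypeError at runtime
    yield from Request("https://example.com")

def good():
    # a single Request must simply be yielded
    yield Request("https://example.com")

try:
    list(bad())
except TypeError as e:
    print(e)  # 'Request' object is not iterable

print(next(good()).url)  # https://example.com
```

So in the spider the request must stay a plain `yield scrapy.Request(...)`; `yield from` only belongs where you call another generator directly yourself.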

Regarding python - Yield Request call gives strange results in a recursive scrapy method, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43667622/
