python - How to scrape the Google Play website using Scrapy

Closed 8 years ago. This question needs to be more focused and is not currently accepting answers.


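Context:

I am trying to scrape pages on the Google Play website.
When I open the page in a browser and scroll down, new apps/items are loaded. This is clearly an AJAX call.

Problem:

I don't know how to make Scrapy fetch the apps that appear when scrolling in a browser.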

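What I have tried:

I crawled the page and printed the response:

[image: the printed response]

As you can see, the response contains a loading indicator. It never shows up in a browser, because the browser triggers the AJAX call automatically.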

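Note:

I do know that Scrapy can issue the XHR (AJAX) requests directly, but I want my spider to keep crawling the page until there are no more apps, so the spider should discover the AJAX calls automatically, if that is possible.

I am using Python 2.7.9 and Scrapy 0.26 on Windows 7 64-bit.

Note 2:

I have already checked this thread

Many thanks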

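Best answer

Here is a basic approach (not very Pythonic) that shows one possible way to solve the problem with Selenium WebDriver.

The basic idea is:

Create a browser instance ( webdriver.Firefox() )
Make the instance load the page ( self.driver.get(response.url) )
While a target element is not visible, keep moving the focus within the page to that element

This way the page keeps loading more elements.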

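import time

import scrapy
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By


class GooglePlaySpider(scrapy.Spider):
    # A plain Spider: CrawlSpider uses parse() internally, so parse() must not be overridden there.
    name = "googleplay"
    allowed_domains = ["play.google.com"]
    start_urls = ["https://play.google.com"]

    def __init__(self, *args, **kwargs):
        super(GooglePlaySpider, self).__init__(*args, **kwargs)
        # Start a Firefox instance controlled by Selenium.
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Load the same URL in the browser so that the page's JavaScript runs.
        self.driver.get(response.url)
        footer = self.driver.find_element(By.CLASS_NAME, 'copyright')
        ActionChains(self.driver).move_to_element(footer).perform()

        # Keep moving to the footer until it is displayed.
        while not footer.is_displayed():
            footer = self.driver.find_element(By.CLASS_NAME, 'copyright')
            time.sleep(3)  # give the page time to load new content
            ActionChains(self.driver).move_to_element(footer).perform()

        # Scrape from here, e.g. using self.driver.page_source

    def closed(self, reason):
        # Shut the browser down when the spider finishes.
        self.driver.quit()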

When the loop ends, you can be sure the whole page has been loaded, and you can write the code that scrapes the content.
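As a rough illustration of that last step, the sketch below pulls app ids out of the rendered HTML using only the standard library (written for Python 3). The names AppLinkParser and extract_app_ids are hypothetical, and the link shape /store/apps/details?id=... is an assumption about the page markup that should be checked against the real page source.

```python
from html.parser import HTMLParser


class AppLinkParser(HTMLParser):
    """Collects the unique app ids found in links to app detail pages."""

    # Assumed link shape; adjust it to whatever the rendered page really uses.
    PREFIX = "/store/apps/details?id="

    def __init__(self):
        super().__init__()
        self.app_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if self.PREFIX in href:
            # Keep only the id, dropping any further query parameters.
            app_id = href.split(self.PREFIX, 1)[1].split("&", 1)[0]
            if app_id and app_id not in self.app_ids:
                self.app_ids.append(app_id)


def extract_app_ids(html):
    """Return the app ids linked from the given HTML, in document order."""
    parser = AppLinkParser()
    parser.feed(html)
    return parser.app_ids
```

Inside parse(), this could be called as extract_app_ids(self.driver.page_source) once the loop has finished. Feeding the page source to a Scrapy Selector would work just as well; the standard-library parser is used here only to keep the sketch free of dependencies.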

Documentation here: http://selenium-python.readthedocs.org/en/latest/navigating.html
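A common variation on the loop above, which is not the method used in this answer, is to scroll to the bottom with JavaScript and stop once the page height no longer grows. The sketch below assumes that new items make document.body.scrollHeight increase; scroll_until_stable is a hypothetical helper name, while execute_script is the regular WebDriver method for running JavaScript in the page.

```python
import time


def scroll_until_stable(driver, pause=3.0, max_rounds=50):
    """Scroll to the bottom until the page height stops changing.

    Returns the number of scrolls after which the page had grown.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    grew = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the AJAX request time to finish
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, assume the end of the list
        last_height = new_height
        grew += 1
    return grew
```

In the spider, scroll_until_stable(self.driver) would be called right after self.driver.get(response.url). The height check is only a heuristic: a slow response or a "show more" button can make it stop early, so max_rounds and pause may need tuning.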

Regarding "python - How to scrape the Google Play website using Scrapy", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28570321/
