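python - Selecting dependent dropdown menus with scrapy-splash

Tags: python web-scraping scrapy scrapy-splash splash-js-render

I am trying to scrape the following site: https://www.climatempo.com.br/climatologia/558/saopaulo-sp . It has two dropdown menus, and the second depends on the first, so I chose to use Scrapy with Splash via scrapy-splash.

I need to change the location automatically by first selecting the state and then the city. I tried SplashFormRequest but could not get the city list to change. My spider is (the prints are for debugging):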

import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest


class ExampleSpider(scrapy.Spider):
    name = 'climatologia'

    def start_requests(self):
        urls = ['https://www.climatempo.com.br/climatologia/558/saopaulo-sp']
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 0.5},)

    def parse(self, response):
        print(response.url)
        state = response.css("select.slt-geo")[0].css("option::attr(value)").extract()
        print(state)

        return SplashFormRequest(response.url, method='POST',
                                 formdata={'sel-state-geo': 'SP'},
                                 callback=self.state_selected,
                                 args={'wait': 0.5})

    def state_selected(self, response):
        print('\t:+)\t:+)\t:+)\t:+)\t:+)\t:+)')
        print(response.css("select.slt-geo")[0].css("option::text").extract())
        print(response.css("select.slt-geo")[1].css("option::text").extract())

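Best answer

If you absolutely have to use the site's menus, I would recommend Selenium for the job. The only way to script Splash is with Lua: you have to send the request to the execute endpoint and supply a Lua script. I found the options you were trying to select, but not where the form is submitted or how it operates on the site. I did have to translate the page to English, though.

My suggestion is to look for endpoints in the browser inspector. This is one of several that look particularly interesting: https://www.climatempo.com.br/json/busca-estados

That endpoint returns JSON like the following: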
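For reference, a Lua script sent to Splash's execute endpoint might look like the sketch below. It is untested against this site. The selector `select.slt-geo` comes from the question; the `state_uf` argument, the helper names `build_splash_args` and `make_request`, and the assumptions that the option values are state abbreviations such as 'SP' and that the page refills the city list on a `change` event are guesses made for illustration.

```python
# Lua script for Splash's "execute" endpoint. It loads the page, sets the
# first <select class="slt-geo"> to the requested state, fires a "change"
# event so the page's own JavaScript can react, waits, and returns the HTML.
LUA_SELECT_STATE = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))

  local select_state = splash:jsfunc([[
    function (uf) {
      var sel = document.querySelectorAll('select.slt-geo')[0];
      if (!sel) { return false; }
      sel.value = uf;
      var evt = document.createEvent('HTMLEvents');
      evt.initEvent('change', true, false);
      sel.dispatchEvent(evt);
      return true;
    }
  ]])

  assert(select_state(args.state_uf), "state <select> not found")
  assert(splash:wait(args.wait))  -- give the city list time to reload
  return splash:html()
end
"""


def build_splash_args(state_uf, wait=0.5):
    """Build the `args` dict for SplashRequest(..., endpoint='execute')."""
    # state_uf is a hypothetical argument name; it is read as args.state_uf in Lua.
    return {"lua_source": LUA_SELECT_STATE, "state_uf": state_uf, "wait": wait}


def make_request(url, callback, state_uf):
    """Return a SplashRequest that runs the Lua script above."""
    # Imported lazily so build_splash_args can be used without scrapy-splash.
    from scrapy_splash import SplashRequest
    return SplashRequest(url, callback=callback, endpoint="execute",
                         args=build_splash_args(state_uf))
```

In the spider from the question, `return make_request(response.url, self.state_selected, 'SP')` would take the place of the SplashFormRequest. Whether the city list actually changes depends on how the site wires up its menus, which the answer above could not determine.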

{"success":true,"message":"Resultados encontrados","time":"2017-11-30 16:05:20","totalRows":null,"totalPages":null,"page":null,"data":[
{"idlocale":338,"idstate":31,"uf":"AC","state":"Acre","region":"N","latitude":null,"longitude":null},
{"idlocale":339,"idstate":49,"uf":"AL","state":"Alagoas","region":"NE","latitude":null,"longitude":null},
{"idlocale":340,"idstate":41,"uf":"AM","state":"Amazonas","region":"N","latitude":null,"longitude":null},
{"idlocale":341,"idstate":30,"uf":"AP","state":"Amap\u00e1","region":"N","latitude":null,"longitude":null},
{"idlocale":342,"idstate":56,"uf":"BA","state":"Bahia","region":"NE","latitude":null,"longitude":null},
{"idlocale":343,"idstate":44,"uf":"CE","state":"Cear\u00e1","region":"NE","latitude":null,"longitude":null},
{"idlocale":344,"idstate":47,"uf":"DF","state":"Distrito Federal","region":"CO","latitude":null,"longitude":null},
{"idlocale":345,"idstate":45,"uf":"ES","state":"Esp\u00edrito Santo","region":"SE","latitude":null,"longitude":null},
{"idlocale":346,"idstate":54,"uf":"GO","state":"Goi\u00e1s","region":"CO","latitude":null,"longitude":null},
{"idlocale":347,"idstate":52,"uf":"MA","state":"Maranh\u00e3o","region":"NE","latitude":null,"longitude":null},
{"idlocale":348,"idstate":53,"uf":"MG","state":"Minas Gerais","region":"SE","latitude":null,"longitude":null},
{"idlocale":349,"idstate":39,"uf":"MS","state":"Mato Grosso do Sul","region":"CO","latitude":null,"longitude":null},
{"idlocale":350,"idstate":40,"uf":"MT","state":"Mato Grosso","region":"CO","latitude":null,"longitude":null},
{"idlocale":351,"idstate":50,"uf":"ND","state":"N\u00e3o Aplic\u00e1vel","region":"ND","latitude":null,"longitude":null},
{"idlocale":352,"idstate":55,"uf":"PA","state":"Par\u00e1","region":"N","latitude":null,"longitude":null},
{"idlocale":353,"idstate":37,"uf":"PB","state":"Para\u00edba","region":"NE","latitude":null,"longitude":null},
{"idlocale":354,"idstate":29,"uf":"PE","state":"Pernambuco","region":"NE","latitude":null,"longitude":null},
{"idlocale":355,"idstate":33,"uf":"PI","state":"Piau\u00ed","region":"NE","latitude":null,"longitude":null},
{"idlocale":356,"idstate":32,"uf":"PR","state":"Paran\u00e1","region":"S","latitude":null,"longitude":null},
{"idlocale":357,"idstate":46,"uf":"RJ","state":"Rio de Janeiro","region":"SE","latitude":null,"longitude":null},
{"idlocale":358,"idstate":35,"uf":"RN","state":"Rio Grande do Norte","region":"NE","latitude":null,"longitude":null},
{"idlocale":359,"idstate":38,"uf":"RO","state":"Rond\u00f4nia","region":"N","latitude":null,"longitude":null},
{"idlocale":360,"idstate":43,"uf":"RR","state":"Roraima","region":"N","latitude":null,"longitude":null},
{"idlocale":361,"idstate":48,"uf":"RS","state":"Rio Grande do Sul","region":"S","latitude":null,"longitude":null},
{"idlocale":362,"idstate":36,"uf":"SC","state":"Santa Catarina","region":"S","latitude":null,"longitude":null},
{"idlocale":363,"idstate":51,"uf":"SE","state":"Sergipe","region":"NE","latitude":null,"longitude":null},
{"idlocale":364,"idstate":34,"uf":"SP","state":"S\u00e3o Paulo","region":"SE","latitude":null,"longitude":null},
{"idlocale":365,"idstate":42,"uf":"TO","state":"Tocantins","region":"N","latitude":null,"longitude":null}
]}
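A payload of this shape can be reduced to a state lookup with the standard library. The following is a minimal sketch: `parse_states` is a hypothetical helper name, and the sample string is a shortened copy of the response shown above, keeping only the São Paulo entry.

```python
import json


def parse_states(payload):
    """Return {uf: state name} from the JSON text of the states endpoint."""
    doc = json.loads(payload)
    # The response carries a "success" flag; treat anything else as empty.
    if not doc.get("success"):
        return {}
    return {row["uf"]: row["state"] for row in doc.get("data") or []}


# Shortened sample based on the response above (one entry only).
sample = (
    r'{"success":true,"message":"Resultados encontrados","data":['
    r'{"idlocale":364,"idstate":34,"uf":"SP","state":"S\u00e3o Paulo","region":"SE"}]}'
)

if __name__ == "__main__":
    print(parse_states(sample))
```

The `uf` codes returned here are presumably the same values the state dropdown uses, though that is not confirmed by the answer.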

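Hopefully this is another way to get the data you are looking for?

You can then fetch the data with ordinary requests, made the same way the browser makes them. Usually adding the Accept, User-Agent, and X-Requested-With headers is enough to get through.

Regarding python - Selecting dependent dropdown menus with scrapy-splash, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47575155/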
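As a rough sketch of such a request, the code below uses only the standard library. The header values (in particular the User-Agent string) and the helper names `browser_like_headers` and `fetch_json` are illustrative assumptions, and the endpoint may have changed since the answer was written.

```python
import json
import urllib.request

# Endpoint mentioned in the answer above.
STATES_URL = "https://www.climatempo.com.br/json/busca-estados"


def browser_like_headers():
    """Headers that make the request resemble the browser's AJAX call."""
    return {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        # Example value only; any realistic browser User-Agent should do.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/57.0",
        "X-Requested-With": "XMLHttpRequest",
    }


def fetch_json(url, timeout=10):
    """GET `url` with the headers above and decode the JSON body."""
    request = urllib.request.Request(url, headers=browser_like_headers())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Performs a real network request when run as a script.
    print(fetch_json(STATES_URL).get("message"))
```

Inside a Scrapy spider the same dictionary can be passed as `scrapy.Request(url, headers=browser_like_headers(), callback=...)`, with the body decoded via `json.loads(response.text)` in the callback.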
