javascript - Scraping the Android store

I am trying to scrape the Android store page with Beautiful Soup in order to get a file containing a list of packages. Here is my code:

from requests import get
from bs4 import BeautifulSoup

url = 'https://play.google.com/store/apps/collection/topselling_free'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')

# Each app card keeps its package name in the data-docid attribute of its first div.
app_container = html_soup.find_all('div', class_="card no-rationale square-cover apps small")

# Iterate over the cards that were actually found. The static HTML holds only 60,
# so a fixed range above 60 raises "IndexError: list index out of range".
with open("applications.txt", "w") as file:
    for app in app_container:
        print(app.div['data-docid'])
        file.write(app.div['data-docid'] + "\n")

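The problem is that I can only collect 60 package names, because the JavaScript is not executed; to load more applications I would have to scroll down. How can I reproduce this behavior in Python to get more than 60 results?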

Best answer

Would you consider using a more fully featured scraper? Scrapy is designed for this kind of work: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016

Selenium is like driving a browser with code: if you can do it by hand, you can probably do it in Selenium. See: scrape websites with infinite scrolling
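As a rough illustration of the Selenium approach, here is a minimal sketch that keeps scrolling until the page height stops growing and then hands the rendered HTML to Beautiful Soup. The helper names scroll_until_stable and collect_package_names are hypothetical, the use of Chrome is an assumption, and the sketch assumes the page still exposes package names through a data-docid attribute as in the question's code; the store's markup may have changed since.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is anything with an execute_script(script) method,
    such as a Selenium WebDriver. Returns the final page height.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we assume we reached the end
        last_height = new_height
    return last_height


def collect_package_names(url):
    """Hypothetical helper: render the page, scroll, and read data-docid values."""
    # Imported here so scroll_until_stable can be used without these packages.
    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        scroll_until_stable(driver)
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()
    # Assumption: package names still live in data-docid attributes.
    names = [tag["data-docid"] for tag in soup.find_all(attrs={"data-docid": True})]
    return list(dict.fromkeys(names))  # drop duplicates, keep order
```

Calling collect_package_names with the collection URL from the question would return the de-duplicated package names, which can then be written to applications.txt as before. The fixed pause is the simplest choice; an explicit wait on a page element would be more robust if the page loads slowly.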

Others have concluded that bs4 and requests are not enough for this task: How to load all entries in an infinite scroll at once to parse the HTML in python

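Also note that scraping can be something of a gray area, and you should always make an effort to understand and respect the site's policies. Their terms of service and robots.txt are always good places to read carefully.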
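For the robots.txt part, Python's standard library can do the check for you. The sketch below uses urllib.robotparser; the helper names is_allowed and load_robots are hypothetical, and the sample rules are invented for illustration rather than taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def load_robots(base_url):
    """Hypothetical helper: download and parse base_url + '/robots.txt'."""
    parser = RobotFileParser()
    parser.set_url(base_url.rstrip("/") + "/robots.txt")
    parser.read()  # performs a network request
    return parser


# Invented example rules, not any real site's policy.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]
print(is_allowed(sample_rules, "https://example.com/private/page"))
print(is_allowed(sample_rules, "https://example.com/public/page"))
```

A robots.txt check only covers crawler rules; the terms of service still need to be read by a person.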

Regarding "javascript - Scraping the Android store", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53210299/
