python - How to properly scrape a JavaScript-based website?

Tags: python python-3.x selenium geckodriver

I am testing the code below.

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.accept_untrusted_certs = True
import time

browser = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe")
wd = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe", firefox_profile=profile)
url = "https://corp_intranet"
wd.get(url)

# set username
time.sleep(2)
username = wd.find_element_by_id("id_email")
username.send_keys("my_email@corp.com")

# set password
password = wd.find_element_by_id("id_password")
password.send_keys("my_password")


url=("https://corp_intranet")
r = requests.get(url)
content = r.content.decode('utf-8')
print(BeautifulSoup(content, 'html.parser'))

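This logs in to my company intranet fine, but it prints only very basic information. Pressing F12 shows that much of the data on the page is rendered with JavaScript. I did some research and tried to find a way to scrape what I actually see on screen, rather than a heavily watered-down version of it. Is there a way to dump all of the data displayed on the page? Thanks.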

Best answer

You are opening two browsers. Remove this line:

browser = webdriver.Firefox(executable_path="C:/Utility/geckodriver.exe")

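The problem is that you are logged in within Selenium but not within requests, because requests uses a separate session. The requests.get() call therefore fetches the page as an anonymous visitor, and it never executes the page's JavaScript either.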
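To illustrate the session point only, here is a minimal sketch of copying cookies from a Selenium browser into a requests session. The helper name `session_from_selenium_cookies` is hypothetical, and it assumes the site authenticates purely through cookies, which may not hold for a given intranet. Even with the cookies copied, requests still would not run JavaScript, so this does not replace reading the rendered page from Selenium.

```python
import requests


def session_from_selenium_cookies(cookies):
    """Build a requests.Session carrying cookies exported from Selenium.

    `cookies` is a list of dicts shaped like the return value of
    `driver.get_cookies()` (keys such as "name", "value", "domain", "path").
    """
    session = requests.Session()
    for cookie in cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session
```

With a logged-in driver this would be called as `session_from_selenium_cookies(wd.get_cookies())`, and later `session.get(url)` calls would send the same cookies as the browser.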

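from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

.....
.....
# no click on the login button? append "\n" to submit the form, or click the button
password.send_keys("my_password\n")

# wait up to 10 seconds until the element with ID "theID" is present on the logged-in page
WebDriverWait(wd, 10).until(EC.presence_of_element_located((By.ID, "theID")))

# page_source holds the DOM as rendered by the browser, JavaScript included
content = wd.page_source
print(BeautifulSoup(content, 'html.parser'))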

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/53402067/
