I'm trying to parse the HTML of a web page that requires a login. I can fetch the HTML of an ordinary web page with this script:
from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup  # BeautifulSoup 4 package
webpage = urlopen('https://www.example.com')
soup = BeautifulSoup(webpage, 'html.parser')
print(soup)
# This prints the source of example.com
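Getting the source of a page that I have to be logged in to see is proving harder, though. I tried replacing ('https://www.example.com') with ('https://user:pass@example.com'), but I got an invalid URL error.
Does anyone know how I can do this? Thanks in advance.
Best answer
Selenium WebDriver ( http://seleniumhq.org/projects/webdriver/ ) may suit your needs. It drives a real browser, so you can log in to the page and then print the HTML it contains. Here is an example: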
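from selenium import webdriver
from selenium.webdriver.common.by import By
# Selenium 4 syntax; older releases used find_element_by_name(...)
# start a browser session, in this case Firefox
driver = webdriver.Firefox()
driver.get("http://example.com") # go to the URL of the login page
# locate the login form; pass each field's name attribute in place of ...
username_field = driver.find_element(By.NAME, ...) # get the username field
password_field = driver.find_element(By.NAME, ...) # get the password field
# log in
username_field.send_keys("username") # enter your username
password_field.send_keys("password") # enter your password
password_field.submit() # submit the form that contains this field
# print the HTML of the page the browser is on now
html = driver.page_source
print(html)
driver.quit() # close the browser when done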
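If the site uses a plain HTML login form without JavaScript, a lighter alternative to driving a browser may be to post the form yourself and keep the session cookie. The following is only a rough sketch using the standard library, not part of the Selenium approach above. The function names, the example URLs, and the form field names "username" and "password" are all assumptions; check the real form's HTML for the actual field names and target URL. Sites that require a CSRF token or a JavaScript-driven login will likely not work this way, in which case Selenium remains the better fit.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def build_login_request(login_url, username, password):
    """Build the POST request that a simple login form would submit."""
    # "username" and "password" are assumed field names, not taken from
    # any real site; replace them with the names used in the actual form.
    form_data = urlencode({"username": username, "password": password})
    return Request(login_url, data=form_data.encode("utf-8"), method="POST")


def fetch_logged_in_html(login_url, page_url, username, password):
    """Log in, then fetch a protected page with the same cookie jar."""
    # The opener stores cookies set by the login response and sends them
    # back on later requests, which is what keeps the session logged in.
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    with opener.open(build_login_request(login_url, username, password)) as resp:
        resp.read()  # the login response body is not needed here
    with opener.open(page_url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling fetch_logged_in_html with hypothetical URLs such as "https://www.example.com/login" and "https://www.example.com/account" would return the page's HTML as a string, which you could then hand to BeautifulSoup as in the question.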
For "Python: How do I parse the HTML of a web page that requires a login?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/9387500/