python - Unable to extract #document from an HTML file via Python web scraping

Tags: python html web-scraping beautifulsoup

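When I inspect elements in the browser, I can clearly see the full content of the web page. But when I run the script below, some of that content is missing from the output. In the browser I can see "#document" elements in the page, and these are exactly what is missing when I run the script. How can I view the contents of the #document elements, or extract them with a script?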

from bs4 import BeautifulSoup
import requests

response = requests.get('http://123.123.123.123/')
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

Best answer

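The "#document" nodes the browser shows are the separate documents loaded inside the page's frame elements, while requests.get() returns only the HTML of the outer page. You need to make additional requests to fetch the content of each frame page: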

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://123.123.123.123/'

# Reuse one session so cookies and connections carry over between requests.
with requests.Session() as session:
    response = session.get(BASE_URL)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Each <frame> inside the <frameset> points to a separate document.
    for frame in soup.select("frameset frame"):
        # The src may be relative, so resolve it against the base URL.
        frame_url = urljoin(BASE_URL, frame["src"])

        response = session.get(frame_url)
        frame_soup = BeautifulSoup(response.content, 'html.parser')
        print(frame_soup.prettify())
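
The URL-resolution step above can be checked offline, without access to the server. The following is only a minimal sketch under some assumptions: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages, it also picks up iframe tags (the code above handles only frame inside frameset), and the names FrameSrcCollector, frame_urls and SAMPLE_HTML, as well as the sample markup itself, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSrcCollector(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(html, base_url):
    """Return the absolute URL of each framed document, in page order."""
    collector = FrameSrcCollector()
    collector.feed(html)
    return [urljoin(base_url, src) for src in collector.sources]


# Hypothetical outer page: it holds only frame references, not their content.
SAMPLE_HTML = """
<html>
  <frameset cols="20%,80%">
    <frame src="menu.html">
    <frame src="/status/main.html">
  </frameset>
</html>
"""

if __name__ == "__main__":
    for url in frame_urls(SAMPLE_HTML, "http://123.123.123.123/"):
        print(url)
```

Each URL returned by frame_urls could then be passed to session.get() exactly as in the loop above. If a framed page itself contains further frames, the same step would have to be repeated on its HTML.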

Regarding "python - Unable to extract #document from an HTML file via Python web scraping", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42952404/
