python - How do I get the HTML comments in a page when scraping?

Tags: python html selenium web-scraping scrapy

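Here is the problem. I am trying to scrape the birthday from this Facebook page. When I view the page source in a browser, the birthday appears inside an HTML comment within a <div class="hidden_elem">.

However, when I fetch the page source with a GET request (using selenium, scrapy, or requests), I only get an empty <div class="hidden_elem">. The comment is nowhere to be seen, so there is nothing to parse for the information.

So how can I get this text? If possible, please explain how to extract the birthday.

The Facebook page may be using some JavaScript that cleverly causes this. How can I work around it?

This is the URL I am trying to get the birthday from: https://www.facebook.com/profile.php?id=100004456147835&sk=about

In the browser's page source, it looks like this: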

<div class="hidden_elem"><code id="u_0_2g"><!-- <ul class="uiList _54nz _4kg _4kt" data-pnref="about"><li><div class="_5aj7"><div class="_4bl9"><div class="_54n- _2pi3"><div id="u_0_2e"></div></div></div><div class="_4bl7"><div class="_4ms4" id="u_0_2a"><div class="clearfix _ikh _5c0g" data-pnref="overview" id="u_0_2f"><div class="_4bl7"><ul class="uiList _1pi3 _4kg _6-h _703 _4ks"><li class="_3pw9 _2pi4"><div class="clearfix _4bbo" role="button" tabindex="0"><div class="_5rsw _3-91 _8o lfloat _ohe"><i class="_5rsx img sp_yw06AF9sktb sx_344683"></i></div><div class="_42ef"><div class="_6a"><div class="_6a _6b" style="height:36px"></div><div class="_6a _6b"><span class="_50f8 _2iem">No workplaces to show</span></div></div></div></div></li><li id="u_0_2b"><div class="clearfix _5y02" data-overviewsection="education" role="button" tabindex="0"><a class="_5uat _3-91 _8o lfloat _ohe" tabindex="-1" aria-hidden="true" href="https://www.facebook.com/pages/Cambridge-Institute-of-technolagy/133870693705509" data-hovercard="/ajax/hovercard/page.php?id=133870693705509" data-hovercard-prefer-more-content-show="1"><img class="_s0 _4ooo _54ru img" src="https://scontent.fblr6-1.fna.fbcdn.net/v/t1.0-1/c9.0.32.32/p32x32/580846_10149999285985791_1565762244_n.png?oh=d4ccc6a667e53f20db9cf60c0742f989&amp;oe=5B1420C5" alt="" aria-label="Cambridge Institute of technolagy" role="img" /></a><div class="_42ef"><div class="_6a _5u5j _6b"><div class="_c24 _50f4">Studies at <a class="profileLink" href="https://www.facebook.com/pages/Cambridge-Institute-of-technolagy/133870693705509" data-hovercard="/ajax/hovercard/page.php?id=133870693705509" data-hovercard-prefer-more-content-show="1">Cambridge Institute of technolagy</a></div><div><div><div class="_50f8 _2ieq"><div class="fsm fwn fcg">Past: <a class="profileLink" href="https://www.facebook.com/deekshaintegrated/" data-hovercard="/ajax/hovercard/page.php?id=176180289071224" data-hovercard-prefer-more-content-show="1">Deeksha Integrated</a> and 
<a class="profileLink" href="https://www.facebook.com/pages/chethana-vidya-mandiratumkur/378826618888908" data-hovercard="/ajax/hovercard/page.php?id=378826618888908" data-hovercard-prefer-more-content-show="1">chethana vidya mandira,tumkur</a></div></div></div></div></div></div></div></li><li id="u_0_2c"><div class="clearfix _5y02" data-overviewsection="places" role="button" tabindex="0"><a class="_5uat _3-91 _8o lfloat _ohe" tabindex="-1" aria-hidden="true" href="https://www.facebook.com/pages/Bangalore-India/106377336067638" data-hovercard="/ajax/hovercard/page.php?id=106377336067638" data-hovercard-prefer-more-content-show="1"><img class="_s0 _4ooo _54ru img" src="https://external.fblr6-1.fna.fbcdn.net/safe_image.php?d=AQCKH3kcP1-A2NPe&amp;w=32&amp;h=32&amp;url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F8%2F80%2FBangaloreMontage.png&amp;cfs=1&amp;fallback=hub_city&amp;f&amp;_nc_hash=AQDbJ1ytdhSz3E8E" alt="" aria-label="Bangalore, India" role="img" /></a><div class="_42ef"><div class="_6a _5u5j _6b"><div class="_c24 _50f4">Lives in <a class="profileLink" href="https://www.facebook.com/pages/Bangalore-India/106377336067638" data-hovercard="/ajax/hovercard/page.php?id=106377336067638" data-hovercard-prefer-more-content-show="1">Bangalore, India</a></div><div><div><div class="_50f8 _2ieq"><div class="fsm fwn fcg"><span id="u_0_2d">From <span class="fwb"><a class="profileLink" href="https://www.facebook.com/pages/Tumkur/106525352717093" data-hovercard="/ajax/hovercard/page.php?id=106525352717093" data-hovercard-prefer-more-content-show="1">Tumkur</a></span></span></div></div></div></div></div></div></div></li><li class="_3pw9 _2pi4"><div class="clearfix _4bbo" role="button" tabindex="0"><div class="_5rsw _3-91 _8o lfloat _ohe"><i class="_5rsx img sp_yw06AF9sktb sx_585866"></i></div><div class="_42ef"><div class="_6a"><div class="_6a _6b" style="height:36px"></div><div class="_6a _6b"><span class="_50f8 _2iem">No relationship info to 
show</span></div></div></div></div></li></ul></div><div class="_4bl9 _zu9"><ul class="uiList _5yql _4kg" data-overviewsection="contact_basic" role="button" tabindex="0"><li class="_4tnv _2pif"><div class="clearfix _ikh"><div class="_4bl7"><div class="_pvf _5pmc"><i class="img sp_yw06AF9sktb sx_e0cf75"></i></div></div><div class="_4bl9 _2pis _2dbl"><span class="_c24 _2ieq"><div><span class="accessible_elem">Birthday</span></div><div>April 28, 1998</div></span></div></div></li></ul></div></div></div></div></div></li></ul> --></code></div>

When I fetch the page source from a script, only <div class="hidden_elem"> </div> appears.

Best answer

You can do this with BeautifulSoup.

Try this:

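from bs4 import BeautifulSoup, Comment

# html is assumed to hold the page source as a string
soup = BeautifulSoup(html, 'lxml')
# HTML comments are parsed as Comment nodes; print each one
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    print(comment)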
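
The loop in the answer prints each comment as raw text. To get the birthday itself, the comment body can be parsed a second time as HTML. The following is a minimal sketch, not a definitive implementation. It assumes the markup has the same nesting as the snippet in the question (a "Birthday" label in a span with class accessible_elem, followed by a sibling div holding the date). SAMPLE_HTML and extract_birthday are hypothetical names introduced for illustration, and the built-in html.parser is used instead of lxml to avoid an extra dependency. It only works on source that actually contains the comment, such as the source seen in the browser.

```python
from bs4 import BeautifulSoup, Comment

# Shortened stand-in for the markup shown in the question
# (assumption: the real page keeps the same nesting).
SAMPLE_HTML = (
    '<div class="hidden_elem"><code id="u_0_2g"><!-- '
    '<ul><li><span class="_c24 _2ieq">'
    '<div><span class="accessible_elem">Birthday</span></div>'
    '<div>April 28, 1998</div>'
    '</span></li></ul> --></code></div>'
)


def extract_birthday(html):
    """Return the text that follows the 'Birthday' label inside any
    HTML comment in `html`, or None if no such label is found."""
    soup = BeautifulSoup(html, "html.parser")
    # Collect every comment node in the outer document.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # The comment body is itself HTML, so parse it again.
        inner = BeautifulSoup(str(comment), "html.parser")
        label = inner.find("span", class_="accessible_elem", string="Birthday")
        if label is None:
            continue
        # The date sits in the <div> right after the label's wrapper <div>.
        value = label.parent.find_next_sibling("div")
        if value is not None:
            return value.get_text(strip=True)
    return None


birthday = extract_birthday(SAMPLE_HTML)
print(birthday)
```

If the fetched source contains only the empty <div class="hidden_elem">, as described in the question, the function returns None because there is no comment to parse.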

Regarding "python - How do I get the HTML comments in a page when scraping?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48849183/
