python - Extracting multiple posts from a single blog archive page with BeautifulSoup, without the scripts

Tags: python html python-2.7 html-parsing beautifulsoup

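I am trying to scrape the author, title, date, and post content from a series of WordPress and Blogger blog archive pages. I have saved the pages locally so that I am not pinging the server repeatedly. I have the other parts working, but I cannot seem to get all of the posts from each file without also getting the "Add to Any" or "Sociable" or other messy scripts from the bottom. Here is where I am.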

import urllib2
from bs4 import BeautifulSoup
import re

file_list = open ("hafiles.txt", "r")
posts_file = open ("haposts.txt","w")


for indurl in file_list:
    indurl = indurl.rstrip("\n")
    with open(indurl,"r") as ha_file:
     soup_ha = BeautifulSoup(ha_file)

    #works the second find gets rid of the sociable crap
    # this is the way it looks on the page <div class='post-body'>

    posts = soup_ha.find("div", class_="post-body").find_all("p")


    #tried a trick i saw on http://stackoverflow.com/questions/24458353/cleaning-text-string-after-getting-body-text-using-beautifulsoup
    #no joy
    #posts = soup_ha.find("div", class_="post-body")
    #text = [''.join(s.findAll(text=True))for s in posts.findAll('p')] 
    text = str(posts) + "\n" + "\n"
    posts_file.write (text)

print ("All done!")



file_list.close()
posts_file.close()

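So if I use find_all and get all the posts (I am not even sure I am really getting all of them), then I get the scripts too. If I use just find, I can at least get nice posts without the scripts, in two different ways. I have a list of files, and each file has several posts to extract. I have searched Stack Overflow and the web.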

ETA: the input is a very messy web page with a lot of script at the top and all the CSS definitions in the page, and then

<div id='main-wrapper'>
<div class='main section' id='main'><div class='widget Blog' id='Blog1'>
<div class='blog-posts'>
<h2 class='date-header'>27 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='edit'></a>
<h3 class='post-title'>
<a href='http:// edited for anon.html'>edit</a>
</h3>
<div class='post-header-line-1'></div>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>edit this is post text - what i want</p>
<script type='text/javascript'>
          var permlink='edit';
          var title='edit';

          var spans = document.getElementsByTagName('span');
          var number = 0;
          for(i=0; i <spans.length; i++){
                var c = " " + spans[i].className + " ";
                if (c.indexOf("fullpost") != -1) {
                number++;
                }
                }

                if(number != memory){document.write('<p></p><a href=' + permlink + '>"'+ title + '" continues...</a>') }
           memory = number;
           </script>
<div style='clear: both;'></div>
</div>
<div class='post-footer'>
<p class='post-footer-line post-footer-line-1'>
<span class='post-author'>
Posted by
this is the author name, also want, have way to get
</span>
<span class='post-timestamp'>
at
<a class='timestamp-link' href='http://edit' title='permanent link'>2:53 pm</a>
</span>
<span class='post-comment-link'>
<a class='comment-link' href='edit' onclick=''>1 comments</a>
</span>
<span class='post-backlinks post-comment-link'>
<a class='comment-link' href='edit'>Links to this post</a>
</span>
<span class='post-icons'>
<span class='item-control blog-admin pid-edit'>
<a href='edit' title='Edit Post'>
<img alt='' class='icon-action' height='18' src='http://img2.blogblog.com/img/icon18_edit_allbkg.gif' width='18'/>
</a>
</span>
</span>
</p>
<p class='post-footer-line post-footer-line-2'>
<span class='post-labels'>
Labels:
<a href='edit' rel='tag'>edi</a>
</span>
</p>
<p class='post-footer-line post-footer-line-3'></p>
</div>
</div>
<h2 class='date-header'>26 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='5518681505930320089'></a>
<h3 class='post-title'>
<a href='edit'>edit</a>
</h3>
<div class='post-header-line-1'></div>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>edit post text, what I want.</p>
<script type='text/javascript'>
          var permlink='http://edit';
          var title='edit';

          var spans = document.getElementsByTagName('span');
          var number = 0;
          for(i=0; i <spans.length; i++){
                var c = " " + spans[i].className + " ";
                if (c.indexOf("fullpost") != -1) {
                number++;
                }
                }

                if(number != memory){document.write('<p></p><a href=' + permlink + '>"'+ title + '" continues...</a>') }
           memory = number;
           </script>
<div style='clear: both;'></div>
</div>
<div class='post-footer'>
<p class='post-footer-line post-footer-line-1'>
<span class='post-author'>
Posted by
edit author name
</span>
<span class='post-timestamp'>
at
<a class='timestamp-link' href='edit' title='permanent link'>9:00 am</a>
</span>
<span class='post-comment-link'>
<a class='comment-link' href='edit' onclick=''>5
comments</a>
</span>
<span class='post-backlinks post-comment-link'>
<a class='comment-link' href='edit'>Links to this post</a>
</span>
<span class='post-icons'>
<span class='item-control blog-admin pid-edit'>
<a href='edit' title='Edit Post'>
<img alt='' class='icon-action' height='18' src='http://img2.blogblog.com/img/icon18_edit_allbkg.gif' width='18'/>
</a>
</span>
</span>
</p>
<p class='post-footer-line post-footer-line-2'>
<span class='post-labels'>
Labels:
<a href='edit' rel='tag'>edit</a>,
<a href='edit' rel='tag'>edit</a>
</span>
</p>
<p class='post-footer-line post-footer-line-3'></p>
</div>
</div>
<h2 class='date-header'>22 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='edit'></a>

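Whew! So I have maybe 20 or so files, each with 1 to 10 posts in it (this one has 2)... A CSV or Excel file would be lovely, laid out in columns like this, one post per row:

date, author, title, post content

I would settle for a file with just the post content, with some space between each post. I am fine with some links in the text and some bold, lists, and the like, but I do not want all the messy script. Thanks.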

Best answer

Here is an example for a single page that contains multiple posts:

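from bs4 import BeautifulSoup


soup = BeautifulSoup(open('test.html'))
posts = []
# Each post lives in its own <div class='post ...'>
for post in soup.find_all('div', class_='post'):
    title = post.find('h3', class_='post-title').text.strip()
    # The author span starts with the literal text "Posted by"
    author = post.find('span', class_='post-author').text.replace('Posted by', '').strip()
    # .p takes only the first <p> of the body, which skips the <style> and <script> blocks
    content = post.find('div', class_='post-body').p.text.strip()
    # The date header is the nearest preceding <h2> sibling of the post
    date = post.find_previous_sibling('h2', class_='date-header').text.strip()

    posts.append({'title': title,
                  'author': author,
                  'content': content,
                  'date': date})
print posts  # Python 2 print statement, matching the python-2.7 tag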

For the HTML you posted, it prints:

[{'content': u'edit this is post text - what i want', 
  'date': u'27 February, 2007', 
  'author': u'this is the author name, also want, have way to get', 
  'title': u'edit'}, 
 {'content': u'edit post text, what I want.', 
  'date': u'26 February, 2007', 
  'author': u'edit author name', 
  'title': u'edit'}]
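
The example above keeps only the first paragraph of each post and prints a list of dicts. If a post has several paragraphs and the goal is the CSV layout from the question, something along the following lines might work. This is only a sketch under a few assumptions: it is written for Python 3 (the question targets 2.7, where the csv file handling differs), it passes "html.parser" explicitly, and the names extract_posts, write_posts_csv, and SAMPLE are made up for illustration. The sample HTML is a shortened stand-in for the Blogger markup in the question.

```python
import csv
import io

from bs4 import BeautifulSoup

FIELDS = ["date", "author", "title", "content"]


def _text(tag):
    """Return the stripped text of a tag, or an empty string if it is missing."""
    return tag.get_text(" ", strip=True) if tag is not None else ""


def extract_posts(html):
    """Return one dict per <div class='post ...'> in a Blogger-style page."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for post in soup.find_all("div", class_="post"):
        body = post.find("div", class_="post-body")
        if body is None:
            continue
        # Remove <script> and <style> so they cannot leak into the text,
        # even if one happens to be nested inside a paragraph.
        for junk in body.find_all(["script", "style"]):
            junk.decompose()
        # Collect every paragraph of the body, not just the first one.
        paragraphs = [_text(p) for p in body.find_all("p")]
        author = _text(post.find("span", class_="post-author"))
        posts.append({
            "date": _text(post.find_previous_sibling("h2", class_="date-header")),
            "author": author.replace("Posted by", "", 1).strip(),
            "title": _text(post.find("h3", class_="post-title")),
            "content": "\n\n".join(t for t in paragraphs if t),
        })
    return posts


def write_posts_csv(posts, out):
    """Write the posts to an open text file object, one row per post."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(posts)


# Shortened, made-up stand-in for the markup shown in the question.
SAMPLE = """
<div class='blog-posts'>
<h2 class='date-header'>27 February, 2007</h2>
<div class='post uncustomized-post-template'>
<h3 class='post-title'><a href='#'>First title</a></h3>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<script type='text/javascript'>var title='x'; document.write('<p></p>');</script>
</div>
<div class='post-footer'><span class='post-author'>Posted by
Some Author</span></div>
</div>
</div>
"""

buffer = io.StringIO()
write_posts_csv(extract_posts(SAMPLE), buffer)
print(buffer.getvalue())
```

To cover the whole list of saved pages, the same two functions could be called in a loop over the file names in hafiles.txt, collecting the posts from every file and writing them to a single output file opened with newline="" (as the csv module recommends in Python 3).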

Regarding "python - Extracting multiple posts from a single blog archive page with BeautifulSoup, without the scripts", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24502139/
