python - Remove content between <div> and <a href> with Beautiful Soup

Tags: python html beautifulsoup

I have some code that parses a web page. I want to remove everything that sits between div, a href, and h1 tags.

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = "http://en.wikipedia.org/wiki/Viscosity"
try:
  ourUrl = opener.open(url).read()
except Exception,err:
  pass
soup = BeautifulSoup(ourUrl)                
dem = soup.findAll('p')     

for i in dem:
  print i.text

I want to print only the text that is not inside h1 or a href tags, as mentioned above.

Best answer

Edit: from the comments: "I want to return the text that is not between any <div></div> tags." This should drop every block that has a div tag among its parents:

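import bs4

raw = '''
<html>
Text <div> Avoid this </div>
<p> Nested <div> Don't get me either </div> </p>
</html>
'''

def check_for_div_parent(mark):
    # Walk up the tree and report whether any ancestor is a <div>.
    mark = mark.parent
    if mark is None:
        # Ran off the top of the document (text outside <html>).
        return False
    if 'div' == mark.name:
        return True
    if 'html' == mark.name:
        return False
    return check_for_div_parent(mark)

# Name the parser explicitly so the tree does not depend on what is installed.
soup = bs4.BeautifulSoup(raw, 'html.parser')

for text in soup.findAll(text=True):
    # Skip whitespace-only strings and anything nested inside a <div>.
    if text.strip() and not check_for_div_parent(text):
        print(text.strip())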

This yields only the two strings below and ignores everything inside the div tags:

Text
Nested
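
The question also mentions a href and h1, so the same idea can be widened to several tag names. The following is a minimal sketch, not part of the original answer: the helper name text_outside and the default tuple of excluded tags are assumptions made for illustration, and it relies on the .parents generator that Beautiful Soup provides on every node.

```python
from bs4 import BeautifulSoup


def text_outside(html, excluded=("div", "a", "h1")):
    """Return stripped text fragments that have no ancestor in `excluded`.

    `text_outside` and its default `excluded` tuple are hypothetical names
    chosen for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for node in soup.find_all(string=True):
        # .parents walks from the direct parent up to the document root.
        if any(parent.name in excluded for parent in node.parents):
            continue
        stripped = node.strip()
        if stripped:  # ignore whitespace-only fragments
            kept.append(stripped)
    return kept


sample = (
    '<html>Text <div>Avoid <a href="#">x</a></div>'
    '<p>Nested <a href="#">link</a> tail</p><h1>Title</h1></html>'
)
for fragment in text_outside(sample):
    print(fragment)
```

Because the check looks at every ancestor, a string is dropped even when the excluded tag is several levels above it.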

Original answer

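It is not clear what exactly you are trying to do. First, you should try to post a complete working example, since your code appears to be missing its imports. Second, Wikipedia seems to take a stance against "bots" or automated downloaders: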

Python's `urllib2`: Why do I get error 403 when I `urlopen` a Wikipedia page?

This can be avoided with the following lines of code:

import urllib2, bs4

url = r"http://en.wikipedia.org/wiki/Viscosity"

req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"}) 
con = urllib2.urlopen( req )
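
urllib2 exists only on Python 2; on Python 3 the same Request and urlopen calls live in urllib.request. A minimal sketch of the equivalent follows, assuming Python 3. The helper names make_request and fetch are hypothetical, and fetch is defined but not called here because it needs network access.

```python
import urllib.request


def make_request(url, user_agent="Magic Browser"):
    # Build a request that carries an explicit User-Agent header,
    # mirroring the urllib2.Request call above.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    # Open the URL and return the raw bytes of the response body.
    with urllib.request.urlopen(make_request(url)) as con:
        return con.read()


req = make_request("http://en.wikipedia.org/wiki/Viscosity")
print(req.full_url)
```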

Now that we have the page, I think you just want to extract the body text with bs4. I would do something like this:

soup = bs4.BeautifulSoup(con.read(), 'html.parser')
start_pos = soup.find('h1').parent

for p in start_pos.findAll('p'):
    # Join every text fragment in the paragraph, including text in nested tags.
    para = ''.join(p.findAll(text=True))
    print(para)

This gives me text that looks like:

The viscosity of a fluid is a measure of its resistance to gradual deformation by shear stress or tensile stress. For liquids, it corresponds to the informal notion of "thickness". For example, honey has a higher viscosity than water.[1] Viscosity is due to friction between neighboring parcels of the fluid that are moving at different velocities. When fluid is forced through a tube, the fluid generally moves faster near the axis and very slowly near the walls, therefore some stress (such as a pressure difference between the two ends of the tube) is needed to overcome the friction between layers and keep the fluid moving. For the same velocity pattern, the stress required is proportional to the fluid's viscosity. A liquid's viscosity depends on the size and shape of its particles and the attractions between the particles.[citation needed]
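
The paragraph extraction can be tried offline on a small HTML string. This is a sketch under assumptions: the function name paragraphs_near_h1 and the sample markup are made up for illustration, and get_text() is used in place of joining the text nodes by hand.

```python
from bs4 import BeautifulSoup


def paragraphs_near_h1(html):
    """Return the text of every <p> under the parent of the first <h1>.

    `paragraphs_near_h1` is a hypothetical helper name for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    if h1 is None:
        # No heading to anchor on, so there is nothing to extract.
        return []
    # get_text() concatenates all text inside the tag, nested tags included.
    return [p.get_text() for p in h1.parent.find_all("p")]


sample_page = (
    '<body><div id="content"><h1>Viscosity</h1>'
    "<p>The <b>viscosity</b> of a fluid.[1]</p></div>"
    "<p>Footer</p></body>"
)
for para in paragraphs_near_h1(sample_page):
    print(para)
```

Only paragraphs that share the h1's parent are returned, so the footer paragraph in the sample is left out.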

Regarding python - Remove content between <div> and <a href> with Beautiful Soup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18445389/
