Python3 BeautifulSoup: insert every tag (including nested ones) into a dictionary

Tags: python python-3.x beautifulsoup nested

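I am trying to parse some HTML with the BeautifulSoup library. What I want is to insert every tag and its content into a dictionary, but without adding the content of nested tags as a single block; instead, every nested child tag should be added to the dictionary as its own entry. I have tried many different approaches, and the closest I got was inserting each tag's content (nested content included) into the dictionary. Apologies if this description is confusing; the example below should make it clear.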

The HTML I am using for this small project is below (taken from https://www.crummy.com/software/BeautifulSoup/bs4/doc/):

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

The result I want is:

{0: <title>The Dormouse's story</title>,
 1: <body>
 2: <p class="title"><b>The Dormouse's story</b></p>,
 3: <p class="story">Once upon a time there were three little sisters; and their names were,
 4: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 5: <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and,
 6: <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;,
 7: and they lived at the bottom of a well.</p>,
 8: <p class="story">...</p>}

This is the code that has gotten me closest so far:

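# Collect the name of every tag in the document.
tags = []
for tag in soup.find_all():
    tags.append(tag.name)

elements = {}
for i, elem in enumerate(soup.find_all(tags)):
    # Note: `atts` is not a Tag attribute (the attribute dict is `attrs`);
    # BeautifulSoup treats it as a lookup for a child <atts> tag, which is None here.
    elements[i] = elem.contents, elem.atts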

This is the result when I evaluate elements in the python3 console:

>>> elements
{0: ([<head><title>The Dormouse's storytitle</title>head</head>, '\n', <body>
<p class="title"><b>The Dormouse's storyb</b>p</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsiea</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Laciea</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tilliea</a>;
and they lived at the bottom of a well.p</p>
<p class="story">...p</p>
body</body>, 'html'], None), 
 1: ([<title>The Dormouse's storytitle</title>, 'head'], None), 
 2: (["The Dormouse's story", 'title'], None), 
 3: (['\n', <p class="title"><b>The Dormouse's storyb</b>p</p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsiea</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Laciea</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tilliea</a>;
and they lived at the bottom of a well.p</p>, '\n', <p class="story">...p</p>, '\n', 'body'], None), 
 4: ([<b>The Dormouse's storyb</b>, 'p'], None), 
 5: (["The Dormouse's story", 'b'], None), 
 6: (['Once upon a time there were three little sisters; and their names were\n', <a class="sister" href="http://example.com/elsie" id="link1">Elsiea</a>, ',\n', <a class="sister" href="http://example.com/lacie" id="link2">Laciea</a>, ' and\n', <a class="sister" href="http://example.com/tillie" id="link3">Tilliea</a>, ';\nand they lived at the bottom of a well.', 'p'], None), 
 7: (['Elsie', 'a'], None), 
 8: (['Lacie', 'a'], None), 
 9: (['Tillie', 'a'], None), 
 10: (['...', 'p'], None)}
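
The `None` in every tuple above most likely comes from `elem.atts`: BeautifulSoup resolves an unknown attribute name on a tag as a search for a child tag with that name, while the attribute dictionary is called `attrs`. A minimal sketch of the difference, using a single link from the sample HTML (the variable names `link`, `missing` and `attributes` are illustrative only):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    'html.parser',
)
link = soup.a

# An unknown attribute name is treated as "find a child tag called atts",
# and this <a> has no such child.
missing = link.atts

# `attrs` is the dictionary of HTML attributes; multi-valued attributes
# such as class are stored as lists.
attributes = link.attrs

print(missing)
print(attributes)
```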

This is clearly not what I need, because the nested tags are repeated over and over.
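
The repetition happens because `find_all()` returns every descendant tag, and each tag's `contents` also holds its children. One possible way to avoid it, sketched here on a shortened version of the sample paragraph, is to store only the strings that are direct children of each tag. This is an illustrative alternative, not the approach taken in the answer below; the tuple layout `(name, attrs, own_text)` is an assumption made for the example.

```python
from bs4 import BeautifulSoup

html = '''<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>'''

soup = BeautifulSoup(html, 'html.parser')

elements = {}
for i, tag in enumerate(soup.find_all(True)):
    # recursive=False restricts the search to direct children, so text that
    # belongs to nested tags is not repeated in the parent's entry.
    own_text = ''.join(tag.find_all(string=True, recursive=False)).strip()
    elements[i] = (tag.name, tag.attrs, own_text)

for key, value in elements.items():
    print(key, value)
```

Each tag then appears exactly once, and the text of the nested `<a>` tags shows up only in their own entries.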

Best answer

Code:

from bs4 import BeautifulSoup

data = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''

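element = {}
soup = BeautifulSoup(data, 'html.parser')
# Detach <title> so it can be kept, then discard the rest of <head>.
title = soup.title.extract()
soup.head.decompose()
body = soup.body.extract()
# Serialize back to markup and number each line of the result.
temp = str(title) + '\n' + str(body)
for i, line in enumerate(temp.split('\n')):
    # enumerate gives every line its own position, even if two lines are identical.
    element[str(i)] = line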

for key, value in element.items():
    print(key, value)

Output:

0 <title>The Dormouse's story</title>
1 <body>
2 <p class="title"><b>The Dormouse's story</b></p>
3 <p class="story">Once upon a time there were three little sisters; and their names were
4 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
5 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
6 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
7 and they lived at the bottom of a well.</p>
8 </body>
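
One caveat when numbering lines this way: looking each line up with `list.index` returns the position of its first occurrence, so two identical lines (for example a second `<p class="story">...</p>`) would collapse into one key. A small sketch of the difference, using made-up lines rather than the sample document:

```python
# Hypothetical serialized lines; the first and last are identical.
lines = ['<p>...</p>', '<p>story</p>', '<p>...</p>']

# Keyed by list.index: the duplicate line maps back to position 0.
by_index = {}
for line in lines:
    by_index[str(lines.index(line))] = line

# Keyed by enumerate: every line keeps its own position.
by_enumerate = {str(i): line for i, line in enumerate(lines)}

print(len(by_index), len(by_enumerate))
```

This line-based approach also relies on the source HTML placing each tag on its own line, so it is best treated as a quick fix for this particular input.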

A similar question to "Python3 BeautifulSoup: insert every tag (including nested ones) into a dictionary" can be found on Stack Overflow: https://stackoverflow.com/questions/49826686/
