python - Tags in scraped content must be in the same order as in the original HTML file

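I am trying to build a web scraper. My scraper must find all lines that correspond to the selected tags and save them to a new file, file.md, in the same order as in the original HTML.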

The tags are specified in an array:

list_of_tags_you_want_to_scrape = ['h1', 'h2', 'h3', 'p', 'li']

Then this gives me only the content inside the specified tag:

soup_each_html = BeautifulSoup(particular_page_content, "html.parser")
inner_content = soup_each_html.find("article", "container")

Suppose this is the result:

<article class="container">
  <h1>this is headline 1</h1>
  <p>this is paragraph</p>
  <h2>this is headline 2</h2>
  <a href="bla.html">this won't be shown bcs 'a' tag is not in the array</a>
</article>

Then I have a method responsible for writing a line to file.md if a tag from the array exists in the content:

with open("file.md", 'a+') as f:
    for tag in list_of_tags_you_want_to_scrape:
        inner_content_tag = inner_content.find_all(tag)

        for x in inner_content_tag:
            f.write(str(x))
            f.write("\n")

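It does. The problem is that it loops over the array (for each) and saves all the <h1> elements first, all the <h2> elements second, and so on. That is because this is the order specified in the list_of_tags_you_want_to_scrape array.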

The result looks like this:

<article class="container">
  <h1>this is headline 1</h1>
  <h2>this is headline 2</h2>
  <p>this is paragraph</p>
</article>

So I would like them to be in the correct order, as in the original HTML. After the first <h1> there should be the <p> element.

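That means I would probably also need a for-each loop over inner_content that checks whether each line of inner_content matches at least one of the tags in the array. If it does, save it and move on to the next line. I tried this, going through inner_content line by line, but it gave me an error, and I am not sure it is the right approach. (First day using the BeautifulSoup module.)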

Any tips or suggestions on how to modify my method to achieve this? Thanks!

Best answer

To preserve the original order of the HTML input, you can loop recursively over the soup.contents attribute:

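from bs4 import BeautifulSoup as soup

def parse(content, to_scrape=('h1', 'h2', 'h3', 'p', 'li')):
    # Yield this node if its tag name is one we want to keep.
    if content.name in to_scrape:
        yield content
    # Recurse into the children in document order; nodes without a
    # 'contents' attribute (such as text nodes) fall back to an empty list.
    for child in getattr(content, 'contents', []):
        yield from parse(child, to_scrape)  # pass to_scrape along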

Example:

html = """   
<html>
  <body>
      <h1>My website</h1>
      <p>This is my first site</p>
      <h2>See a listing of my interests below</h2>
      <ul>
         <li>programming</li>
         <li>math</li>
         <li>physics</li>
      </ul>
      <h3>Thanks for visiting!</h3>
  </body>
</html>
"""

result = list(parse(soup(html, 'html.parser')))

Output:

[<h1>My website</h1>, <p>This is my first site</p>, <h2>See a listing of my interests below</h2>, <li>programming</li>, <li>math</li>, <li>physics</li>, <h3>Thanks for visiting!</h3>]

As you can see, the original order of the HTML is preserved, and the result can now be written to a file:

with open('file.md', 'w') as f:
   f.write('\n'.join(map(str, result)))

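Each bs4 object has, among others, a name and a contents attribute. The name attribute is the tag name itself, while the contents attribute stores all of the child HTML. parse is a generator: it first checks whether the bs4 object passed in has a tag name that belongs to the to_scrape list and, if so, yields that object. Finally, parse iterates over the contents of content and calls itself on each element.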
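As a point of comparison, here is a minimal sketch of a shorter route that is not part of the answer above: find_all() also accepts a list of tag names and returns the matches in document order. It reuses the HTML snippet from the question; the variable names tags_to_keep, article, in_order and names are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<article class="container">
  <h1>this is headline 1</h1>
  <p>this is paragraph</p>
  <h2>this is headline 2</h2>
  <a href="bla.html">skipped, 'a' is not in the list</a>
</article>
"""

tags_to_keep = ['h1', 'h2', 'h3', 'p', 'li']  # same list as in the question

soup = BeautifulSoup(html, 'html.parser')
article = soup.find('article', 'container')

# Passing a list to find_all() matches any of the names in a single pass
# over the tree, so the results come back in document order rather than
# grouped by tag name.
in_order = article.find_all(tags_to_keep)
names = [tag.name for tag in in_order]

# Write one matched element per line, as in the question.
with open('file.md', 'w') as f:
    f.write('\n'.join(map(str, in_order)))
```

Here names follows the order of the source HTML (h1, then p, then h2). The recursive parse generator is still useful when you need custom logic at each node; for a plain filter by tag name, the list form is probably enough.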

Regarding "python - Tags in scraped content must be in the same order as in the original HTML file", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56332967/
