python - How to handle different formatting inside specific tags with BeautifulSoup

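I want to be able to handle certain tags in an HTML file individually. So far my code works fine for every tag except two. Those two cells each contain two lines of text instead of one. Here is my code: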

from bs4 import BeautifulSoup

with open("F:/gpu.txt") as f:
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(f, "html.parser")
    sections = soup.find_all("td")
    #print(sections[2])
    # Print the cell whose row header matches one of the labels below
    for section in sections:
        if section.parent(text="GPU Name:"):
            print(section.text)
        elif section.parent(text="GPU Variant:"):
            print(section.text)
        elif section.parent(text="Bus Interface:"):
            print(section.text)
        elif section.parent(text="Transistors:"):
            print(section.text)

And so on. However, when we get to "Process Size:", the HTML is different:

        <th>Process Size:</th>
      <td>
        Something 
                <br />
                Something Else
              </td>
    </tr>

For all the other cases it looks like this:

      <th>GPU Name:</th>
      <td>BLABLA</td>
    </tr>
        <tr>
      <th>GPU Variant:</th>
      <td>BLABLA</td>
    </tr>
        <tr>
      <th>Bus Interface:</th>
      <td>BLABLA</td>
    </tr>
    <tr>
      <th>Transistors:</th>
      <td>BLABLA</td>
    </tr>

So when I run the script, I get the following result:

BLABLA
BLABLA

        Something 

                Something Else

BLABLA
BLABLA

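What I need is to be able to use "Something" and "Something Else" separately (without the blank lines and extra whitespace), and/or to merge them into a single string, for example: "Something/Something Else".

Sorry if my message isn't clear enough; English is not my native language. Thanks!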

Best answer

You can find all the text nodes inside the section (using text=True), strip each one, and join them with /:

print('/'.join(item.strip() for item in section.find_all(text=True)))
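As a rough sketch of an equivalent approach: BeautifulSoup tags also expose `stripped_strings`, which yields each text node already stripped and skips whitespace-only nodes, and `get_text()` accepts a separator plus `strip=True`. The `html` snippet and the `join_cell_text` helper below are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the "Process Size:" row from the question.
html = """
<table><tr>
  <th>Process Size:</th>
  <td>
    Something
        <br />
        Something Else
  </td>
</tr></table>
"""

def join_cell_text(cell, sep="/"):
    """Join the non-empty, stripped text nodes of a tag with sep."""
    return sep.join(cell.stripped_strings)

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

parts = list(cell.stripped_strings)        # the pieces, usable one by one
joined = join_cell_text(cell)              # the pieces merged into one string
same = cell.get_text("/", strip=True)      # built-in way to get the same string

print(parts)
print(joined)
```

Keeping `parts` as a list covers the "use them separately" case, while `joined` covers the single-string case.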

Example:

from bs4 import BeautifulSoup

data = """
<table>
    <tr>
      <th>GPU Name:</th>
      <td>BLABLA</td>
    </tr>
        <tr>
      <th>GPU Variant:</th>
      <td>BLABLA</td>
    </tr>
        <tr>
      <th>Process Size: </th>
      <td>BLABLA</td>
    </tr>
    <tr>
      <th>Transistors:</th>
      <td>BLABLA</td>
    </tr>
    <tr>
      <th>Process Size:</th>
      <td>
        Something
                <br />
                Something Else
              </td>
    </tr>
</table>
"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, "html.parser")
sections = soup.find_all("td")

for section in sections:
    if section.parent(text="GPU Name:"):
        print(section.text)
    elif section.parent(text="GPU Variant:"):
        print(section.text)
    elif section.parent(text="Process Size:"):
        # Multi-line cell: strip every text node and join the pieces with "/"
        print('/'.join(item.strip() for item in section.find_all(text=True)))
    elif section.parent(text="Transistors:"):
        print(section.text)

Prints:

BLABLA
BLABLA
BLABLA
Something/Something Else
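If the goal is to work with "Something" and "Something Else" individually, one possible variation (a sketch, not the answer's own method) is to collect every row into a dictionary keyed by its header label, with each value stored as a list of text pieces. The `rows_to_dict` helper and the shortened `data` table are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the table from the example above.
data = """
<table>
  <tr><th>GPU Name:</th><td>BLABLA</td></tr>
  <tr><th>Process Size:</th><td>
      Something
      <br />
      Something Else
  </td></tr>
</table>
"""

def rows_to_dict(soup):
    """Map each <th> label to the list of stripped text pieces in its <td>."""
    specs = {}
    for row in soup.find_all("tr"):
        header, cell = row.find("th"), row.find("td")
        if header is None or cell is None:
            continue  # skip rows that are not a label/value pair
        label = header.get_text(strip=True).rstrip(":")
        specs[label] = list(cell.stripped_strings)
    return specs

specs = rows_to_dict(BeautifulSoup(data, "html.parser"))

first, second = specs["Process Size"]    # the two pieces, separately
print(first)
print(second)
print("/".join(specs["GPU Name"]))       # single-piece cells work the same way
```

This avoids one `elif` branch per label, at the cost of assuming each row holds exactly one `th` and one `td`.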

Source: a similar question found on Stack Overflow: https://stackoverflow.com/questions/25417776/
