Python BeautifulSoup: iterating over tags and attributes

Tags: python html selenium beautifulsoup tags

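I want to iterate over all the tags in a specific part of an HTML page. I'm using BeautifulSoup, but I could live without it and use only the Selenium library. Suppose I have the following HTML: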

<table id="myBSTable">   
    <tr>
        <th>Column A1</th>
        <th>Column B1</th>
        <th>Column C1</th>
        <th>Column D1</th>
        <th>Column E1</th>
    </tr>
    <tr>
        <td data="First Column Data"></td>
        <td data="Second Column Data"></td>
        <td title="Title of the First Row">Value of Row 1</td>
        <td>Beautiful 1</td>
        <td>Soup 1</td>
    </tr>
    <tr>
        <td></td>
        <td data-g="Second Column Data"></td>
        <td title="Title of the Second Row">Value of Row 2</td>
        <td>Selenium 1</td>
        <td>Rocks 1</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td title="Title of the Third Row">Value of Row 3</td>
        <td>Pyhon 1</td>
        <td>Boulder 1</td>
    </tr>
    <tr>
        <th>Column A2</th>
        <th>Column B2</th>
        <th>Column C2</th>
        <th>Column D2</th>
        <th>Column E2</th>
    </tr>
    <tr>
        <td data="First Column Data"></td>
        <td data="Second Column Data"></td>
        <td title="Title of the First Row">Value of Row 1</td>
        <td>Beautiful 2</td>
        <td>Soup 2</td>
    </tr>
    <tr>
        <td></td>
        <td data-g="Second Column Data"></td>
        <td title="Title of the Second Row">Value of Row 2</td>
        <td>Selenium 2</td>
        <td>Rocks 2</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td title="Title of the Third Row">Value of Row 3 2</td>
        <td>Pyhon 2</td>
        <td>Boulder 2</td>
    </tr>
</table>  

I have this part working fine:

#Selenium libraries
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

#BeautifulSoup
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://urltoget.com')   

table = browser.find_element_by_id('myBSTable')
bs_table = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
#So far so good
rows = bs_table.findAll('tr')
for tr in rows:
    #Here is where I need help
    #I want to iterate through all tags
    #but I don't know if is going to be a th or a td
    #At the same time I need to do something
    #if is a td or a th

This is what I would like to accomplish:

    #The following is a pseudo code
    for col in tr.tags:
        print col.name, col.value
        for attribute in col.attrs:
            print "    ", attribute.name, attribute.value
    #End pseudo code

Thanks.

Best answer

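You can match both td and th by passing a list of tag names to find_all. To get all of an element's attributes, use its .attrs attribute, which is a dictionary mapping attribute names to values: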

rows = bs_table.find_all('tr')
for row in rows:
    cells = row.find_all(['td', 'th'])
    for cell in cells:
        print(cell.name, cell.attrs)
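
As a rough sketch of how this could cover the rest of the question (branching on th versus td and walking each attribute), the example below parses a shortened copy of the table directly from a string, so it runs without Selenium. The names HTML and describe_cells are made up for illustration, and the built-in "html.parser" is assumed here instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: a shortened version of the table from the question.
HTML = """
<table id="myBSTable">
    <tr>
        <th>Column A1</th>
        <th>Column B1</th>
    </tr>
    <tr>
        <td data="First Column Data"></td>
        <td title="Title of the First Row">Value of Row 1</td>
    </tr>
</table>
"""


def describe_cells(html):
    """Return (tag name, text, attributes) for every th/td cell, row by row."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for row in soup.find_all("tr"):
        # recursive=False keeps only the row's direct children, so cells
        # of a nested table are not attributed to the outer row.
        for cell in row.find_all(["td", "th"], recursive=False):
            result.append((cell.name, cell.get_text(strip=True), dict(cell.attrs)))
    return result


for name, text, attrs in describe_cells(HTML):
    # cell.name tells header cells apart from data cells.
    kind = "header" if name == "th" else "data"
    print(kind, name, repr(text))
    # attrs is a plain dict, so iterate over its items.
    for attr_name, attr_value in attrs.items():
        print("    ", attr_name, attr_value)
```

With the Selenium setup from the question, the string passed to describe_cells would presumably be table.get_attribute('innerHTML') rather than the HTML constant.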

A similar question about Python BeautifulSoup, iterating over tags and attributes, can be found on Stack Overflow: https://stackoverflow.com/questions/44723713/
