python - Accessing <td> elements that have no ID or class in a table on an HTML/CSS page, using Python and BeautifulSoup

Tags: python html css selenium beautifulsoup

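I'm scraping a page with Selenium, Python, and Beautiful Soup, and I want to output the rows of a table as comma-separated values. Unfortunately, the page's HTML is all over the place. So far I've managed to extract two columns by using the IDs of their elements. The remaining values sit in <td> elements that have no identifier such as a class or id. Here is a sample of the results.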

<table id="tblResults" style="z-index: 102; left: 18px; width: 956px; 
   height: 547px" cellspacing="1" width="956" border="0">
   <tr style="color:Black;background-color:LightSkyBlue;font-family:Arial;font-weight:normal;font-style:normal;text-decoration:none;">
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>Select</td>
      <td><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl00&#39;,&#39;&#39;)" style="color:Black;">T</a></td>
      <td><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl01&#39;,&#39;&#39;)" style="color:Black;">Party</a></td>
      <td>Opposite Party</td>
      <td style="width:50px;"><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl02&#39;,&#39;&#39;)" style="color:Black;">Type</a></td>
      <td style="width:100px;"><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl03&#39;,&#39;&#39;)" style="color:Black;">Book-Page</a></td>
      <td style="width:70px;"><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl04&#39;,&#39;&#39;)" style="color:Black;">Date</a></td>
      <td><a href="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl05&#39;,&#39;&#39;)" style="color:Black;">Town</a></td>
   </tr>
   <tr style="font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;">
      <td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
         <input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnView" value="View" id="ContentPlaceHolder1_grdResults_btnView_0" title="Click to view this document" style="width:50px;" />
      </td>
      <td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
         <input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_0" title="Click to add this document to My Documents" style="width:60px;" />
      </td>
      <td valign="top">
         <span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_0" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$CheckBox1" /></span>
      </td>
      <td>1</td>
      <td>
         <span id="ContentPlaceHolder1_grdResults_lblParty1_0" title="Grantors:
            ALBERT G MOSES FARM
            MOSES ALBERT G
            Grantees:
            ">MOSES ALBERT G</span>
      </td>
      <td>
         <span id="ContentPlaceHolder1_grdResults_lblParty2_0" title="Grantors:
            ALBERT G MOSES FARM
            MOSES ALBERT G
            Grantees:
            "></span>
      </td>
      <td valign="top">MAP</td>
      <td valign="top">- </td>
      <td valign="top">01/16/1953</td>
      <td valign="top">TOWN OF BINGHAMTON</td>
   </tr>
   <tr style="background-color:Gainsboro;font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;">
      <td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
         <input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnView" value="View*" id="ContentPlaceHolder1_grdResults_btnView_1" title="Click to view this document" style="width:50px;" />
      </td>
      <td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
         <input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_1" title="Click to add this document to My Documents" style="width:60px;" />
      </td>
      <td valign="top">
         <span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_1" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$CheckBox1" /></span>
      </td>
      <td>1</td>
      <td>
         <span id="ContentPlaceHolder1_grdResults_lblParty1_1" title="Grantors:
            MOSS EMMY-IND&amp;GDN
            MOSES ALEXANDRA/GDN
            Grantees:
            GOODRICH MERLE L
            GOODRICH CHARITY M
            ">MOSES ALEXANDRA/GDN</span>
      </td>
      <td>
         <span id="ContentPlaceHolder1_grdResults_lblParty2_1" title="Grantors:
            MOSS EMMY-IND&amp;GDN
            MOSES ALEXANDRA/GDN
            Grantees:
            GOODRICH MERLE L
            GOODRICH CHARITY M
            ">GOODRICH MERLE L</span>
      </td>
</table>

This is the script I've written so far, which works for two columns:

import re
from bs4 import BeautifulSoup

# Parse the saved results page; naming the parser explicitly avoids a bs4 warning.
with open('searched.html') as html:
    bsObj = BeautifulSoup(html, 'html.parser')

# Data rows are the <tr> elements whose inline style matches this pattern.
myTable = bsObj.find_all("tr", {"style": re.compile("font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;")})

for table_ in myTable:
    # The two party columns are the only cells whose <span> carries an id.
    party = table_.find("span", {"id": re.compile("Party1_")})
    oppositeParty = table_.find("span", {"id": re.compile("Party2_")})
    print(party.get_text() + "," + oppositeParty.get_text())

I tried using the children of myTable, as follows:

myTable.children

Best answer

If you just want to dump the contents, this should do it:

# bsObj is the BeautifulSoup object from the question, so this uses the bs4
# methods find()/find_all() rather than Selenium's find_element(s)_by_* calls.
myTable = bsObj.find("table", {"id": "tblResults"})
for row_ in myTable.find_all("tr"):
    columns = row_.find_all("td")
    # print the comma-delimited text of this row's columns
    print(",".join(column_.get_text(strip=True) for column_ in columns))

If you really want to scrape specific pieces of information, you'll need to give us more detail about your end goal.
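One possible reading of that goal is "every column after the three button/checkbox cells". Since those <td> elements have no id or class, a rough sketch is to pick them by position within each row. Everything named here (`SAMPLE_HTML`, `rows_to_csv`, `first_data_column`) is hypothetical, and the sample is a trimmed-down stand-in that assumes each real row has the same ten-cell layout as the first data row in the question:

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (assumption: real rows
# share the ten-cell layout of the first data row shown there).
SAMPLE_HTML = """
<table id="tblResults">
  <tr><td>&nbsp;</td><td>&nbsp;</td><td>Select</td><td>T</td><td>Party</td>
      <td>Opposite Party</td><td>Type</td><td>Book-Page</td><td>Date</td><td>Town</td></tr>
  <tr>
    <td><input type="submit" value="View" /></td>
    <td><input type="submit" value="My Doc" /></td>
    <td><input type="checkbox" /></td>
    <td>1</td>
    <td><span id="ContentPlaceHolder1_grdResults_lblParty1_0">MOSES ALBERT G</span></td>
    <td><span id="ContentPlaceHolder1_grdResults_lblParty2_0"></span></td>
    <td valign="top">MAP</td>
    <td valign="top">- </td>
    <td valign="top">01/16/1953</td>
    <td valign="top">TOWN OF BINGHAMTON</td>
  </tr>
</table>
"""


def rows_to_csv(html, first_data_column=3):
    """Return the table's rows as CSV text, skipping the leading control cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="tblResults")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for tr in table.find_all("tr"):
        # Only the row's own cells, addressed by position rather than by id.
        cells = tr.find_all("td", recursive=False)
        writer.writerow(
            [cell.get_text(strip=True) for cell in cells[first_data_column:]]
        )
    return out.getvalue()


print(rows_to_csv(SAMPLE_HTML), end="")
```

Using `csv.writer` instead of joining strings by hand means a value that itself contains a comma gets quoted automatically. If the real page has rows with fewer cells, the slice simply yields a shorter row, so that case may need extra checks.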

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/38605967/
