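I am scraping a page with Selenium, Python and Beautiful Soup, and I want to output the rows of a table as comma-separated values. Unfortunately, the page's HTML is all over the place. So far I have managed to extract two columns by using the elements' IDs. The remaining values are contained only in <td> tags that have no identifier such as a class or id. Below is a sample of the results.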
<table id="tblResults" style="z-index: 102; left: 18px; width: 956px;
height: 547px" cellspacing="1" width="956" border="0">
<tr style="color:Black;background-color:LightSkyBlue;font-family:Arial;font-weight:normal;font-style:normal;text-decoration:none;">
<td> </td>
<td> </td>
<td>Select</td>
<td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl00','')" style="color:Black;">T</a></td>
<td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl01','')" style="color:Black;">Party</a></td>
<td>Opposite Party</td>
<td style="width:50px;"><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl02','')" style="color:Black;">Type</a></td>
<td style="width:100px;"><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl03','')" style="color:Black;">Book-Page</a></td>
<td style="width:70px;"><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl04','')" style="color:Black;">Date</a></td>
<td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResults$ctl02$ctl05','')" style="color:Black;">Town</a></td>
</tr>
<tr style="font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;">
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnView" value="View" id="ContentPlaceHolder1_grdResults_btnView_0" title="Click to view this document" style="width:50px;" />
</td>
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_0" title="Click to add this document to My Documents" style="width:60px;" />
</td>
<td valign="top">
<span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_0" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl03$CheckBox1" /></span>
</td>
<td>1</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty1_0" title="Grantors:
ALBERT G MOSES FARM
MOSES ALBERT G
Grantees:
">MOSES ALBERT G</span>
</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty2_0" title="Grantors:
ALBERT G MOSES FARM
MOSES ALBERT G
Grantees:
"></span>
</td>
<td valign="top">MAP</td>
<td valign="top">- </td>
<td valign="top">01/16/1953</td>
<td valign="top">TOWN OF BINGHAMTON</td>
</tr>
<tr style="background-color:Gainsboro;font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;">
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnView" value="View*" id="ContentPlaceHolder1_grdResults_btnView_1" title="Click to view this document" style="width:50px;" />
</td>
<td align="left" valign="top" style="font-weight:normal;font-style:normal;text-decoration:none;">
<input type="submit" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$btnMyDoc" value="My Doc" id="ContentPlaceHolder1_grdResults_btnMyDoc_1" title="Click to add this document to My Documents" style="width:60px;" />
</td>
<td valign="top">
<span title="Click here to select this document"><input id="ContentPlaceHolder1_grdResults_CheckBox1_1" type="checkbox" name="ctl00$ContentPlaceHolder1$grdResults$ctl04$CheckBox1" /></span>
</td>
<td>1</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty1_1" title="Grantors:
MOSS EMMY-IND&GDN
MOSES ALEXANDRA/GDN
Grantees:
GOODRICH MERLE L
GOODRICH CHARITY M
">MOSES ALEXANDRA/GDN</span>
</td>
<td>
<span id="ContentPlaceHolder1_grdResults_lblParty2_1" title="Grantors:
MOSS EMMY-IND&GDN
MOSES ALEXANDRA/GDN
Grantees:
GOODRICH MERLE L
GOODRICH CHARITY M
">GOODRICH MERLE L</span>
</td>
</table>
Here is the script I have written so far, which works for two columns:
import re
from bs4 import BeautifulSoup

# Load the saved search-results page and parse it
with open('searched.html') as html:
    bsObj = BeautifulSoup(html, "html.parser")

# Data rows are matched by their inline style (the header row lacks font-size:Smaller)
myTable = bsObj.find_all("tr", {"style": re.compile("font-family:Arial;font-size:Smaller;font-weight:normal;font-style:normal;text-decoration:none;")})
for table_ in myTable:
    # The two party columns are the only cells whose <span> carries an id
    party = table_.find("span", {"id": re.compile("Party1_")})
    oppositeParty = table_.find("span", {"id": re.compile("Party2_")})
    print(party.get_text() + "," + oppositeParty.get_text())
I tried using the children of myTable as follows:
myTable.children
Best answer
If you just want to dump the contents, something like this should do it:
# bsObj is a BeautifulSoup object, so use find_all() rather than
# Selenium's find_element(s)_by_tag_name methods
myTable = bsObj.find_all("table")
for table_ in myTable:
    rows = table_.find_all("tr")
    for row_ in rows:
        columns = row_.find_all("td")
        # print out the comma-delimited text of the columns; print() ends the row
        print(",".join(column_.get_text(strip=True) for column_ in columns))
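As a self-contained sketch of the loop above, the following writes each row's cell text through Python's csv module, so that any commas inside a cell get quoted rather than breaking the output. The trimmed-down sample HTML and the function name rows_to_csv are assumptions made for illustration; they are not taken from the real page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample standing in for the real page.
SAMPLE_HTML = """
<table id="tblResults">
  <tr><td>T</td><td>Party</td><td>Type</td><td>Date</td><td>Town</td></tr>
  <tr>
    <td>1</td>
    <td><span id="ContentPlaceHolder1_grdResults_lblParty1_0">MOSES ALBERT G</span></td>
    <td valign="top">MAP</td>
    <td valign="top">01/16/1953</td>
    <td valign="top">TOWN OF BINGHAMTON</td>
  </tr>
</table>
"""


def rows_to_csv(html):
    """Return the text of every <td> in tblResults as CSV, one line per <tr>."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="tblResults")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in table.find_all("tr"):
        # get_text(strip=True) also picks up text nested in <span> or <a> tags
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        writer.writerow(cells)
    return out.getvalue()


print(rows_to_csv(SAMPLE_HTML))
```

Because the cells are taken in document order, none of them needs an id or class; the header row comes out as the first CSV line.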
If you really want to scrape specific pieces of information, you will need to give us more detail about your end goal.
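For instance, if the goal were the two party names plus the unlabeled Type, Book-Page, Date and Town cells, one possible sketch is to find the spans by id and take the remaining cells by position. The column indexes, the sample row, and the helper name extract_record are assumptions for illustration; the indexes would need to be checked against the real page's layout.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample row with the same ten-column layout as the question's table.
SAMPLE_ROW = """
<table id="tblResults">
  <tr>
    <td><input type="submit" value="View" /></td>
    <td><input type="submit" value="My Doc" /></td>
    <td><input type="checkbox" /></td>
    <td>1</td>
    <td><span id="ContentPlaceHolder1_grdResults_lblParty1_0">MOSES ALBERT G</span></td>
    <td><span id="ContentPlaceHolder1_grdResults_lblParty2_0"></span></td>
    <td valign="top">MAP</td>
    <td valign="top">- </td>
    <td valign="top">01/16/1953</td>
    <td valign="top">TOWN OF BINGHAMTON</td>
  </tr>
</table>
"""

# Assumed positions of the cells that carry no id or class.
TYPE_COL, BOOK_PAGE_COL, DATE_COL, TOWN_COL = 6, 7, 8, 9


def extract_record(row):
    """Combine the id-based span lookups with positional <td> lookups."""
    # recursive=False keeps only the row's own cells, not nested ones
    cells = row.find_all("td", recursive=False)
    party = row.find("span", id=re.compile(r"lblParty1_\d+$"))
    opposite = row.find("span", id=re.compile(r"lblParty2_\d+$"))
    return [
        party.get_text(strip=True) if party else "",
        opposite.get_text(strip=True) if opposite else "",
        cells[TYPE_COL].get_text(strip=True),
        cells[BOOK_PAGE_COL].get_text(strip=True),
        cells[DATE_COL].get_text(strip=True),
        cells[TOWN_COL].get_text(strip=True),
    ]


soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
row = soup.find("table", id="tblResults").find("tr")
print(",".join(extract_record(row)))
```

Positional lookups are brittle if the site adds or reorders columns, so it may be worth checking the header row's text before trusting the indexes.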
Regarding "python - accessing <td> elements without an ID or class in a table when using Python and BeautifulSoup on an HTML/CSS page", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38605967/