I'm having some trouble extracting specific elements with BS4. The data is taken from the Texas Department of Corrections Executed Inmates page.
I've attached a screenshot for better understanding.
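Inside each tr tag there are multiple td tags containing text such as first name, last name, TDCJ number, age, date, and so on.
How can I make BS4 skip the first tr tag (it holds the column names) and, for every subsequent tr tag, extract the text from its td tags?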
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

def main():
    gettabledata()
    lstofinmates = list()

def gettabledata():
    with urlopen('https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html') as response:
        soup = BeautifulSoup(response, 'html.parser')

    with open('exinmates.csv', 'w', newline='') as output_file:
        inmate_file_writer = csv.DictWriter(output_file,
                                            fieldnames=['First Name', 'Last Name', 'Execution Number',
                                                        'Last Statement', 'TDCJ Number', 'Age', 'Date Executed', 'Race',
                                                        'County'],
                                            extrasaction='ignore',
                                            delimiter=',', quotechar='"')
        inmate_file_writer.writeheader()

    table = soup.find('table').find('tbody')
    print(table)

if __name__ == '__main__':
    main()
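I was thinking of building a LOD (list of dictionaries) structure, where each dictionary holds one inmate's information: the text from the td fields goes into a dictionary, and each dictionary is appended to a list. The problem is that I can't find a way to skip the first tr tag, or to iterate over the remaining tr tags so I can add their contents to dictionaries. Any suggestions/help? Thanks!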
Best answer
Here is something to help you get started:
from bs4 import BeautifulSoup
html = '''<h1>Executed Offenders</h1>
<table class="os" width="100%">
<tbody>
<tr><th scope="col">Execution</th><th scope="col">Link</th><th scope="col">Link</th><th scope="col">Last Name</th><th scope="col">First Name</th><th scope="col">TDCJ Number</th><th scope="col">Age</th><th scope="col">Date</th><th scope="col">Race</th><th scope="col">County</th></tr>
<tr><td>542</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Bigby</td><td>James</td><td>997</td><td>61</td><td>3/14/2017</td><td>White</td><td>Tarrant</td></tr>
<tr><td>541</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Ruiz</td><td>Rolando</td><td>999145</td><td>44</td><td>3/07/2017</td><td>Hispanic</td><td>Bexar</td></tr>
<tr><td>540</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Edwards</td><td>Terry</td><td>999463</td><td>43</td><td>1/26/2017</td><td>Black</td><td>Dallas</td></tr>
<tr><td>539</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Wilkins</td><td>Christopher</td><td>999533</td><td>48</td><td>01/11/2017</td><td>White</td><td>Tarrant</td></tr>
<tr><td>538</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Fuller</td><td>Barney</td><td>999481</td><td>58</td><td>10/05/2016</td><td>White</td><td>Houston</td></tr>
</tbody>
</table>'''
soup = BeautifulSoup(html, 'html.parser')
rows = iter(soup.find('table').find_all('tr'))
# skip first row
next(rows)
for row in rows:
    for cell in row.find_all('td'):
        print(cell)
    print()
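To get from printing cells to the list of dictionaries the question asks about, one possible continuation is sketched below. It is only an illustration: the `rows_to_dicts` helper, the `FIELDNAMES` and `COLUMN_INDEXES` constants, and the choice to skip the two link columns are assumptions made for this example, and the column positions are taken from the sample table above rather than checked against the live page. It also skips the header row by slicing the list of rows (`[1:]`) instead of calling `next()` on an iterator.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of the sample table from the answer above (two data rows).
html = '''<table class="os" width="100%">
<tbody>
<tr><th scope="col">Execution</th><th scope="col">Link</th><th scope="col">Link</th><th scope="col">Last Name</th><th scope="col">First Name</th><th scope="col">TDCJ Number</th><th scope="col">Age</th><th scope="col">Date</th><th scope="col">Race</th><th scope="col">County</th></tr>
<tr><td>542</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Bigby</td><td>James</td><td>997</td><td>61</td><td>3/14/2017</td><td>White</td><td>Tarrant</td></tr>
<tr><td>541</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Ruiz</td><td>Rolando</td><td>999145</td><td>44</td><td>3/07/2017</td><td>Hispanic</td><td>Bexar</td></tr>
</tbody>
</table>'''

# Hypothetical column names for this sketch; the two link columns are left out.
FIELDNAMES = ['Execution Number', 'Last Name', 'First Name', 'TDCJ Number',
              'Age', 'Date Executed', 'Race', 'County']
# Positions of the matching <td> cells within each row of the sample table.
COLUMN_INDEXES = [0, 3, 4, 5, 6, 7, 8, 9]


def rows_to_dicts(soup):
    """Return one dict per data row, skipping the header row."""
    inmates = []
    # [1:] drops the first <tr>, which only contains <th> column names.
    for row in soup.find('table').find_all('tr')[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if len(cells) <= max(COLUMN_INDEXES):
            continue  # ignore rows that do not have the expected cells
        inmates.append({name: cells[i]
                        for name, i in zip(FIELDNAMES, COLUMN_INDEXES)})
    return inmates


soup = BeautifulSoup(html, 'html.parser')
inmates = rows_to_dicts(soup)

# Written to an in-memory buffer so the sketch has no side effects; pass
# open('exinmates.csv', 'w', newline='') instead to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(inmates)
print(buffer.getvalue())
```

Because each dictionary's keys match `fieldnames`, `csv.DictWriter.writerows` can write the whole list in one call. If the real page lays its columns out differently, `COLUMN_INDEXES` would need adjusting.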
This question, "python - How to iterate over <td> tags using BS4?", comes from Stack Overflow: https://stackoverflow.com/questions/43096220/