python - BeautifulSoup - <em> tags are messing up my results

Tags: python web-scraping beautifulsoup

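I'm trying to put the headings inside the <strong> tags into headerList and the rest of the information into infoList. It works for everything except the <em> tags. I know, I know, the HTML is terrible, but I didn't make it. Anyway, here is the HTML I'm working with: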

<table border="0">
<tbody>
<tr>
<td>
<p><strong>MIKE ALSTOTT</strong></p>
<p><strong>Inducted: </strong>June 25, 2014 in West Lafayette, IN</p>
<p><strong>Date of Birth: </strong>December 21, 1973 in Joliet, IL</p>
<p><strong>High School Attended: </strong>Joliet Catholic Academy         <strong>Graduated: </strong>1992</p>
<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>
<p><strong>College Attended: </strong>Purdue University                       <strong>Graduated: </strong>1996</p>
<p><strong>College Honors:  </strong>4-year starting fullback; team MVP last 3 years; Purdue's all-time leading rusher with 3,635 yards, 5.6 yards per carry; holds PU record for career TDs with 42 and all-time, all-purpose yardage leader; holds several single season records; rushed for 100 yards or more 16 times; only PU player to accumulate more than 2,500 yards rushing and 1,000 yards receiving; as a senior, finished 11th in Heisman Trophy balloting, First Team All-Big Ten, and Gannett All-American.</p>
<p><strong>Professional Athletic Background:  </strong>Drafted 35th by NFL Tampa Bay Buccaneers 1996 and played there 12 seasons; forced to retire on January 24, 2008, due to neck injuries .</p>
<p><strong>Professional Athletic Honors:  </strong>Buccaneers won Super Bowl XXXVII in 2003; after being named 2nd team All-Pro in 1996, became first offensive player in Bucs' team history to be named 1st team Associated Press All-Pro 1997; selected All-Pro fullback 6 times; holds franchise record of 71 TDs; ran for over 5,000 yards in NFL career.</p>
<p><strong>Special Recognition:  </strong>Since retiring, has worked in private business in St. Petersburg area; established the Mike Alstott Family Foundation that supports the Children's Cancer Center, Ronald McDonald House, St. Petersburg All Children's Hospital, Sally House, and Big Brothers/Big Sisters in the St. Petersburg area; inducted into Purdue Athletics Hall of Fame 2006.</p>
<p><strong>Family:  </strong>Wife, Nicole; children, Griffin, Hannah, and Lexie.</p>
</td>
<td valign="top"><img src="/images/alstott_mike2%207-14.jpg" alt="" width="178" height="249" /></td>
</tr>
</tbody>
</table>

Here is my Python so far:

for strong_tag in soup.find_all('strong'):
    headers = strong_tag.text.replace(':', '').replace('\xa0', ' ').strip()

    info = strong_tag.next_sibling

    headerList.append(headers)
    infoList.append(info)

print(headerList)
print(infoList)

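This is the result I get, and it is what I need help with. The problem is with Parade: the entry holds only the <em> tag, and the rest of the text that follows it is not captured: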

['MIKE ALSTOTT', 'Inducted', 'Date of Birth', 'High School Attended', 'Graduated', 'High School Honors', 'College Attended', 'Graduated', 'College Honors', 'Professional Athletic Background', 'Professional Athletic Honors', 'Special Recognition', 'Family']
[None, 'June 25, 2014 in West Lafayette, IN', 'December 21, 1973 in Joliet, IL', 'Joliet Catholic Academy\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 ', '1992', <em>Parade </em>, 'Purdue University\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0', '1996', "4-year starting fullback; team MVP last 3 years; Purdue's all-time leading rusher with 3,635 yards, 5.6 yards per carry; holds PU record for career TDs with 42 and all-time, all-purpose yardage leader; holds several single season records; rushed for 100 yards or more 16 times; only PU player to accumulate more than 2,500 yards rushing and 1,000 yards receiving; as a senior, finished 11th in Heisman Trophy balloting, First Team All-Big Ten, and Gannett All-American.", 'Drafted 35th by NFL Tampa Bay Buccaneers 1996 and played there 12 seasons; forced to retire on January 24, 2008,\xa0due to neck injuries .', "Buccaneers won Super Bowl XXXVII in 2003; after being named 2nd team All-Pro in 1996, became first offensive player in Bucs' team history to be named 1st\xa0team Associated Press All-Pro 1997; selected All-Pro fullback 6 times; holds franchise record of 71 TDs; ran for over 5,000 yards in NFL career.", "Since retiring, has worked\xa0in private business in St. Petersburg\xa0area; established the Mike Alstott Family Foundation that supports the Children's Cancer Center, Ronald McDonald House, St. Petersburg All Children's Hospital, Sally House, and Big Brothers/Big Sisters in the St. Petersburg area; inducted into Purdue Athletics Hall of Fame 2006.", 'Wife, Nicole; children, Griffin, Hannah, and Lexie.']

Best answer
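
Some background on why the question's code stops at Parade: `.next_sibling` returns only the single node immediately after the `<strong>` tag, and in the "High School Honors" paragraph that node is the `<em>` tag rather than a text string. `.next_siblings` yields every following node in the same parent. The sketch below reproduces this on a shortened copy of that paragraph; the trimmed HTML string and the variable names are illustrative assumptions, not part of the original post.

```python
from bs4 import BeautifulSoup

# Shortened copy of the "High School Honors" paragraph (assumed for illustration).
html = "<p><strong>High School Honors: </strong><em>Parade </em>All-American</p>"
soup = BeautifulSoup(html, "html.parser")
strong_tag = soup.find("strong")

# .next_sibling is exactly one node: here the <em> Tag, not the trailing text.
first = strong_tag.next_sibling
print(repr(first))

# .next_siblings walks all following nodes: the <em> Tag, then the text node.
siblings = list(strong_tag.next_siblings)
print(siblings)
```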

Try this:

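from bs4 import BeautifulSoup, Tag

# soup is the BeautifulSoup object built from the HTML in the question
headerList = []
infoList = []

for strong_tag in soup.find_all('strong'):
    # Heading text without the colon, non-breaking spaces, or padding
    headers = strong_tag.text.replace(':', '').replace('\xa0', ' ').strip()

    # Walk every following sibling, not just the first one:
    # plain strings are kept as-is, tags such as <em> contribute their text
    info = ' '.join([i if not isinstance(i,Tag) else i.text for i in strong_tag.next_siblings])

    headerList.append(headers)
    infoList.append(info)

print(headerList)
print(infoList)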
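
One thing to watch with the code above: in paragraphs that hold two headings, such as "High School Attended" followed by "Graduated", the siblings of the first `<strong>` include the second `<strong>`, so its text ends up in the first entry as well. A possible refinement, offered only as a sketch, is to stop collecting at the next `<strong>` and collapse runs of whitespace (including `\xa0`). The helper name `text_until_next_strong`, the shortened HTML, and the `header_list`/`info_list` names are assumptions made for this example.

```python
from bs4 import BeautifulSoup, Tag


def text_until_next_strong(strong_tag):
    """Join the text of the siblings after strong_tag, stopping at the next <strong>."""
    parts = []
    for node in strong_tag.next_siblings:
        if isinstance(node, Tag) and node.name == "strong":
            break  # the next heading starts here
        # Tags such as <em> contribute their text; strings are used directly.
        parts.append(node.get_text() if isinstance(node, Tag) else str(node))
    # str.split() with no arguments also splits on non-breaking spaces.
    return " ".join("".join(parts).split())


# Shortened copy of two paragraphs from the question (assumed for illustration).
html = (
    "<p><strong>High School Attended: </strong>Joliet Catholic Academy\xa0\xa0 "
    "<strong>Graduated: </strong>1992</p>"
    "<p><strong>High School Honors: </strong><em>Parade </em>All-American; "
    "<em>Chicago Sun-Times </em>Illinois Player of the Year honors</p>"
)
soup = BeautifulSoup(html, "html.parser")

header_list, info_list = [], []
for strong_tag in soup.find_all("strong"):
    header = strong_tag.get_text().replace(":", "").replace("\xa0", " ").strip()
    header_list.append(header)
    info_list.append(text_until_next_strong(strong_tag))

print(header_list)
print(info_list)
```

Concatenating the pieces first and normalizing whitespace afterwards avoids the doubled space that joining with `' '` produces after "Parade ".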

The original question, "python - BeautifulSoup - <em> tags are messing up my results", is on Stack Overflow: https://stackoverflow.com/questions/66214184/
