python - BeautifulSoup code gives unexpected results (edited)

Tags: python html beautifulsoup

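(This question has been edited based on the feedback received. I will keep editing it in response to further feedback until the problem is solved.)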

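I am learning Python, and Beautiful Soup in particular, and I am working through Google's regular-expressions exercise, which uses a set of html files containing popular baby names for different years (e.g. baby1990.html). If you are interested, the dataset can be found here: https://developers.google.com/edu/python/exercises/baby-names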

Each html file contains a table with the baby name data.

There is another table before the baby names table. The html of the two tables' opening tags is, respectively:

<table width="100%" border="0" cellspacing="0" cellpadding="4"> # Unwanted table
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">  # targeted table

Note that the targeted table differs from the unwanted one by a single attribute: summary="formatting"
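As a small illustration of this difference, the sketch below uses a trimmed-down, made-up stand-in for the two tables (not the real file) and assumes the built-in 'html.parser'. It shows that shortcuts such as soup.table or soup.tr always return the first match in the document, whereas find() with a summary keyword selects the table that carries that attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the two tables described above.
html = """
<table width="100%" border="0" cellspacing="0" cellpadding="4">
  <tr><td>Social Security Online</td></tr>
</table>
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
  <tr><td>Popularity in 1990</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns the first <table> in document order.
first_table = soup.table

# A keyword argument filters on the attribute of the same name.
target = soup.find("table", summary="formatting")

# The attrs dictionary is an equivalent way to express the same filter.
same_target = soup.find("table", attrs={"summary": "formatting"})

print(first_table.get("summary"))   # None: the first table has no summary
print(target["summary"])            # formatting
```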

The first table (the one we have to skip) has the following html code:

<table width="100%" border="0" cellspacing="0" cellpadding="4">
  <tbody>
  <tr><td class="sstop" valign="bottom" align="left" width="25%">
      Social Security Online
    </td><td valign="bottom" class="titletext">
      <!-- sitetitle -->Popular Baby Names
    </td>
  </tr>
  <tr bgcolor="#333366"><td colspan="2" height="2"></td></tr>
  <tr><td class="graystars" width="25%" valign="top">
       <a href="../OACT/babynames/">Popular Baby Names</a></td><td valign="top"> 
      <a href="http://www.ssa.gov/"><img src="/templateimages/tinylogo.gif"
      width="52" height="47" align="left"
      alt="SSA logo: link to Social Security home page" border="0"></a><a name="content"></a>
      <h1>Popular Names by Birth Year</h1>September 12, 2007</td>
  </tr>
  <tr bgcolor="#333366"><td colspan="2" height="1"></td></tr>
</tbody></table>

In the targeted table, the code is as follows:

<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
<tr valign="top"><td width="25%" class="greycell">
<a href="../OACT/babynames/background.html">Background information</a>
<p><br />
&nbsp; Select another <label for="yob">year of birth</label>?<br />      
<form method="post" action="/cgi-bin/popularnames.cgi">
&nbsp; <input type="text" name="year" id="yob" size="4" value="1990">
<input type="hidden" name="top" value="1000">
<input type="hidden" name="number" value="">
&nbsp; <input type="submit" value="   Go  "></form>
</td><td>
<h3 align="center">Popularity in 1990</h3>
<p align="center">
<table width="48%" border="1" bordercolor="#aaabbb"
 cellpadding="2" cellspacing="0" summary="Popularity for top 1000">
<tr align="center" valign="bottom">
<th scope="col" width="12%" bgcolor="#efefef">Rank</th>
<th scope="col" width="41%" bgcolor="#99ccff">Male name</th>
<th scope="col" bgcolor="pink" width="41%">Female name</th></tr>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td> # Targeted row
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td> # Targeted row
etc...

As you can see, the distinguishing attribute of the targeted rows is align="right".
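A minimal sketch of selecting rows by that attribute follows. It uses a shortened copy of the rows shown above and assumes each row is closed with </tr>, which the real file does not do, so treat it as an illustration of the filter rather than a solution for the actual data.

```python
from bs4 import BeautifulSoup

# Shortened sample; the closing </tr> tags are an assumption added here.
html = """
<table summary="Popularity for top 1000">
<tr align="center"><th>Rank</th><th>Male name</th><th>Female name</th></tr>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td></tr>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Only rows whose align attribute equals "right" are returned,
# so the header row (align="center") is skipped.
data_rows = soup.find_all("tr", align="right")

# Collect the text of every cell, row by row.
names = [[td.get_text(strip=True) for td in row.find_all("td")]
         for row in data_rows]

print(names)
# [['1', 'Michael', 'Jessica'], ['2', 'Christopher', 'Ashley']]
```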

The code that is meant to extract the contents of the targeted cells is as follows:

with open("C:/Users/ALEX/MyFiles/JUPYTER NOTEBOOKS/google-python-exercises/babynames/baby1990.html","r") \
as f: soup = bs(f.read(), 'html.parser') 

print soup.tr
print "number of elemenents in the soup:" , len(soup)

right_table = soup.find("table", summary = "formatting")

print(right_table.prettify())

print "right_table" , len(right_table)

print(right_table[0].prettify())

for row in right_table[1].find_all("tr", allign = "right"):

     cells = row.find_all("td")

     try:
                            print "cells[0]: " , cells[0]
     except:
                            print "cells[0] : NaN"
     try:
                            print "cells[1]: " , cells[1]
     except:
                            print "cells[1] : NaN"    
     try:
                            print "cells[2]: " , cells[2]
     except:
                            print "cells[2] : NaN"

The output, which ends in an error message, is:

    <tr><td align="left" class="sstop" valign="bottom" width="25%">
      Social Security Online
    </td><td class="titletext" valign="bottom">
<!-- sitetitle -->Popular Baby Names
    </td>
</tr>
number of elemenents in the soup: 4
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-116-3ec77a65b5ad> in <module>()
      6 right_table = soup.find("table", summary = "formatting")
      7 
----> 8 print(right_table.prettify())
      9 
     10 print "right_table" , len(right_table)

C:\users\alex\Anaconda2\lib\site-packages\bs4\element.pyc in prettify(self, encoding, formatter)
   1198     def prettify(self, encoding=None, formatter="minimal"):
   1199         if encoding is None:
-> 1200             return self.decode(True, formatter=formatter)
   1201         else:
   1202             return self.encode(encoding, True, formatter=formatter)

C:\users\alex\Anaconda2\lib\site-packages\bs4\element.pyc in decode(self, indent_level, eventual_encoding, formatter)
   1164             indent_contents = None
   1165         contents = self.decode_contents(
-> 1166             indent_contents, eventual_encoding, formatter)
   1167 
   1168         if self.hidden:

C:\users\alex\Anaconda2\lib\site-packages\bs4\element.pyc in decode_contents(self, indent_level, eventual_encoding, formatter)
   1233             elif isinstance(c, Tag):
   1234                 s.append(c.decode(indent_level, eventual_encoding,
-> 1235                                   formatter))
   1236             if text and indent_level and not self.name == 'pre':
   1237                 text = text.strip()

... last 2 frames repeated, from the frame below ...

C:\users\alex\Anaconda2\lib\site-packages\bs4\element.pyc in decode(self, indent_level, eventual_encoding, formatter)
   1164             indent_contents = None
   1165         contents = self.decode_contents(
-> 1166             indent_contents, eventual_encoding, formatter)
   1167 
   1168         if self.hidden:

RuntimeError: maximum recursion depth exceeded while calling a Python object

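My questions are:

  1. Given that we passed the argument summary="formatting", why does the code return the first table (the unwanted one)?

  2. What does the error message mean, and why is it raised?

  3. What other errors, if any, can you spot in the code?

Any advice would be appreciated.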

Best answer

summary_ = "formatting"
allign_ = "right"

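Remove the trailing _ from these keyword arguments; only class_ takes one. Also note that the attribute is spelled align, not allign.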
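The effect of the stray underscore and of the misspelling can be checked on a tiny, made-up fragment (an illustration only, assuming the built-in 'html.parser'). Any keyword other than class_ is matched literally against an attribute of that exact name, so both variants silently find nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row fragment modelled on the question's markup.
html = """
<table summary="formatting">
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Looks for an attribute literally named "summary_", which does not exist.
with_underscore = soup.find("table", summary_="formatting")

# Matches the real "summary" attribute.
without_underscore = soup.find("table", summary="formatting")

# No tag has an "allign" attribute, so the result list is empty.
misspelled = soup.find_all("tr", allign="right")
correct = soup.find_all("tr", align="right")

print(with_underscore)                 # None
print(without_underscore.name)         # table
print(len(misspelled), len(correct))   # 0 1
```

Because find() returns None when nothing matches, a later call such as prettify() on that result would raise an AttributeError rather than print the table.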

It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_
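A minimal sketch of the class_ keyword described in the quote, using a made-up fragment modelled on the first table's cells and assuming the built-in 'html.parser'. The attrs dictionary form is shown as an equivalent spelling that avoids the reserved word altogether.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment based on the cells of the first table.
html = """
<table>
<tr><td class="sstop" width="25%">Social Security Online</td>
<td class="titletext">Popular Baby Names</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with the trailing underscore) filters on the CSS class.
by_keyword = soup.find("td", class_="titletext")

# The same filter written with an attrs dictionary.
by_attrs = soup.find("td", attrs={"class": "titletext"})

print(by_keyword.get_text(strip=True))   # Popular Baby Names
print(by_keyword["class"])               # ['titletext']
```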

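import bs4  # the 'lxml' parser used below also requires the lxml package

with open('/home/li/Downloads/google-python-exercises/babynames/baby2006.html') as f:
    soup = bs4.BeautifulSoup(f, 'lxml')
    # The inner table that holds the names has its own summary attribute
    table = soup.find(summary="Popularity for top 1000")
    for tr in table.find_all('tr'):
        # stripped_strings yields each piece of text with whitespace trimmed
        tds = list(tr.stripped_strings)
        print(tds)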

Output:

['Rank', 'Male name', 'Female name']
['1', 'Jacob', 'Emily']
['2', 'Michael', 'Emma']
['3', 'Joshua', 'Madison']
['4', 'Ethan', 'Isabella']
['5', 'Matthew', 'Ava']
['6', 'Daniel', 'Abigail']
['7', 'Christopher', 'Olivia']
['8', 'Andrew', 'Hannah']
['9', 'Anthony', 'Sophia']
['10', 'William', 'Samantha']
['11', 'Joseph', 'Elizabeth']
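The answer switches the parser from 'html.parser' to 'lxml' without saying why. One plausible explanation for the RuntimeError in the question, offered here as an assumption since the answer does not state it, is that the data rows never close their <tr> tags and 'html.parser' does not close them implicitly, so each row ends up nested inside the previous one. With 1000 rows the tree may become deep enough for a recursive prettify() to hit Python's recursion limit. The sketch below uses five made-up rows (Boy1, Girl1, and so on are placeholders) to show the nesting.

```python
from bs4 import BeautifulSoup

# Build a table whose rows never close <tr>, like the data rows in the question.
rows = "".join(
    '<tr align="right"><td>{0}</td><td>Boy{0}</td><td>Girl{0}</td>\n'.format(i)
    for i in range(1, 6)
)
html = '<table summary="Popularity for top 1000">\n' + rows + "</table>"

soup = BeautifulSoup(html, "html.parser")
all_rows = soup.find_all("tr")


def tr_depth(tag):
    """Count how many <tr> ancestors a tag has."""
    return len(tag.find_parents("tr"))


# Each row sits one level deeper than the row before it.
depths = [tr_depth(tr) for tr in all_rows]
print(depths)   # [0, 1, 2, 3, 4] with html.parser
```

With a parser that closes <tr> implicitly, such as the lxml parser used in the answer, the rows should come out as siblings instead.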

Regarding "python - BeautifulSoup code gives unexpected results (edited)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41599600/
