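python - BeautifulSoup AssertionError

I am trying to scrape this website into a .CSV, but I get an error message: AssertionError: 9 columns passed, passed data had 30 columns. The code is below; it is a bit messy because I exported it from a Jupyter Notebook.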

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://apps.azsos.gov/apps/election/cfs/search/CandidateSearch.aspx'

req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html)

type(soup)  # we see that soup is a BeautifulSoup object

column_headers = [th.getText() for th in 
                  soup.findAll('tr', limit=2)[1].findAll('th')]
column_headers # our column headers

data_rows = soup.findAll('th')[2:]  # skip the first 2 header rows

type(data_rows)  # now we have a list of table rows

candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
            for i in range(len(data_rows))]

df = pd.DataFrame(candidate_data, columns=column_headers)
df.head()  # head() lets us see the 1st 5 rows of our DataFrame by default

df.to_csv(r'C:/Dev/Sheets/Candiate_Search.csv', encoding='utf-8', index=False)

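Best answer

The page definitely contains a table, and you can parse out the column headers and pass them to your CSV. Visually, the table has 8 columns, but you parsed 9 headers. At this point you should probably inspect your data to see what you actually found; it may not be what you expect. But fine, suppose you check and find that one of them is a spacer column in the table that will be empty or junk, and you carry on.

These lines: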
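One way to do that inspection is to compare the number of headers with the number of cells in each parsed row before building the DataFrame. The sketch below is only illustrative: the sample values and the helper name summarize_widths are hypothetical stand-ins, not data from the actual page.

```python
from collections import Counter

# Hypothetical stand-ins for the parsed values; the real ones come from the page.
column_headers = ["Name", "Office", "Party", ""]
candidate_data = [
    ["Jane Doe", "Governor", "IND", ""],
    ["John Roe", "Senate", "IND", "", "extra", "cells"],
    [],
]


def summarize_widths(headers, rows):
    """Return the header count and a Counter mapping row length to row count."""
    return len(headers), Counter(len(row) for row in rows)


header_count, widths = summarize_widths(column_headers, candidate_data)
print("headers:", header_count)
for width, count in sorted(widths.items()):
    flag = "" if width == header_count else "  <- does not match the headers"
    print(f"rows with {width} cells: {count}{flag}")
```

Any row width that differs from the header count is a hint that the rows were collected from the wrong elements.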

data_rows = soup.findAll('th')[2:]  # skip the first 2 header rows

type(data_rows)  # now we have a list of table rows

candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
        for i in range(len(data_rows))]

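find every <th> instance in the page, and then every <td> inside each <th>, which is where it really goes off the rails. I am guessing you are not a web developer, but tables and their child elements (rows, aka <tr>; headers, aka <th>; cells, aka <td>) are used on most pages to lay out lots of visual elements, and sometimes also to organize tabular data.

Guess what? You found a lot of tables that are not this visual table, because you searched the entire page for <th> elements.

I suggest you pre-filter the whole soup by first finding the <table> or <div> that contains only the tabular data you are interested in, and then searching within that scope.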
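A small sketch of how a page-wide <th> search can produce rows of unexpected width. The markup is hypothetical (a layout <th> wrapping a nested table, followed by a data table) and is only one plausible way such a mismatch can arise; it is not the actual structure of the page in question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a layout table whose <th> wraps a nested table,
# followed by the data table. Not the real page.
html = """
<table id="layout"><tr><th>
  <table><tr><td>Home</td><td>Search</td><td>Help</td></tr></table>
</th></tr></table>
<table id="results">
  <tr><th>Name</th><th>Office</th></tr>
  <tr><td>Jane Doe</td><td>Governor</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Same pattern as the question: every <th> on the page, then every <td> inside it.
cells_per_th = [
    [td.get_text(strip=True) for td in th.find_all("td")]
    for th in soup.find_all("th")
]

for cells in cells_per_th:
    print(len(cells), cells)
```

Here the layout <th> contributes a "row" of three cells, while the real header cells contribute empty rows, so none of the collected rows line up with the two data columns.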
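A minimal sketch of that scoping approach, run against hypothetical inline markup rather than the live page. The id "results" is an assumption made for illustration; the real page would need to be inspected to find a selector that identifies its data table.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a layout table plus the data table we actually want.
html = """
<table id="layout"><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
<table id="results">
  <tr><th>Name</th><th>Office</th></tr>
  <tr><td>Jane Doe</td><td>Governor</td></tr>
  <tr><td>John Roe</td><td>Senate</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the search to the one table that holds the data.
# The id "results" is an assumption; use whatever identifies the real table.
table = soup.find("table", id="results")

# Work with rows (<tr>) inside that table only.
rows = table.find_all("tr")
column_headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
candidate_data = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in rows[1:]  # skip the header row
]

print(column_headers)
print(candidate_data)
```

Once every row has the same number of cells as there are headers, passing candidate_data and column_headers to pd.DataFrame should no longer trigger the column-count mismatch.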

Regarding python - BeautifulSoup AssertionError, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59634423/
