python - Scraping greatschools.org with BeautifulSoup returns an empty list

Tags: python beautifulsoup

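I've been learning to scrape the greatschools.org website with BeautifulSoup. Despite looking through various solutions here and elsewhere, I've hit a dead end. Using Chrome's "Inspect" feature, I can see that the site has table tags, but find_all('tr'), find_all('table'), and find_all('tbody') each return an empty list. What am I missing?

Here is the code block I'm using: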

import requests
from bs4 import BeautifulSoup

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"
page_response = requests.get(url)
content = BeautifulSoup(page_response.text, "html.parser")

table = content.find_all('table')
table

The output is: []

Thanks in advance for your help.

Best answer

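You can use Selenium, since the page appears to be dynamic. You can still parse with BeautifulSoup if you like. When the tags I'm after form a table, I prefer to read the HTML with pandas. You'll have to do a little work splitting the text in the first column into separate columns and so on, but that shouldn't be too hard.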
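To see why the original requests-based code comes back empty, it can help to parse markup shaped like what a JavaScript-rendered page sends before any script runs. The sketch below assumes the page builds its table in the browser, as suggested above; `raw_html` and the `search-results` id are hypothetical stand-ins, not the site's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw HTML that requests.get() receives from a
# JavaScript-rendered page: a placeholder element plus a script, no <table> yet.
raw_html = """
<html>
  <body>
    <div id="search-results"></div>
    <script>/* the browser runs this to build the table */</script>
  </body>
</html>
"""

soup = BeautifulSoup(raw_html, "html.parser")
tables = soup.find_all("table")
scripts = soup.find_all("script")

print(tables)        # empty: no <table> exists in the raw markup
print(len(scripts))  # number of <script> tags that a browser would execute
```

If the real response looks like this, BeautifulSoup is working correctly; the table simply does not exist until a browser executes the scripts.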

Let me know if this works for you.

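from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"

# Raw string so the backslashes in the Windows path are kept literally.
# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service(r'C:\chromedriver_win32\chromedriver.exe'))
driver.get(url)

# page_source holds the HTML after the browser has run the page's JavaScript
html = driver.page_source

# read_html returns a list of DataFrames, one per <table> found
table = pd.read_html(StringIO(html))
df = table[0]

driver.quit()  # end the browser session and the chromedriver process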

Output:

print (table[0])
                                               School                       ...                                                              District
0   9/10Above averageSouthern Lehigh Intermediate ...                       ...                                       Southern Lehigh School District
1   8/10Above averageHanover El School3890 Jackson...                       ...                                        Bethlehem Area School District
2   8/10Above averageLehigh Valley Charter High Sc...                       ...                        Lehigh Valley Charter High School For The Arts
3   6/10AverageCalypso El School1021 Calypso Ave, ...                       ...                                        Bethlehem Area School District
4   6/10AverageMiller Heights El School3605 Allen ...                       ...                                        Bethlehem Area School District
5   6/10AverageAsa Packer El School1650 Kenwood Dr...                       ...                                        Bethlehem Area School District
6   6/10AverageLehigh Valley Academy Regional Cs15...                       ...                                     Lehigh Valley Academy Regional Cs
7   5/10AverageNortheast Middle School1170 Fernwoo...                       ...                                        Bethlehem Area School District
8   5/10AverageNitschmann Middle School1002 West U...                       ...                                        Bethlehem Area School District
9   5/10AverageThomas Jefferson El School404 East ...                       ...                                        Bethlehem Area School District
10  4/10Below averageJames Buchanan El School1621 ...                       ...                                        Bethlehem Area School District
11  4/10Below averageLincoln El School1260 Gresham...                       ...                                        Bethlehem Area School District
12  4/10Below averageGovernor Wolf El School1920 B...                       ...                                        Bethlehem Area School District
13  4/10Below averageSpring Garden El School901 No...                       ...                                        Bethlehem Area School District
14  4/10Below averageClearview El School2121 Abing...                       ...                                        Bethlehem Area School District
15  4/10Below averageLiberty High School1115 Linde...                       ...                                        Bethlehem Area School District
16  4/10Below averageEast Hills Middle School2005 ...                       ...                                        Bethlehem Area School District
17  4/10Below averageFreedom High School3149 Chest...                       ...                                        Bethlehem Area School District
18  3/10Below averageMarvine El School1425 Livings...                       ...                                        Bethlehem Area School District
19  3/10Below averageWilliam Penn El School1002 Ma...                       ...                                        Bethlehem Area School District
20  3/10Below averageLehigh Valley Dual Language C...                       ...                            Lehigh Valley Dual Language Charter School
21  2/10Below averageBroughal Middle School114 Wes...                       ...                                        Bethlehem Area School District
22  2/10Below averageDonegan El School1210 East 4t...                       ...                                        Bethlehem Area School District
23  2/10Below averageFountain Hill El School1330 C...                       ...                                        Bethlehem Area School District
24  Currently unratedSt. Anne School375 Hickory St...                       ...                                                                   NaN

[25 rows x 7 columns]
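The School column in this output runs the rating, the rating label, and the school name and address together. Below is a minimal sketch of the splitting mentioned earlier, using only the standard `re` module. The function name `split_school_cell` and the list of labels are assumptions drawn from the rows printed above; other pages may use labels that this pattern does not cover.

```python
import re

# Optional "N/10" rating, then one of the labels seen in the output above,
# then everything else (school name and address, still joined together).
PATTERN = re.compile(
    r"^(?:(?P<rating>\d+)/10)?"
    r"(?P<label>Above average|Average|Below average|Currently unrated)"
    r"(?P<rest>.*)$"
)

def split_school_cell(cell):
    """Split one fused School cell into rating, label, and remaining text."""
    match = PATTERN.match(cell)
    if match is None:
        # Unknown layout: keep the text untouched rather than guessing.
        return {"rating": None, "label": None, "rest": cell}
    rating = match.group("rating")
    return {
        "rating": int(rating) if rating else None,
        "label": match.group("label"),
        "rest": match.group("rest"),
    }

# Cells copied from the printed output, truncated exactly as pandas showed them.
samples = [
    "9/10Above averageSouthern Lehigh Intermediate ...",
    "6/10AverageCalypso El School1021 Calypso Ave, ...",
    "Currently unratedSt. Anne School375 Hickory St...",
]

for cell in samples:
    print(split_school_cell(cell))
```

On the real DataFrame, something like `df["School"].apply(split_school_cell)` should give one dictionary per row. Separating the school name from the street address is harder to do reliably, because nothing in the fused text marks where the name ends.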

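Now, if you still want to use BeautifulSoup, perhaps because you also need to pull out links or other tags inside the table (maybe the table alone isn't enough for your needs?), you can carry on with bs4 as usual once you have page_response.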

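from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"

# Raw string keeps the backslashes in the Windows path literal.
# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service(r'C:\chromedriver_win32\chromedriver.exe'))
driver.get(url)

# HTML after the browser has rendered the page
page_response = driver.page_source
driver.quit()  # the browser is no longer needed once the HTML is captured

content = BeautifulSoup(page_response, 'html.parser')
table = content.find_all('table')
table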
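As an illustration of pulling links out of a table once the rendered HTML is available, here is a short sketch. The markup in `rendered_html` is hypothetical (the real page's tags, attributes, and URLs may differ), so treat the row and link structure as an assumption to check against the actual page source.

```python
from bs4 import BeautifulSoup

# Hypothetical rendered markup; the real page's structure may differ.
rendered_html = """
<table>
  <tbody>
    <tr><td><a href="/example/school-a/">School A</a></td><td>District X</td></tr>
    <tr><td><a href="/example/school-b/">School B</a></td><td>District Y</td></tr>
    <tr><td>No link here</td><td>District Z</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(rendered_html, "html.parser")

links = []
for row in soup.find_all("tr"):
    anchor = row.find("a", href=True)  # first link in the row, if any
    if anchor is not None:
        links.append((anchor.get_text(strip=True), anchor["href"]))

print(links)
```

Rows without a link are skipped, so the list holds only (text, href) pairs that actually appear in the table.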

Regarding "python - Scraping greatschools.org with BeautifulSoup returns an empty list", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54023608/
