I want to extract data from a table on a website. The table spans 165 web pages, and I want to scrape all of them, but I can only get the first page.
I have tried pandas, BeautifulSoup, and requests:
offset = 0
teacher_list = []
while offset <= 4500:
    calls_df, = pd.read_html("https://projects.newsday.com/databases/long-island/teacher-administrator-salaries-2017-2018/?offset=0" + str(offset), header=0, parse_dates=["Start date"])
    offset = offset + 1500
    print(calls_df)
    # calls_df = "https:" + calls_df
    collection_page = requests.get(calls_df)
    page_html = collection_page.text
    soup = BeautifulSoup(page_html, "html.parser")
    print(page_html)
    print(soup.prettify())
    print(teacher_list)
    offset = offset + 1500
print(teacher_list, calls_df.to_csv("calls.csv", index=False))
Best Answer
You can use the step argument of range() to increment the offset in your URL:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

df = pd.DataFrame()
with requests.Session() as s:
    # 165 pages of 1,500 rows each: offsets 0, 1500, ..., 246000
    for i in range(0, 246001, 1500):
        url = 'https://projects.newsday.com/databases/long-island/teacher-administrator-salaries-2017-2018/?offset={}'.format(i)
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        # Parse the table on the current page and drop fully empty rows
        dfCurrent = pd.read_html(str(soup.select_one('html')))[0]
        dfCurrent.dropna(how='all', inplace=True)
        df = pd.concat([df, dfCurrent])
df = df.reset_index(drop=True)
df.to_csv(r"C:\Users\User\Desktop\test.csv", encoding='utf-8-sig')
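Before firing off real requests, the same range() call can be used offline to sanity-check that the loop bounds actually cover all 165 pages (a minimal sketch; no network access needed):

```python
# Enumerate the paginated URLs without making any requests.
base = "https://projects.newsday.com/databases/long-island/teacher-administrator-salaries-2017-2018/?offset={}"
urls = [base.format(i) for i in range(0, 246001, 1500)]

print(len(urls))   # one URL per page
print(urls[0])     # first page, offset=0
print(urls[-1])    # last page, offset=246000
```

range(0, 246001, 1500) yields 165 offsets (0 through 246000 inclusive), matching the 165 pages of the table.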
Regarding "python - How to scrape a table across multiple web page addresses using pandas and beautiful soup?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55315527/