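I am trying to extract the description, date, and URL from the table on the following page:
https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts
To keep my code consistent with the 20 other URLs I scrape, I need to keep the following logic: find the whole table body, then loop through it to pick out the relevant data.
The problem is that the table body comes back empty.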
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts")
c = r.content
soup = BeautifulSoup(c, "html.parser")
bodies = soup.find_all("tbody")  # whole table text; THIS IS WHERE THE PROBLEM ORIGINATES
for item in bodies:
    # find_all returns a list of rows, so read .text from each row, not from the list
    for row in item.find_all("tr"):
        print(row.text)  # test for tr text, i.e. product description
        print(row.find("a")["href"])  # url
        print(row.find_all("td")[0].text)  # date (can't test until tbody returns data)
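What am I doing wrong?
Thanks in advance!
Best answer
The table on that page is loaded dynamically with JavaScript from a separate URL, so it is not present in the HTML that requests receives. Using the Developer Tools in your browser, you can copy that request and use it in your code. Then load the result into a pandas DataFrame and you are done: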
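import io
import requests
import pandas as pd

# Headers copied from the browser's own request for the table data
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'en-US,en;q=0.5',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive',
    'Referer': 'https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts',
    'TE': 'Trailers',
}
# Query string copied from the browser request; it appears to be a cache-busting timestamp
params = (
    ('_', '1589124541273'),
)
response = requests.get('https://www.fda.gov/files/api/datatables/static/recalls-market-withdrawals.json', headers=headers, params=params)
response.raise_for_status()  # stop here if the server rejected the request
# Wrap the text in StringIO; recent pandas versions deprecate passing a raw JSON string
df = pd.read_json(io.StringIO(response.text))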
Using standard pandas methods, you can then extract the target information from the table.
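As a rough sketch of that last step, the snippet below selects and renames three columns from a small stand-in DataFrame. The column names (`date`, `description`, `url`), the sample rows, and the helper `extract_fields` are hypothetical placeholders, not the fields of the FDA JSON; inspect `df.columns` on the real response and substitute the actual names.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame built from the JSON response.
# The real column names will differ; check df.columns first.
sample = pd.DataFrame(
    {
        "date": ["05/08/2020", "05/07/2020"],
        "description": ["Example product A", "Example product B"],
        "url": ["/example/recall-a", "/example/recall-b"],
        "other": ["x", "y"],
    }
)


def extract_fields(df, date_col, description_col, url_col):
    """Return a copy holding only the three target columns, in a fixed order."""
    out = df[[description_col, date_col, url_col]].copy()
    out.columns = ["description", "date", "url"]
    # Parse the dates so rows can be sorted or filtered chronologically.
    out["date"] = pd.to_datetime(out["date"], format="%m/%d/%Y")
    return out


result = extract_fields(sample, "date", "description", "url")
for row in result.itertuples(index=False):
    print(row.description, row.date.date(), row.url)
```

If the real cells turn out to contain HTML fragments rather than plain text, they would need an extra parsing pass (for example with BeautifulSoup) before this selection step.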
Another option in this case is to try to work with the FDA's API.
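A minimal sketch of what the API route might look like, assuming the answer means openFDA and its enforcement-report endpoint. The endpoint path, the `search` and `limit` parameters, and the field names `product_description` and `report_date` are assumptions to verify against the openFDA documentation, and that dataset may not match the recalls page row for row. The helpers `build_query_url` and `extract_records` are illustrative names, and the sample payload is made up so the snippet runs offline.

```python
from urllib.parse import urlencode

# Assumption: openFDA's enforcement-report endpoint; verify the path and
# field names against the openFDA documentation before relying on them.
BASE_URL = "https://api.fda.gov/food/enforcement.json"


def build_query_url(limit=5, search=None):
    """Build a request URL with an optional search filter and a row limit."""
    params = {}
    if search:
        params["search"] = search
    params["limit"] = limit
    return BASE_URL + "?" + urlencode(params)


def extract_records(payload):
    """Pull the description and date out of a decoded JSON payload."""
    records = []
    for item in payload.get("results", []):
        records.append(
            {
                "description": item.get("product_description"),
                "date": item.get("report_date"),
            }
        )
    return records


# Made-up payload shaped like the assumed response, so no network is needed.
sample_payload = {
    "results": [
        {"product_description": "Example product", "report_date": "20200506"}
    ]
}

print(build_query_url(limit=2))
print(extract_records(sample_payload))
```

To fetch live data, the URL from `build_query_url` could be passed to `requests.get(...)` and the decoded `response.json()` handed to `extract_records`.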
Regarding "python - Beautiful Soup not returning a list for an HTML table", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61714267/