python - Beautiful Soup not returning a list for an HTML table

I am trying to extract the description, date, and URL from the table on the following page:

https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts

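To keep my code consistent with 20 other URLs, I need the following logic: find the entire table body, then loop through it to find the applicable data.

The problem is that the table body comes back empty.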

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts")

c = r.content

soup = BeautifulSoup(c,"html.parser")

tbodies = soup.find_all("tbody")  # whole table text; THIS IS WHERE THE PROBLEM ORIGINATES

for item in tbodies:
    print(item.find("tr").text)  # test for tr text, i.e. product description
    print(item.find("a")["href"])  # url
    print(item.find_all("td")[0].text)  # date (won't work, but can't test until tbody returns data)

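What am I doing wrong?

Thanks in advance!

Best answer

The table on that page is loaded dynamically with JavaScript from another URL, so it is not in the HTML that requests downloads. Using the Developer Tools in your browser, you can copy that request and use it in your code. Then load the response into a pandas DataFrame and you are done: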

from io import StringIO

import pandas as pd
import requests

# Headers copied from the browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'en-US,en;q=0.5',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive',
    'Referer': 'https://www.fda.gov/safety/recalls-market-withdrawals-safety-alerts',
    'TE': 'Trailers',
}

# "_" appears to be a cache-busting timestamp sent by the page's script
params = (
    ('_', '1589124541273'),
)

response = requests.get('https://www.fda.gov/files/api/datatables/static/recalls-market-withdrawals.json', headers=headers, params=params)
response.raise_for_status()  # stop early on an HTTP error

# Wrap the text in StringIO; newer pandas versions deprecate passing a raw JSON string
df = pd.read_json(StringIO(response.text))

Using standard pandas methods, you can then extract the target information from the table.
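
As a rough sketch of that extraction step: the column names and sample rows below are hypothetical stand-ins, since the real field names in the JSON are not shown here and should be checked with df.columns first. It also assumes the description cell carries an HTML link, which is a guess based on how the table renders in the browser.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical rows standing in for the downloaded JSON; the real
# field names may differ, so inspect df.columns before adapting this.
sample_rows = [
    {"date": "05/08/2020",
     "description": '<a href="/safety/recalls/example-one">Example product one</a>'},
    {"date": "05/07/2020",
     "description": '<a href="/safety/recalls/example-two">Example product two</a>'},
]
df = pd.DataFrame(sample_rows)


def split_link(cell_html):
    """Return (text, href) for a cell that may contain an <a> tag."""
    soup = BeautifulSoup(cell_html, "html.parser")
    link = soup.find("a")
    href = link["href"] if link is not None and link.has_attr("href") else None
    return soup.get_text(strip=True), href


# Split every description cell into its visible text and its link target.
pairs = [split_link(cell) for cell in df["description"]]
df["text"] = [text for text, _ in pairs]
df["url"] = [href for _, href in pairs]

# Parse the date strings; the format is an assumption about the data.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

result = df[["date", "text", "url"]]
print(result)
```

Because split_link still uses Beautiful Soup, the same helper could be reused in the loops written for the other URLs, with only the data source changing.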

Another option, in this case, is to try to work with the FDA's API.
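
A minimal sketch of the API route, assuming the openFDA food enforcement endpoint is a suitable source. Whether it covers the same records as the web page above is not confirmed here, and the helper names summarize and fetch_reports are made up for illustration.

```python
import requests

# openFDA enforcement-report endpoint for food recalls. That it matches
# the table on the web page is an assumption to verify.
OPENFDA_URL = "https://api.fda.gov/food/enforcement.json"


def summarize(payload):
    """Pull a few fields out of an openFDA-style response dictionary."""
    rows = []
    for record in payload.get("results", []):
        rows.append({
            "date": record.get("report_date"),
            "firm": record.get("recalling_firm"),
            "description": record.get("product_description"),
        })
    return rows


def fetch_reports(limit=5):
    """Request a few reports from the API (needs network access)."""
    response = requests.get(OPENFDA_URL, params={"limit": limit}, timeout=30)
    response.raise_for_status()
    return summarize(response.json())


# Offline demonstration with a hand-written payload in the same shape.
sample_payload = {
    "results": [
        {
            "report_date": "20200506",
            "recalling_firm": "Example Foods Inc.",
            "product_description": "Example product",
        }
    ]
}
demo_rows = summarize(sample_payload)
print(demo_rows)
```

Calling fetch_reports() would hit the live service, so the demonstration above sticks to a fabricated payload; the values in it are placeholders, not real recall data.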

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/61714267/
