使用 pandas read_html()
时无法正确获取行格式。我正在寻找对方法本身或底层 html(通过 bs4 抓取)的调整以获得所需的输出。
当前输出:
(注意它是 1 行包含两种类型的数据。理想情况下它应该分成 2 行,如下所示)
期望:
复制问题的代码:
import requests
import pandas as pd
from bs4 import BeautifulSoup # alternatively
url = "http://ufcstats.com/fight-details/bb15c0a2911043bd"
df = pd.read_html(url)[-1] # last table
df.columns = [str(i) for i in range(len(df.columns))]
# to get the html via bs4
headers = {
"Access-Control-Allow-Origin": "*",
"Access-Control-Allow-Methods": "GET",
"Access-Control-Allow-Headers": "Content-Type",
"Access-Control-Max-Age": "3600",
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
}
req = requests.get(url, headers)
soup = BeautifulSoup(req.content, "html.parser")
table_html = soup.find_all("table", {"class": "b-fight-details__table"})[-1]
最佳答案
如何(快速)修复beautifulsoup
您可以使用 table
中的标题创建一个 dict
,然后遍历每个 td
以附加存储在p
:
data = {}
header = [x.text.strip() for x in table_html.select('tr th')]
for i,td in enumerate(table_html.select('tr:has(td) td')):
data[header[i]] = [x.text.strip() for x in td.select('p')]
pd.DataFrame.from_dict(data)
例子
import requests
import pandas as pd
from bs4 import BeautifulSoup # alternatively
url = "http://ufcstats.com/fight-details/bb15c0a2911043bd"
# to get the html via bs4
headers = {
"Access-Control-Allow-Origin": "*",
"Access-Control-Allow-Methods": "GET",
"Access-Control-Allow-Headers": "Content-Type",
"Access-Control-Max-Age": "3600",
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
}
req = requests.get(url, headers)
soup = BeautifulSoup(req.content, "html.parser")
table_html = soup.find_all("table", {"class": "b-fight-details__table"})[-1]
data = {}
header = [x.text.strip() for x in table_html.select('tr th')]
for i,td in enumerate(table_html.select('tr:has(td) td')):
data[header[i]] = [x.text.strip() for x in td.select('p')]
pd.DataFrame.from_dict(data)
输出
关于python - 使用 pandas read_html 抓取时将表行分隔为 2,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/70179476/