Python 3: Reading an HTML table with Pandas

Tags: python html pandas

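I need some help here. I plan to extract all the statistics from this site: https://lotostats.ro/toate-rezultatele-win-for-life-10-20

My problem is that I can't read the table. I can't get it to work at all, not even for the first page.

Can anyone help?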

import requests
import lxml.html as lh
import pandas as pd

url='https://lotostats.ro/toate-rezultatele-win-for-life-10-20'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

#Create empty list
col=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d:"%s"'%(i,name))
    col.append((name,[]))

#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]

    #If row is not of size 10, the //tr data is not from our table 
    # if len(T)!=10:
    #     break

    #i is the index of our column
    i=0

    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Skip the first column; convert numeric values in the other columns to integers
        if i>0:
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df.head()   
print(df)  

Best answer

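The data is added dynamically, so it is not in the HTML that requests downloads. You can find the source in the browser's Network tab; it returns JSON.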

import requests


r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns%5B0%5D%5Bdata%5D=0&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=1&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=20&search%5Bvalue%5D=&search%5Bregex%5D=false&_=1564996040879').json()

You can URL-decode it and possibly (investigate this) drop the timestamp part, or simply replace it with a random number.

import requests

r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=0&length=20&search[value]=&search[regex]=false&_=1').json()
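The long query string may be easier to read and edit as a dictionary. The following is a minimal sketch, assuming the endpoint does not care how the request was assembled; `build_params` and `BASE_URL` are hypothetical names introduced here, not part of the original answer.

```python
from urllib.parse import urlencode

BASE_URL = "https://lotostats.ro/all-rez/win_for_life_10_20"


def build_params(start=0, length=20, draw=1, n_columns=2):
    """Rebuild the query parameters seen in the decoded URL as a dict (hypothetical helper)."""
    params = {"draw": draw}
    for i in range(n_columns):
        prefix = f"columns[{i}]"
        params[f"{prefix}[data]"] = i
        params[f"{prefix}[name]"] = ""
        params[f"{prefix}[searchable]"] = "true"
        params[f"{prefix}[orderable]"] = "false"
        params[f"{prefix}[search][value]"] = ""
        params[f"{prefix}[search][regex]"] = "false"
    params.update({
        "start": start,
        "length": length,
        "search[value]": "",
        "search[regex]": "false",
        "_": 1,  # stands in for the timestamp, as in the decoded URL
    })
    return params


# Preview the resulting URL without sending a request.
print(BASE_URL + "?" + urlencode(build_params(), safe="[]"))
```

With requests, the same dictionary can be passed as `requests.get(BASE_URL, params=build_params()).json()`, which percent-encodes the brackets for you.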

To view the lottery draws:

print(r['data'])
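Since the goal was a table, the rows can be given column names before loading them into Pandas. This is a minimal sketch, assuming each row in `r['data']` is a list with one entry per requested column (the request asks for columns 0 and 1); the sample payload and the names `col0` and `col1` are invented for illustration, so check them against the real response.

```python
def rows_to_records(payload, column_names):
    """Pair each row of payload['data'] with column names (hypothetical helper)."""
    return [dict(zip(column_names, row)) for row in payload["data"]]


# Invented payload shaped like the JSON response; the real values will differ.
sample = {
    "recordsFiltered": 2,
    "data": [["row 1 col 0", "row 1 col 1"], ["row 2 col 0", "row 2 col 1"]],
}

records = rows_to_records(sample, ["col0", "col1"])
print(records)
```

With pandas installed, `pd.DataFrame(records)` should then give the table the question was after.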

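The draw parameter seems to be related to the page of results; note that start also moves to the offset of the first row on that page. For example, the second page: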

https://lotostats.ro/all-rez/win_for_life_10_20?draw=2&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=20&length=20&search[value]=&search[regex]=false&_=1564996040880

You can change length to retrieve more results. For example, I can deliberately oversize it to get all the results in one request:

import requests

r = requests.get('https://lotostats.ro/all-rez/win_for_life_10_20?draw=1&columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&start=0&length=100000&search[value]=&search[regex]=false&_=1').json()

print(len(r['data']))

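Otherwise, set the length parameter to a fixed value, make an initial request, and compute the number of pages as the total record count (r['recordsFiltered']) divided by the results per page, rounded up.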

import math

total_results = r['recordsFiltered']
results_per_page = 20
num_pages = math.ceil(total_results/results_per_page)

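Then loop to fetch all the results (remember to change the draw parameter, and the start offset as in the second-page URL). Obviously, the fewer requests the better.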
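The loop described here could look roughly like the sketch below. It assumes the response carries `recordsFiltered` and `data` as shown earlier; `fetch_all`, `fetch_page`, and the fake fetcher are hypothetical names, and the fake data only exists so the loop can be tried without touching the site.

```python
import math


def fetch_all(fetch_page, results_per_page=20):
    """Collect every row by requesting one page at a time.

    fetch_page(draw, start, length) must return the decoded JSON dict.
    """
    first = fetch_page(1, 0, results_per_page)
    rows = list(first["data"])
    num_pages = math.ceil(first["recordsFiltered"] / results_per_page)
    for page in range(1, num_pages):
        # draw counts up per request; start is the offset of the page's first row
        payload = fetch_page(page + 1, page * results_per_page, results_per_page)
        rows.extend(payload["data"])
    return rows


# Stand-in for the real request (invented data, for offline testing only).
FAKE_ROWS = [[i, f"row {i}"] for i in range(45)]


def fake_fetch_page(draw, start, length):
    return {"recordsFiltered": len(FAKE_ROWS), "data": FAKE_ROWS[start:start + length]}


print(len(fetch_all(fake_fetch_page)))
```

For real use, `fetch_page` would wrap the `requests.get(...).json()` call with the draw, start, and length values placed into the query string.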

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/57355299/
