I'm using lxml to get the header values of the HTML table below, but when I try to use XPath to parse the contents of the td elements inside the tbody's tr rows, it gives me empty values, because that data is generated dynamically. Below is the Python code I have and the output it produces. How can I get those values?
<table id="datatabl" class="display compact cell-border dataTable no-footer" role="grid" aria-describedby="datatabl_info">
<thead>
<tr role="row">
<th class="dweek sorting_desc" tabindex="0" aria-controls="datatabl" rowspan="1" colspan="1" style="width: 106px;" aria-label="Week: activate to sort column ascending" aria-sort="descending">Week</th>
<th class="dnone sorting" tabindex="0" aria-controls="datatabl" rowspan="1" colspan="1" style="width: 100px;" aria-label="None: activate to sort column ascending">None</th>
</tr>
</thead>
<tbody>
<tr class="odd" role="row">
<td class="sorting_1">2016-05-03</td>
<td>4.27</td>
<td>21.04</td>
</tr>
<tr class="even" role="row">
<td class="sorting_1">2016-04-26</td>
<td>4.24</td>
<td>95.76</td>
<td>21.04</td>
</tr>
</tbody>
</table>
My Python code:
from lxml import etree
import urllib
web = urllib.urlopen("http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx")
s = web.read()
html = etree.HTML(s)
## Get all 'tr'
tr_nodes = html.xpath('//table[@id="datatabl"]/thead')
print tr_nodes
## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("tr")]
print header
## tbody
tr_nodes_content = html.xpath('//table[@id="datatabl"]/tbody')
print tr_nodes_content
td_content = [[td[0].text for td in tr.xpath('td')] for tr in tr_nodes_content[0]]
print td_content
Terminal output:
[<Element thead at 0xb6b250ac>]
['Week']
[<Element tbody at 0xb6ad20cc>]
[]
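Incidentally, even on static markup the comprehension over the body cells has a bug: `td[0].text` indexes into a td's *child elements* (a plain td has none), whereas `td.text` reads the cell's text directly. A quick sketch against a static copy of the markup above shows the corrected expression working — and why the live request still returns nothing: the tbody rows are injected by JavaScript, so they never appear in the HTML that urllib downloads.

```python
from lxml import etree

# Static copy of the tbody rows from the question. On the live page these
# rows are injected by JavaScript, so urllib never sees them.
snippet = """
<table id="datatabl">
<tbody>
<tr><td class="sorting_1">2016-05-03</td><td>4.27</td><td>21.04</td></tr>
<tr><td class="sorting_1">2016-04-26</td><td>4.24</td><td>95.76</td><td>21.04</td></tr>
</tbody>
</table>
"""

html = etree.HTML(snippet)
rows = html.xpath('//table[@id="datatabl"]/tbody/tr')
# td.text (not td[0].text) reads each cell's own text node
td_content = [[td.text for td in row.xpath('td')] for row in rows]
print(td_content)
```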
Best answer
This fetches the data as JSON from the ajax request the page makes:
import json
import requests
from pprint import pprint as pp

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36',
    'Content-Type': 'application/json',
    'Referer': 'http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx',
    'X-Requested-With': 'XMLHttpRequest',
}

data = json.dumps({'area': 'conus', 'type': 'conus', 'statstype': '1'})
ajax = requests.post("http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM",
                     data=data,
                     headers=headers)
pp(ajax.json())
A snippet of the output:
{u'd': [{u'D0': 33.89,
u'D1': 14.56,
u'D2': 5.46,
u'D3': 3.44,
u'D4': 1.11,
u'Date': u'2016-05-03',
u'FileDate': u'20160503',
u'None': 66.11,
u'ReleaseID': 890,
u'__type': u'DroughtMonitorData.DmData'},
{u'D0': 39.64,
u'D1': 15.38,
u'D2': 5.89,
u'D3': 3.44,
u'D4': 1.11,
u'Date': u'2016-04-26',
u'FileDate': u'20160426',
u'None': 60.36,
u'ReleaseID': 889,
u'__type': u'DroughtMonitorData.DmData'},
{u'D0': 39.28,
u'D1': 15.44,
u'D2': 5.94,
u'D3': 3.44,
u'D4': 1.11,
u'Date': u'2016-04-19',
u'FileDate': u'20160419',
u'None': 60.72,
u'ReleaseID': 888,
u'__type': u'DroughtMonitorData.DmData'},
{u'D0': 39.2,
u'D1': 17.75,
u'D2': 6.1,
u'D3': 3.76,
u'D4': 1.71,
u'Date': u'2016-04-12',
u'FileDate': u'20160412',
u'None': 60.8,
u'ReleaseID': 887,
u'__type': u'DroughtMonitorData.DmData'},
{u'D0': 37.86,
u'D1': 16.71,
u'D2': 5.95,
u'D3': 3.76,
u'D4': 1.71,
u'Date': u'2016-04-05',
u'FileDate': u'20160405',
u'None': 62.14,
u'ReleaseID': 886,
u'__type': u'DroughtMonitorData.DmData'},
You can get all the data you want from the returned JSON. If you print(len(ajax.json()["d"])) you will see 853 rows come back, so in effect you seem to get all 35 pages of data in one go. Even if you did parse the page, you would still have to do it 34 more times; the JSON from the ajax request is easy to parse, and it all comes from a single POST.
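For example, the records in `d` can be flattened into rows matching the table's columns (Week, None, D0–D4). A minimal sketch, using two records copied from the output above in place of a live `ajax.json()["d"]`:

```python
# Sample records copied from the ajax output above; in practice this
# list would come from ajax.json()["d"].
records = [
    {"Date": "2016-05-03", "None": 66.11, "D0": 33.89, "D1": 14.56,
     "D2": 5.46, "D3": 3.44, "D4": 1.11},
    {"Date": "2016-04-26", "None": 60.36, "D0": 39.64, "D1": 15.38,
     "D2": 5.89, "D3": 3.44, "D4": 1.11},
]

# One list per table row: [Week, None, D0, D1, D2, D3, D4]
rows = [[r["Date"], r["None"], r["D0"], r["D1"], r["D2"], r["D3"], r["D4"]]
        for r in records]
print(rows[0])
```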
To filter by state, we set 'type' to 'state' and 'area' to 'CA':
data = json.dumps({'type': 'state', 'statstype': '1', 'area': 'CA'})
ajax = requests.post("http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM",
                     data=data,
                     headers=headers)
pp(ajax.json())
Again, a short snippet:
{u'd': [{u'D0': 95.73,
u'D1': 89.68,
u'D2': 74.37,
u'D3': 49.15,
u'D4': 21.04,
u'Date': u'2016-05-03',
u'FileDate': u'20160503',
u'None': 4.27,
u'ReleaseID': 890,
u'__type': u'DroughtMonitorData.DmData'},
{u'D0': 95.76,
u'D1': 90.09,
u'D2': 74.37,
u'D3': 49.15,
u'D4': 21.04,
u'Date': u'2016-04-26',
u'FileDate': u'20160426',
u'None': 4.24,
u'ReleaseID': 889,
u'__type': u'DroughtMonitorData.DmData'},
You can see the content matches what is displayed on the page.
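If you want the table in a spreadsheet, the same records can be written out as CSV with the standard library. A sketch, again assuming the response shape shown above (the `extrasaction="ignore"` drops bookkeeping keys such as `__type`):

```python
import csv
import io

# Sample record copied from the CA output above; in practice this
# would be ajax.json()["d"].
records = [
    {"Date": "2016-05-03", "None": 4.27, "D0": 95.73, "D1": 89.68,
     "D2": 74.37, "D3": 49.15, "D4": 21.04,
     "__type": "DroughtMonitorData.DmData"},
]

fields = ["Date", "None", "D0", "D1", "D2", "D3", "D4"]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```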
On the topic of python - getting all td contents inside the tbody of a tr in Python using lxml, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37080910/