python - How to select a specific table on a web page with Python

I am new to programming and to Python, but I want to parse HTML in my Python script.

This is the web page: http://stock.finance.sina.com.cn/hkstock/finance/00759.html

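Question 1:

This page shows financial information for a specific stock. The four tables cover:

Financial summary
Balance sheet
Cash flow
Income statement

I want to extract the information in tables 3 and 4. Here is my code: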

# Python 3: urlopen lives in urllib.request
import urllib.request
from bs4 import BeautifulSoup

url = 'http://stock.finance.sina.com.cn/hkstock/finance/00759.html'

html = urllib.request.urlopen(url).read()   # .read() returns the whole response body as bytes
soup = BeautifulSoup(html, "lxml")

# find returns only the first matching table
table = soup.find("table", { "class" : "tab05" })
for row in table.findAll("tr"):
    print(row.findAll("td"))

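But this code only retrieves the first table. How should I change it to get the third and fourth tables? As far as I can tell, these four tables have no unique id or class name, and I don't know how to locate them.

Question 2:

This is also a Simplified Chinese web page. How do I preserve the original text in the output?

Question 3:

Each table has a drop-down menu in its upper right corner for choosing the reporting period: "All", "Full year", "Half year", "First quarter" and "Third quarter". Can urllib change this drop-down menu?

Thank you very much.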

Best answer

On that page, all four tables have the class name "tab05".

So all you have to do is change the .find call on soup to .findAll, and then all four tables are accessible.

# Python 3: urlopen lives in urllib.request
import urllib.request
from bs4 import BeautifulSoup

url = 'http://stock.finance.sina.com.cn/hkstock/finance/00759.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")

# findAll returns every matching table, in document order
tables = soup.findAll("table", { "class" : "tab05" })
print(len(tables))  # 4

for table in tables:
    for row in table.findAll("tr"):
        for col in row.findAll("td"):
            print(col.getText())
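
Since the question asks specifically for tables 3 and 4, here is a minimal sketch of picking them out by position. It is written for Python 3 and uses find_all and get_text, the current bs4 spellings of findAll and getText. SAMPLE_HTML is a hypothetical stand-in for the downloaded page, the table_rows helper is an illustrative name, and "html.parser" is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: four tables sharing one class.
SAMPLE_HTML = """
<table class="tab05"><tr><td>summary</td><td>1</td></tr></table>
<table class="tab05"><tr><td>balance</td><td>2</td></tr></table>
<table class="tab05"><tr><td>cash flow</td><td>3</td></tr></table>
<table class="tab05"><tr><td>income</td><td>4</td></tr></table>
"""


def table_rows(table):
    """Return the table as a list of rows, each a list of stripped cell texts."""
    return [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in table.find_all("tr")
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tables = soup.find_all("table", {"class": "tab05"})

# find_all returns the tables in document order, so indices 2 and 3
# are the third and fourth tables.
cash_flow = table_rows(tables[2])
income = table_rows(tables[3])
print(cash_flow)
print(income)
```

Selecting by index relies on the page keeping its tables in the same order, so it may be worth checking len(tables) before indexing.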

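As for the Simplified Chinese encoding, printing col.getText() shows the correct text in the terminal. If you want to write the text to a file instead, you have to encode the string, for example as gb2312.

f.write(col.getText().encode('gb2312'))  # f: a file opened in binary mode, e.g. open(name, 'wb')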
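
As an alternative to encoding every string by hand, the file itself can be opened with an encoding so that text is converted on write. The sketch below is only an illustration: the cells list stands in for the col.getText() results, the file name is made up, and UTF-8 is swapped in for gb2312 as an assumption about what the output should be.

```python
import io
import os
import tempfile

# Hypothetical cell texts standing in for col.getText() results.
cells = [u"现金流", u"损益表"]

path = os.path.join(tempfile.mkdtemp(), "cells.txt")

# io.open encodes text on write, so no manual .encode() call is needed.
with io.open(path, "w", encoding="utf-8") as f:
    for text in cells:
        f.write(text + u"\n")

# Read the file back with the same encoding to confirm the text survived.
with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read().splitlines()

print(read_back == cells)
```

Passing encoding="gb2312" instead should also work as long as every character in the text exists in that character set.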

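As for the third question, the data is rendered by JavaScript functions defined in datatable.js, so I don't think urllib alone can get all of it. It would be better to look at other libraries for a suitable approach.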
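
One possibility, if the script happens to load each period's data through a separate HTTP request (an assumption that has not been verified for this site), is to find that request in the browser's network tab and rebuild it by hand. The sketch below only shows how such a URL could be assembled in Python 3. The endpoint, the parameter names symbol and period, and the build_period_url helper are all hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
# The real ones, if they exist, have to be read from the browser's network tab.
BASE_URL = "http://example.com/finance/data"


def build_period_url(symbol, period):
    """Build the request URL for one stock and one reporting period."""
    query = urlencode({"symbol": symbol, "period": period})
    return BASE_URL + "?" + query


url = build_period_url("00759", "annual")
print(url)
```

The resulting URL could then be fetched with urlopen like the main page. If no such request exists, a browser automation library would be the more likely route.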

This question, "python - How to select a specific table on a web page with Python", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/34431833/
