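python - Scraping data from a reduced table

Tags: python beautifulsoup

Using Beautiful Soup, and isolating my web source data within the "p" tags, I managed to retrieve the data I need. Now I would like to iterate over the remaining data in the variable "table" (each row and each cell) to scrape the data into a list. Can anyone help me achieve this? I have read several other posts but could not apply them to my specific problem... Thanks.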

from bs4 import BeautifulSoup
import urllib2
url = "http://www.gks.ru/bgd/free/B00_25/IssWWW.exe/Stg/d000/000715.HTM"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), 'html.parser')
table=soup.findAll('p',text=True)
print(table)

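Best answer

Assuming you want the monthly price data, you need to find all tr elements in the table and skip the first 3 (the header rows). Note that html.parser did not work for me, but lxml did (see Differences between parsers):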

soup = BeautifulSoup(page, 'lxml')  # requires 'lxml' to be installed

table = soup.find("center").find("table")
for row in table.find_all("tr")[3:]:
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    print(cells)

Prints:

['January', '469,4', '15,0', '3,9']
['February', '479,8', '16,7', '2,2']
['March', '485,6', '16,9', '1,2']
['April', '487,8', '16,4', '0,5']
['May', '489,5', '15,8', '0,4']
['June', '490,5', '15,3', '0,2']
['July', '494,4', '15,6', '0,8']
['August', '496,1', '15,8', '0,4']
['September', '499,0', '15,7', '0,6']
['October', '502,7', '15,6', '0,7']
['November', '506,4', '15,0', '0,8']
['December', '', '', '']
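
Since the question asks for the data in a list, the printed rows can also be collected and converted to numbers. The following is only a minimal sketch, not part of the original answer: the helper names to_number and parse_row are hypothetical, the two sample rows are copied from the output above, and it assumes that every cell after the first is either empty or a number written with a decimal comma.

```python
def to_number(text):
    """Convert a decimal-comma string such as '469,4' to a float.

    Empty cells (as in the December row) become None. This is an
    assumption about how missing values should be represented.
    """
    text = text.strip()
    return float(text.replace(",", ".")) if text else None


def parse_row(cells):
    """Keep the month name as text and convert the remaining cells."""
    return [cells[0]] + [to_number(cell) for cell in cells[1:]]


# Two rows copied from the printed output above; in the real script these
# would be the `cells` lists built inside the loop over table rows.
raw_rows = [
    ['January', '469,4', '15,0', '3,9'],
    ['December', '', '', ''],
]

data = [parse_row(cells) for cells in raw_rows]
print(data)
```

In the loop from the answer, calling data.append(parse_row(cells)) instead of print(cells) would build the same list directly.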

Original question on Stack Overflow: https://stackoverflow.com/questions/34822633/
