Python - Web scraping an HTML table and printing to CSV

Tags: python html csv web-scraping beautifulsoup

I'm pretty new to Python, but I'm looking to build a web scraping tool that pulls data out of an HTML table online and writes it to a CSV file in the same layout.

Here is a sample of the HTML table (it is huge, so I am only including a few rows).

<div class="col-xs-12 tab-content">
        <div id="historical-data" class="tab-pane active">
          <div class="tab-header">
            <h2 class="pull-left bottom-margin-2x">Historical data for Bitcoin</h2>

            <div class="clear"></div>

            <div class="row">
              <div class="col-md-12">
                <div class="pull-left">
                  <small>Currency in USD</small>
                </div>
                <div id="reportrange" class="pull-right">
                    <i class="glyphicon glyphicon-calendar fa fa-calendar"></i>&nbsp;
                    <span>Aug 16, 2017 - Sep 15, 2017</span> <b class="caret"></b>
                </div>
              </div>
            </div>

            <table class="table">
              <thead>
              <tr>
                <th class="text-left">Date</th>
                <th class="text-right">Open</th>
                <th class="text-right">High</th>
                <th class="text-right">Low</th>
                <th class="text-right">Close</th>
                <th class="text-right">Volume</th>
                <th class="text-right">Market Cap</th>
              </tr>
              </thead>
              <tbody>

                <tr class="text-right">
                  <td class="text-left">Sep 14, 2017</td>
                  <td>3875.37</td>     
                  <td>3920.60</td>
                  <td>3153.86</td>
                  <td>3154.95</td>
                  <td>2,716,310,000</td>
                  <td>64,191,600,000</td>
                </tr>

                <tr class="text-right">
                  <td class="text-left">Sep 13, 2017</td>
                  <td>4131.98</td>     
                  <td>4131.98</td>
                  <td>3789.92</td>
                  <td>3882.59</td>
                  <td>2,219,410,000</td>
                  <td>68,432,200,000</td>
                </tr>

                <tr class="text-right">
                  <td class="text-left">Sep 12, 2017</td>
                  <td>4168.88</td>     
                  <td>4344.65</td>
                  <td>4085.22</td>
                  <td>4130.81</td>
                  <td>1,864,530,000</td>
                  <td>69,033,400,000</td>
                </tr>                
              </tbody>
            </table>
          </div>

        </div>
    </div>

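I'm especially interested in recreating the table with the same column headers it provides: "Date", "Open", "High", "Low", "Close", "Volume", "Market Cap". So far I've been able to write a simple script that goes to the URL, downloads the HTML, parses it with BeautifulSoup, and then uses "for" statements to get the td elements. Below is a sample of my code (URL omitted) and the result: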

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

url = "enterURLhere"
page = requests.get(url)
pagetext = page.text

pricetable = {
    "Date" : [],
    "Open" : [],
    "High" : [],
    "Low" : [],
    "Close" : [],
    "Volume" : [],
    "Market Cap" : []
}

soup = BeautifulSoup(pagetext, 'html.parser')

file = open("test.csv", 'w')

for row in soup.find_all('tr'):
    for col in row.find_all('td'):
        print(col.text)

sample output

Does anyone know how to at least reformat the data I pull so that it comes out as a table? Thanks.

Best answer

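Run the code below and you will get the data you want from that table. To try it out against this element, all you need to do is wrap the entire HTML element pasted above in html = ''' '''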

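import csv
from bs4 import BeautifulSoup

# `html` is the markup from the question, wrapped in html = ''' ... '''
tree = BeautifulSoup(html, "lxml")
table_tag = tree.select("table")[0]
# One inner list per <tr>, holding the text of each header or data cell
tab_data = [[item.text for item in row_data.select("th,td")]
            for row_data in table_tag.select("tr")]

# newline='' stops the csv module from adding blank lines on Windows;
# the with-block closes the file so every row is flushed to disk
with open("table_data.csv", "w", newline='') as outfile:
    writer = csv.writer(outfile)
    for data in tab_data:
        writer.writerow(data)
        print(' '.join(data))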
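
As a quick way to try the approach above without fetching anything, here is a minimal sketch. It assumes a shortened copy of the question's table (the header plus two rows), swaps in the built-in "html.parser" backend instead of lxml so no extra parser has to be installed, and writes to an in-memory buffer rather than a file. The helper name table_to_rows is made up for this illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of the question's table: header row plus two data rows.
html = '''
<table class="table">
  <thead>
    <tr><th>Date</th><th>Open</th><th>High</th><th>Low</th>
        <th>Close</th><th>Volume</th><th>Market Cap</th></tr>
  </thead>
  <tbody>
    <tr><td>Sep 14, 2017</td><td>3875.37</td><td>3920.60</td><td>3153.86</td>
        <td>3154.95</td><td>2,716,310,000</td><td>64,191,600,000</td></tr>
    <tr><td>Sep 13, 2017</td><td>4131.98</td><td>4131.98</td><td>3789.92</td>
        <td>3882.59</td><td>2,219,410,000</td><td>68,432,200,000</td></tr>
  </tbody>
</table>
'''


def table_to_rows(markup):
    """Return the first <table> in `markup` as a list of rows of cell text.

    Hypothetical helper for this sketch, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table")
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


rows = table_to_rows(html)

# Write to an in-memory buffer so the CSV text can be inspected directly.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

For a live page, the text returned by requests.get(url).text in the question can be passed to table_to_rows in place of html. Cells that contain commas, such as the dates and the volume figures, are quoted by csv.writer, so they stay in a single column.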

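I have also split the code into pieces to make it easier to follow. The list comprehension that builds tab_data is really a nested for loop. Here is the same thing written out step by step: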

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"lxml")
table = soup.find('table')

list_of_rows = []
for row in table.find_all('tr'):  # find_all is the current name for findAll
    list_of_cells = []
    for cell in row.find_all(["th","td"]):  # header cells and data cells
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    print(' '.join(item))

Result:

Date Open High Low Close Volume Market Cap
Sep 14, 2017 3875.37 3920.60 3153.86 3154.95 2,716,310,000 64,191,600,000
Sep 13, 2017 4131.98 4131.98 3789.92 3882.59 2,219,410,000 68,432,200,000
Sep 12, 2017 4168.88 4344.65 4085.22 4130.81 1,864,530,000 69,033,400,000
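
If the CSV is meant for further number crunching, the thousands separators in the Volume and Market Cap columns can get in the way. The following is a rough sketch, using only the standard library, of one way to strip them before writing. The rows are copied from the table in the question, and strip_thousands is a hypothetical helper name. The last step also regroups the rows by column, which matches the shape of the pricetable dict from the question.

```python
import csv
import io

# Rows in the shape produced by the loop above (values from the question's table).
list_of_rows = [
    ["Date", "Open", "High", "Low", "Close", "Volume", "Market Cap"],
    ["Sep 14, 2017", "3875.37", "3920.60", "3153.86", "3154.95",
     "2,716,310,000", "64,191,600,000"],
    ["Sep 13, 2017", "4131.98", "4131.98", "3789.92", "3882.59",
     "2,219,410,000", "68,432,200,000"],
]


def strip_thousands(cell):
    """Drop commas from a cell only when the result is a valid number."""
    candidate = cell.replace(",", "")
    try:
        float(candidate)
    except ValueError:
        return cell  # not numeric (a date, for example): leave it untouched
    return candidate


header, *body = list_of_rows
cleaned = [[strip_thousands(cell) for cell in row] for row in body]

# Write header and cleaned rows to an in-memory buffer for inspection.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(cleaned)
print(buffer.getvalue())

# Regroup by column: one list of values per header name.
pricetable = {name: list(column) for name, column in zip(header, zip(*cleaned))}
print(pricetable["Volume"])
```

The values are kept as strings on purpose, so prices such as 3920.60 are written exactly as they appear on the page. The dates still contain a comma and are therefore still quoted in the output.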

Regarding "Python - Web scraping an HTML table and printing to CSV", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46242664/
