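python - Iterating over a div table with BeautifulSoup

Tags: python web-scraping beautifulsoup

The div with class="tableBody" has many divs as children. I want to get all of its child divs and extract the strings I highlighted in the screenshot.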

import bs4 as bs
import urllib.request
source = urllib.request.urlopen("https://www.ungm.org/Public/Notice").read()
soup = bs.BeautifulSoup(source,'lxml')

t_body = soup.find("div", class_="tableBody")
t_divs = t_body.find_all("div")

The code above returns an empty list.

I am trying to learn BS4. I would appreciate any help with the code.

Best answer

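The data you see on the page is loaded dynamically by JavaScript, so it is not in the HTML that urllib downloads. You can use the requests module to reproduce the request the page makes.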
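To see why the question's code finds the container but no children, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for what the server sends before any JavaScript runs (the real markup is not shown in the question); only the `tableBody` class name comes from the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the static page: the container exists,
# but the rows have not been loaded into it yet.
static_html = """
<div id="notices">
  <div class="tableBody"></div>
</div>
"""

soup = BeautifulSoup(static_html, "html.parser")
t_body = soup.find("div", class_="tableBody")

# The container is found, but it has no child divs to iterate over.
t_divs = t_body.find_all("div")
print(t_body is not None)
print(len(t_divs))
```

If `t_body` were missing entirely, `t_body.find_all` would raise an `AttributeError` instead of returning an empty list, so an empty result suggests the container is present but unpopulated.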

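For example, the following posts the search payload to the Search endpoint and prints the rows it returns: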

import requests
from bs4 import BeautifulSoup


url = 'https://www.ungm.org/Public/Notice/Search'

payload = {
  "PageIndex": 0,
  "PageSize": 15,
  "Title": "",
  "Description": "",
  "Reference": "",
  "PublishedFrom": "",
  "PublishedTo": "12-Jul-2020",
  "DeadlineFrom": "12-Jul-2020",
  "DeadlineTo": "",
  "Countries": [],
  "Agencies": [],
  "UNSPSCs": [],
  "NoticeTypes": [],
  "SortField": "DatePublished",
  "SortAscending": False,
  "isPicker": False,
  "NoticeTASStatus": [],
  "IsSustainable": False,
  "NoticeDisplayType": None,
  "NoticeSearchTotalLabelId": "noticeSearchTotal",
  "TypeOfCompetitions": []
}

soup = BeautifulSoup( requests.post(url, json=payload).content, 'html.parser' )

for row in soup.select('.tableRow'):
    cells = [cell.get_text(strip=True) for cell in row.select('.tableCell')]
    print(cells[1])
    print('{:<30}{:<15}{:<15}{:<25}{:<45}{:<15}'.format(*cells[2:]))
    print('-'*80)

Prints:

Supply and delivery of 78 smartphones
13-Jul-2020 11:00 (GMT 2.00)  11-Jul-2020    FAO            Request for quotation    2020/FRMLW/FRMLW/106096                      Malawi         
--------------------------------------------------------------------------------
Supply of LEGUMES SEEDS for rainfed season
23-Jul-2020 14:00 (GMT 2.00)  11-Jul-2020    FAO            Invitation to bid        2020/FRMLW/FRMLW/106051                      Malawi         
--------------------------------------------------------------------------------
Supply of MAIZE SEEDS for rainfed season
22-Jul-2020 14:00 (GMT 2.00)  11-Jul-2020    FAO            Invitation to bid        2020/FRMLW/FRMLW/106050                      Malawi         
--------------------------------------------------------------------------------
Procurement of Supply and Installation of Outdoor Metal Furniture for Rooftop Terrace at FAO Headquarters in Rome, Italy
10-Aug-2020 12:00 (GMT 2.00)  11-Jul-2020    FAO            Invitation to bid        2020/CSAPC/CSDID/105286                      Italy          
--------------------------------------------------------------------------------
Procurement of Silo for Emergency Project
13-Jul-2020 13:00 (GMT 5.00)  11-Jul-2020    FAO            Invitation to bid        2020/FABGD/FABGD/106145                      Bangladesh     
--------------------------------------------------------------------------------
Procurement of Concentrate Ruminant Feed
13-Jul-2020 13:00 (GMT 5.00)  11-Jul-2020    FAO            Invitation to bid        2020/FABGD/FABGD/106064                      Bangladesh     
--------------------------------------------------------------------------------
Purchase of Waste Collection Vehicles - (Two Tractors)
22-Jul-2020 06:30 (GMT 0.00)  11-Jul-2020    UNOPS          Request for quotation    RFQ/2020/15298                               Sri Lanka      
--------------------------------------------------------------------------------
Procurement of Laboratory Equipment and Material
24-Jul-2020 22:23 (GMT -1.00) 11-Jul-2020    FAO            Invitation to bid        2020/FRGAM/FRGAM/106143                      Gambia         
--------------------------------------------------------------------------------
Compra de chalecos para promotores comunitarios para la Oficina de Unicef Bolivar - LRFQ-2020-9159352
16-Jul-2020 23:59 (GMT -3.00) 11-Jul-2020    UNICEF         Request for proposal     LRFQ-2020-9159352                            Venezuela      
--------------------------------------------------------------------------------
Call for Proposals Quality Based Fixed Budget (CFPFB):
26-Jul-2020 17:00 (GMT 3.00)  11-Jul-2020    UNDP           Request for proposal     UNDP-SYR-RPA-051-20                          Syrian Arab Republic
--------------------------------------------------------------------------------
Innovation and Design Specialist
27-Jul-2020 00:00 (GMT -5.00) 11-Jul-2020    UNDP           Not set                  Innovation and Design Specialist             Turkey         
--------------------------------------------------------------------------------
(RFI) from national and/or international CSOs/NGOs for potential partnership with UNDP and its pooled funding mechanism, the Darfur Community Peace and Stability Fund (DCPSF),
26-Jul-2020 08:00 (GMT -7.00) 11-Jul-2020    UNDP           Request for information  RFI-SDN-20-002                               Sudan          
--------------------------------------------------------------------------------
IRAQ-LRPS-017-2020-9159660 Rehabilitation of 3 water projects at Avrek, Grey Basi and Sarsenk in Duhok
26-Jul-2020 12:00 (GMT 3.00)  11-Jul-2020    UNICEF         Request for proposal     9159660                                      Iraq           
--------------------------------------------------------------------------------
106142 INVITACIÓN A COTIZAR PARA LA ADQUISICIÓN DE FERTILIZANTES, HERRAMIENTAS Y MATERIALES PARA ECA DE CACAO
21-Jul-2020 22:00 (GMT -5.00) 10-Jul-2020    FAO            Request for quotation    2020/FLCOL/FLCOL/106142                      Colombia       
--------------------------------------------------------------------------------
Achat de tablettes, de GPS et batteries rechargeable (206 tablettes, 68 GPS, et 181 pack chargeurs et batteries rechargeables) à livrer sur  Dakar
28-Jul-2020 12:00 (GMT 0.00)  10-Jul-2020    FAO            Invitation to bid        2020/FRSEN/FRSEN/106093                      United Kingdom 
--------------------------------------------------------------------------------
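The row-parsing step can be tried without touching the network. The sketch below runs the same selectors on a hand-written fragment. The fragment itself is an assumption: the class names and the `data-noticeid` attribute follow the answer's code, the cell values are copied from the first row of the output above, and the notice id `12345`, the empty first cell, and the dictionary keys are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical response fragment: one row with eight cells, matching the
# indices the answer uses (cells[1] is the title, cells[7] the country).
sample_html = """
<div class="tableRow" data-noticeid="12345">
  <div class="tableCell"></div>
  <div class="tableCell"> Supply and delivery of 78 smartphones </div>
  <div class="tableCell">13-Jul-2020 11:00 (GMT 2.00)</div>
  <div class="tableCell">11-Jul-2020</div>
  <div class="tableCell">FAO</div>
  <div class="tableCell">Request for quotation</div>
  <div class="tableCell">2020/FRMLW/FRMLW/106096</div>
  <div class="tableCell">Malawi</div>
</div>
"""

# Illustrative names for cells[1:], guessed from the printed output.
FIELDS = ["title", "deadline", "published", "agency", "type", "reference", "country"]

soup = BeautifulSoup(sample_html, "html.parser")
records = []
for row in soup.select(".tableRow"):
    # get_text(strip=True) removes the whitespace around each cell's text.
    cells = [cell.get_text(strip=True) for cell in row.select(".tableCell")]
    record = dict(zip(FIELDS, cells[1:]))
    record["notice_id"] = row["data-noticeid"]
    records.append(record)

print(records[0]["title"])
print(records[0]["country"])
```

Naming the fields once makes later filtering (for example by country) less dependent on remembering positional indices.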

EDIT: To get all pages, keep only the notices for the country "Afghanistan", and save them to CSV, you can use this example:

import csv
import requests
from bs4 import BeautifulSoup


url = 'https://www.ungm.org/Public/Notice/Search'

payload = {
  "PageIndex": 0,
  "PageSize": 15,
  "Title": "",
  "Description": "",
  "Reference": "",
  "PublishedFrom": "",
  "PublishedTo": "12-Jul-2020",
  "DeadlineFrom": "12-Jul-2020",
  "DeadlineTo": "",
  "Countries": [],
  "Agencies": [],
  "UNSPSCs": [],
  "NoticeTypes": [],
  "SortField": "DatePublished",
  "SortAscending": False,
  "isPicker": False,
  "NoticeTASStatus": [],
  "IsSustainable": False,
  "NoticeDisplayType": None,
  "NoticeSearchTotalLabelId": "noticeSearchTotal",
  "TypeOfCompetitions": []
}

page, all_data = 0, []
while True:
    print('Page {}...'.format(page))

    payload['PageIndex'] = page
    soup = BeautifulSoup( requests.post(url, json=payload).content, 'html.parser' )
    rows = soup.select('.tableRow')
    if not rows:
        break

    for row in rows:
        cells = [cell.get_text(strip=True) for cell in row.select('.tableCell')]
        print(cells[1])
        print('{:<30}{:<15}{:<15}{:<25}{:<45}{:<15}'.format(*cells[2:]))
        print('-'*80)

        # we are only interested in Afghanistan:
        if 'afghanistan' in cells[7].lower():
            all_data.append([row['data-noticeid'], *cells[1:]])

    page += 1

# write to csv file:
with open('data.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        csv_writer.writerow(row)

The matching rows are saved to data.csv, which can be opened in a spreadsheet program such as LibreOffice.
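The pagination logic (request page 0, 1, 2, ... and stop at the first page without rows) can be checked offline by passing the page fetcher in as a function. This is only a sketch: `collect_rows`, `make_row`, and `fake_fetch` are hypothetical helpers, and the fake pages are invented data shaped like the eight-cell rows the answer's code expects.

```python
from bs4 import BeautifulSoup


def collect_rows(fetch_page, country):
    """Read pages 0, 1, 2, ... until one contains no .tableRow elements."""
    matches, page = [], 0
    while True:
        soup = BeautifulSoup(fetch_page(page), "html.parser")
        rows = soup.select(".tableRow")
        if not rows:  # an empty page marks the end of the results
            break
        for row in rows:
            cells = [c.get_text(strip=True) for c in row.select(".tableCell")]
            # cells[7] is the country column, as in the answer's code
            if country.lower() in cells[7].lower():
                matches.append([row["data-noticeid"], *cells[1:]])
        page += 1
    return matches, page


def make_row(notice_id, title, country):
    """Build a hypothetical 8-cell row; only title and country are filled in."""
    cells = ["", title, "", "", "", "", "", country]
    inner = "".join('<div class="tableCell">{}</div>'.format(c) for c in cells)
    return '<div class="tableRow" data-noticeid="{}">{}</div>'.format(notice_id, inner)


# Two fake result pages followed by a page without any rows.
fake_pages = [
    make_row("1", "Notice A", "Afghanistan") + make_row("2", "Notice B", "Malawi"),
    make_row("3", "Notice C", "Afghanistan"),
    '<div class="tableBody"></div>',
]


def fake_fetch(page):
    return fake_pages[page]


matches, pages_read = collect_rows(fake_fetch, "Afghanistan")
print(pages_read)
print([m[0] for m in matches])
```

Against the real site, the fetcher could be something like `lambda page: requests.post(url, json={**payload, "PageIndex": page}).content`, reusing the `url` and `payload` from the answer.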

Source: the Stack Overflow question "python - Iterating over a div table with BeautifulSoup": https://stackoverflow.com/questions/62857309/
