Python: BeautifulSoup extract all h1 text from a div class

Tags: python web-scraping beautifulsoup

from requests import get
from bs4 import BeautifulSoup

res = get('https://www.ceda.com.au/Events/Upcoming-events')
soup = BeautifulSoup(res.text,"lxml")


event_location = '\n'.join([' '.join(item.find_parent().select("span")[0].text.split()) for item in soup.select(".side-list .icon-map-marker")])
print(event_location)


event_date = '\n'.join([' '.join(item.find_parent().select("span")[0].text.split()) for item in soup.select(".side-list .icon-calendar")])
print(event_date)


event_name = '\n'.join([' '.join(item.find_parent().select("class")[0].text.split()) for item in soup.select(".event-detail-bx .h1")])
print(event_name)

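I am trying to extract the event date, location, and event name from the website. I have successfully retrieved the event date, the event hyperlink, and the location.

However, I have not been able to extract the event names. Can someone help me extract all the event names and the hyperlink for each event?

Best answer

I think you may want to try the following, which fetches all the data in a slightly more organized way: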

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'https://www.ceda.com.au/Events/Upcoming-events'
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")

for items in soup.select(".list-bx"):
    # "h1" is a tag selector here; ".h1" would look for class="h1" instead
    event_name = ''.join([item.text for item in items.select(".event-detail-bx a h1")])
    # Resolve the href against the page URL to get an absolute link
    event_links = urljoin(url,''.join([item['href'] for item in items.select(".event-detail-bx a")]))
    # The speaker text is the node that directly follows the first <h3>
    speaker_info = items.select(".sub-content-txt h3")[0].next_sibling.strip()
    # split() + join() collapses runs of whitespace in the date and location
    event_date = ''.join([' '.join(item.find_parent().select("span")[0].text.split()) for item in items.select(".icon-calendar")])
    event_location = ''.join([' '.join(item.find_parent().select("span")[0].text.split()) for item in items.select(".icon-map-marker")])
    print("Name: {}\nLink: {}\nSpeaker: {}\nDate: {}\nLocation: {}\n".format(event_name,event_links,speaker_info,event_date,event_location))

Partial output:

Name: 2018 Trustee welcome back
Link: https://www.ceda.com.au/Events/Library/Q180124
Speaker: Melinda Cilento, Chief Executive, CEDA
Date: 24/01/2018
Location: Brisbane Convention and Exhibition Centre

Name: NSW Trustee welcome back 2018
Link: https://www.ceda.com.au/Events/Library/N180130
Speaker: Luke Foley MP, NSW Opposition Leader, Parliament of NSW
Date: 30/01/2018
Location: Shangri-La Hotel
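
For anyone wondering why the selector in the question returned nothing: in CSS selector syntax `.h1` matches elements whose class attribute is `h1`, while `h1` matches `<h1>` tags, and `select("class")` looks for a tag literally named `<class>`. The sketch below illustrates the difference offline. The `SAMPLE_HTML` markup is a simplified, hypothetical stand-in for the page (the real structure may differ), and the `events` list of dictionaries is just one possible way to hold the results instead of printing them.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the live page may be structured differently.
SAMPLE_HTML = """
<div class="list-bx">
  <div class="event-detail-bx">
    <a href="/Events/Library/Q180124"><h1>2018 Trustee welcome back</h1></a>
  </div>
</div>
<div class="list-bx">
  <div class="event-detail-bx">
    <a href="/Events/Library/N180130"><h1>NSW Trustee welcome back 2018</h1></a>
  </div>
</div>
"""

BASE_URL = "https://www.ceda.com.au/Events/Upcoming-events"

# html.parser ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# ".h1" is a class selector: nothing in the sample has class="h1".
by_class = soup.select(".event-detail-bx .h1")

# "h1" is a tag selector: it matches the <h1> headings themselves.
by_tag = soup.select(".event-detail-bx h1")

events = []
for heading in by_tag:
    # Each heading is wrapped in the <a> that carries the event link.
    link = heading.find_parent("a")
    events.append({
        "name": heading.get_text(strip=True),
        "link": urljoin(BASE_URL, link["href"]) if link else None,
    })

print(len(by_class), len(by_tag))
for event in events:
    print(event["name"], event["link"])
```

Keeping one dictionary per event makes it straightforward to write the results to CSV or JSON later, and the `if link else None` guard avoids an exception when a heading has no enclosing link.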

Regarding "Python: BeautifulSoup extract all h1 text from a div class", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47961214/
