python-3.x - Unable to return web scraping output as a dictionary

Tags: python-3.x dictionary web-scraping beautifulsoup python-requests

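I am trying to scrape a website for its staff roster, and I want the end result to be a dictionary in the format {staff: position}. I am currently stuck with each staff name and position being returned as a separate string. It is hard to post the output clearly, but it is basically a list of names followed by a list of positions, so the first name in the list pairs with the first position, and so on. I have determined that each name and position is a class 'bs4.element.Tag'. I believe I need to collect the names and the positions into one list each, then use zip to combine the elements into a dictionary. I have tried implementing this, but nothing has worked so far. Using the class_ parameter, the narrowest I can get to the text I need is the individual div that contains the p. I am still inexperienced with Python and new to web scraping, but I am fairly comfortable with HTML and CSS, so any help would be greatly appreciated.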

# Simple script attempting to scrape 
# the staff roster off of the 
# Greenville Drive website

import requests
from bs4 import BeautifulSoup

URL = 'https://www.milb.com/greenville/ballpark/frontoffice'

page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

staff = soup.find_all('div', class_='l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-3 l-grid__col--lg-3 l-grid__col--xl-3')

# Use a distinct loop variable so the result list is not shadowed
for staff_div in staff:
    data = staff_div.find('p')
    if data:
        print(data.text.strip())

position = soup.find_all('div', class_='l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-6 l-grid__col--lg-6 l-grid__col--xl-6')

# Same pattern for the position divs
for position_div in position:
    data = position_div.find('p')
    if data:
        print(data.text.strip())

# This code so far provides the needed data, but need it in a dict()

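Best answer

BeautifulSoup has find_next(), which returns the first tag after the current one (in document order) that matches the given filters. Find each "staff" div and use find_next() to get the adjacent "position" div.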
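To see how find_next() pairs elements without touching the network, here is a minimal sketch on a made-up inline snippet. The class names `name` and `role` and the two-row markup are placeholders chosen for illustration, not the real page's structure; only the two sample names are taken from the roster.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each name div is followed by its role div.
html = """
<div class="row">
  <div class="name"><p>Craig Brown</p></div>
  <div class="role"><p>Owner/Team President</p></div>
</div>
<div class="row">
  <div class="name"><p>Eric Jarinko</p></div>
  <div class="role"><p>General Manager</p></div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
roster = {}

for name_div in soup.find_all('div', class_='name'):
    # find_next() scans forward from name_div and returns the first
    # matching tag, or None if nothing matches.
    role_div = name_div.find_next('div', class_='role')
    if role_div is not None:
        roster[name_div.get_text(strip=True)] = role_div.get_text(strip=True)

print(roster)
# {'Craig Brown': 'Owner/Team President', 'Eric Jarinko': 'General Manager'}
```

The script for the real page follows the same pattern, only with the page's long class strings.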

import requests
from bs4 import BeautifulSoup

URL = 'https://www.milb.com/greenville/ballpark/frontoffice'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# Full class strings of the divs that hold staff names and positions
staff_class = 'l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-3 l-grid__col--lg-3 l-grid__col--xl-3'
position_class = 'l-grid__col l-grid__col--xs-12 l-grid__col--sm-4 l-grid__col--md-6 l-grid__col--lg-6 l-grid__col--xl-6'
result = {}

for staff in soup.find_all('div', class_=staff_class):
    data = staff.find('p')
    if data:
        staff_name = data.text.strip()
        # The first matching div after the name div holds the position
        position_div = staff.find_next('div', class_=position_class)
        # find_next() returns None when no matching div follows
        if position_div:
            result[staff_name] = position_div.text.strip()

print(result)

Output

{'Craig Brown': 'Owner/Team President', 'Eric Jarinko': 'General Manager', 'Nate Lipscomb': 'Special Advisor to the President', 'Phil Bargardi': 'Vice President of Sales', 'Jeff Brown': 'Vice President of Marketing', 'Greg Burgess, CSFM': 'Vice President of Operations/Grounds', 'Jordan Smith': 'Vice President of Finance', 'Ned Kennedy': 'Director of Inside Sales', 'Patrick Innes': 'Director of Ticket Operations', 'Micah Gold': 'Senior Account Executive', 'Molly Mains': 'Senior Account Executive', 'Houghton Flanagan': 'Account Executive', 'Jeb Maloney': 'Account Executive', 'Olivia Adams': 'Inside Sales Representative', 'Tyler Melson': 'Inside Sales Representative', 'Toby Sandblom': 'Inside Sales Representative', 'Katie Batista': 'Director of Sponsorships and Community Engagement', 'Matthew Tezza': 'Sponsor Services and Activations Manager', 'Melissa Welch': 'Sponsorship and Community Events Manager', 'Beth Rusch': 'Director of West End Events', 'Kristin Kipper': 'Events Manager', 'Grant Witham': 'Events Manager', 'Alex Guest': 'Director of Game Entertainment & Production', 'Lance Fowler': 'Director of Video Production', 'Davis Simpson': 'Director of Media and Creative Services', 'Cameron White': 'Media Relations Manager', 'Ed Jenson': 'Broadcaster', 'Adam Baird': 'Accountant', 'Mike Agostino': 'Director of Food and Beverage', 'Roger Campana': 'Assistant Director of Food and Beverage', 'Wilbert Sauceda': 'Executive Chef', 'Elise Parish': 'Premium Services Manager', 'Timmy Hinds': 'Director of Facility Operations', 'Zack Pagans': 'Assistant Groundskeeper', 'Amanda Medlin': 'Business and Team Operations Manager', 'Allison Roedell': 'Office Manager'}
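The zip idea from the question can also work once the names and positions have been collected into two plain lists of strings. The sketch below assumes both lists are in the same order and the same length; `pair_names_with_positions` is a hypothetical helper name, and the sample values are the first three entries of the output above.

```python
def pair_names_with_positions(names, positions):
    """Build {name: position} from two parallel lists (hypothetical helper)."""
    # zip() silently stops at the shorter list, which would hide a
    # missing entry, so check the lengths explicitly first.
    if len(names) != len(positions):
        raise ValueError(
            f'{len(names)} names but {len(positions)} positions'
        )
    return dict(zip(names, positions))


names = ['Craig Brown', 'Eric Jarinko', 'Nate Lipscomb']
positions = [
    'Owner/Team President',
    'General Manager',
    'Special Advisor to the President',
]

roster = pair_names_with_positions(names, positions)
print(roster)
```

Compared with find_next(), this approach depends on every name having exactly one position in the same order. If one entry on the page lacks a position, every later pair shifts, which is why pairing each name div with its own neighbor is usually the safer choice.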

Regarding "python-3.x - Unable to return web scraping output as a dictionary", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59867297/
