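python - Iterating through a list with BeautifulSoup

Tags: python list dictionary web-scraping beautifulsoup

I am using BeautifulSoup4 to build a JSON-formatted list containing the "title", "company", "location", "date posted", and "link" from a public LinkedIn job search. The formatting is the way I want it, but the result only lists one of the job postings on the page, and I would like to iterate over every job on the page in the same format.

For example, this is what I am trying to achieve: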

[{'title': 'Job 1', 'company': 'company 1.', 'location': 'sunny side, California', 'date posted': '2 weeks ago', 'link': 'example1.com'}]

[{'title': 'Job 2', 'company': 'company 2.', 'location': 'runny side, California', 'date posted': '2 days ago', 'link': 'example2.com'}]

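I tried changing the five contents.find calls (title, company, location, date posted, and link) to contents.findAll, but that returns everything at once instead of grouping the values per job the way I am trying to achieve.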

from bs4 import BeautifulSoup
import requests

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


def search_website(url):
    # Search HTML Page
    result = requests.get(url)
    content = result.content

    soup = BeautifulSoup(content, 'html.parser')

    # Job List
    jobs = []

    for contents in soup.find_all('body'):
        # Title
        title = contents.find('h3', attrs={'class': 'result-card__title job-result-card__title'})
        formatted_title = strip_tags(str(title))

        # Company
        company = contents.find('h4', attrs={'class': 'result-card__subtitle job-result-card__subtitle'})
        formatted_company = strip_tags(str(company))

        # Location
        location = contents.find('span', attrs={'class': 'job-result-card__location'})
        formatted_location = strip_tags(str(location))

        # Date Posted
        posted = contents.find('time', attrs={'class': 'job-result-card__listdate'})
        formatted_posted = strip_tags(str(posted))

        # Apply Link
        links = contents.find('a', attrs={'class': 'result-card__full-card-link'})
        formatted_link = (links.get('href'))

        # Add a new compiled job to our dict
        jobs.append({'title': formatted_title,
                     'company': formatted_company,
                     'location': formatted_location,
                     'date posted': formatted_posted,
                     'link': formatted_link
                     })

    # Return our jobs
    return jobs


link = ("https://www.linkedin.com/jobs/search/currentJobId=1396095018&distance=25&f_E=3%2C4&f_LF=f_AL&geoId=102250832&keywords=software%20engineer&location=Mountain%20View%2C%20California%2C%20United%20States")


print(search_website(link))
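
The snippet above calls MLStripper without defining it, so it raises a NameError as posted. A minimal sketch of what such a class might look like, assuming it is the commonly shared HTMLParser-based tag stripper (the class body below is an assumption, not the asker's actual code):

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Assumed helper: keeps only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fragments = []

    def handle_data(self, data):
        # Called by the parser for every run of text between tags.
        self.fragments.append(data)

    def get_data(self):
        return "".join(self.fragments)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()


stripped = strip_tags('<h3 class="result-card__title">Job 1</h3>')
print(stripped)
```

Note that strip_tags(str(tag)) turns a missing tag (None) into the literal string "None", which can hide a failed lookup.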

I would like the output to look like:

[{'title': 'x', 'company': 'x', 'location': 'x', 'date posted': 'x', 'link': 'x'}] [{'title': 'x', 'company': 'x', 'location': 'x', 'date posted': 'x', 'link': 'x'}] +..

When I switch to findAll, the output is:

[{'title': 'x''x''x''x''x', 'company': 'x''x''x''x''x', 'location': 'x''x''x''x', 'date posted': 'x''x''x''x', 'link': 'x''x''x''x'}]
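
The difference between the two calls can be reproduced on a small inline document. This is a minimal sketch, assuming a made-up two-posting snippet (only the class names come from the question; the surrounding markup is hypothetical). find returns the first matching tag, while find_all returns every match under the tag it is called on, which is why searching from body lumps all postings together:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; only the class names are from the question.
html = """
<body>
  <li><h3 class="result-card__title job-result-card__title">Job 1</h3>
      <h4 class="result-card__subtitle job-result-card__subtitle">Company 1</h4></li>
  <li><h3 class="result-card__title job-result-card__title">Job 2</h3>
      <h4 class="result-card__subtitle job-result-card__subtitle">Company 2</h4></li>
</body>
"""

soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

# find() stops at the first match inside <body>.
first_title = body.find("h3", class_="result-card__title")
# find_all() collects every match inside <body>, across all postings.
all_titles = body.find_all("h3", class_="result-card__title")

first_title_text = first_title.get_text(strip=True)
all_title_texts = [t.get_text(strip=True) for t in all_titles]

print(first_title_text)
print(all_title_texts)
```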

Best answer

This is a simplified version of your code, but it should help you:

from bs4 import BeautifulSoup as bs
import requests

result = requests.get('https://www.linkedin.com/jobs/search/?distance=25&f_E=2%2C3&f_JT=F&f_LF=f_AL&geoId=102250832&keywords=software%20engineer&location=Mountain%20View%2C%20California%2C%20United%20States')

soup = bs(result.content, 'html.parser')

# Job list
jobs = []

# A page has a single <body> and find() returns only the first match,
# so this loop collects one job.
for contents in soup.find_all('body'):
    # Title
    title = contents.find('h3', attrs={'class': 'result-card__title job-result-card__title'})

    # Company
    company = contents.find('h4', attrs={'class': 'result-card__subtitle job-result-card__subtitle'})

    # Location
    location = contents.find('span', attrs={'class': 'job-result-card__location'})

    # Date Posted
    posted = contents.find('time', attrs={'class': 'job-result-card__listdate'})

    # Apply Link
    link = contents.find('a', attrs={'class': 'result-card__full-card-link'})

    # Add the compiled job to our list; .text replaces the strip_tags helper
    jobs.append({'title': title.text,
                 'company': company.text,
                 'location': location.text,
                 'date posted': posted.text,
                 'link': link.get('href')
                 })

# Print each collected job on its own line
for job in jobs:
    print(job)

Output:

{'title': 'Systems Software Engineer - Controls', 'company': 'Blue River Technology', 'location': 'Sunnyvale, California', 'date posted': '1 day ago', 'link': 'https://www.linkedin.com/jobs/view/systems-software-engineer-controls-at-blue-river-technology-1380882942?position=1&pageNum=0&trk=guest_job_search_job-result-card_result-card_full-click'}
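
To get one dictionary per posting, the loop above can iterate over each job card instead of over body, and run find relative to that card. This is a rough sketch, not a verified scraper: it assumes each posting is wrapped in an li element carrying the class result-card (inferred from the result-card__ prefix of the class names, not confirmed against the live page), and parse_jobs, text_of, and sample_html are hypothetical names introduced for illustration:

```python
from bs4 import BeautifulSoup


def text_of(tag):
    """Return the tag's stripped text, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_jobs(html):
    """Build one dict per job card instead of one dict per page."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    # Assumption: every posting is an <li> whose class list includes "result-card".
    for card in soup.find_all("li", class_="result-card"):
        link = card.find("a", class_="result-card__full-card-link")
        jobs.append({
            "title": text_of(card.find("h3", class_="result-card__title")),
            "company": text_of(card.find("h4", class_="result-card__subtitle")),
            "location": text_of(card.find("span", class_="job-result-card__location")),
            "date posted": text_of(card.find("time", class_="job-result-card__listdate")),
            "link": link.get("href") if link is not None else None,
        })
    return jobs


# Hypothetical sample markup standing in for result.content.
sample_html = """
<ul>
  <li class="result-card job-result-card">
    <a class="result-card__full-card-link" href="https://example1.com"></a>
    <h3 class="result-card__title job-result-card__title">Job 1</h3>
    <h4 class="result-card__subtitle job-result-card__subtitle">Company 1</h4>
    <span class="job-result-card__location">Sunny Side, California</span>
    <time class="job-result-card__listdate">2 weeks ago</time>
  </li>
  <li class="result-card job-result-card">
    <a class="result-card__full-card-link" href="https://example2.com"></a>
    <h3 class="result-card__title job-result-card__title">Job 2</h3>
    <h4 class="result-card__subtitle job-result-card__subtitle">Company 2</h4>
    <span class="job-result-card__location">Runny Side, California</span>
  </li>
</ul>
"""

jobs = parse_jobs(sample_html)
for job in jobs:
    print(job)
```

Because every lookup is scoped to a single card, a field missing from one posting (the second card has no time element here) becomes None for that posting only and does not shift values between jobs. With a real page, parse_jobs(result.content) would replace the sample string, provided the wrapper assumption holds.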

Regarding "python - Iterating through a list with BeautifulSoup", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57229575/
