python - Site scraping with beautifulsoup

Tags: python web-scraping beautifulsoup urllib

I want to scrape a website for job-description information, but I seem to get only unrelated text. Here is how the soup object is created:

import urllib.request
from urllib.request import urlopen

import bs4

url = 'https://www.glassdoor.com/Job/boston-full-stack-engineer-jobs-SRCH_IL.0,6_IC1154532_KO7,26.htm?jl=3188635682&guid=0000016a8432102e99e9b5232325d3d5&pos=102&src=GD_JOB_AD&srs=MY_JOBS&s=58&ao=599212'
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
soup = bs4.BeautifulSoup(urlopen(req), 'html.parser')
divliul = soup.body.findAll(['div', 'li', 'ul'])
for i in divliul:
    if i.string is not None:
        print(i.string)
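One thing worth noting about the loop above: BeautifulSoup's `.string` is `None` whenever a tag has more than one child, so even text that *is* present in the soup gets skipped by the `if i.string is not None` filter. A minimal, self-contained illustration (the markup here is made up, not taken from the real page):

```python
import bs4

# Hypothetical markup, not from the real page: a <div> with two children
# and an <li> with a single text child.
html = "<div><span>Full Stack Engineer</span><span>Boston</span></div><li>Jobs</li>"
demo_soup = bs4.BeautifulSoup(html, "html.parser")

for tag in demo_soup.find_all(["div", "li"]):
    # .string is None for the <div> (two children), "Jobs" for the <li>
    print(tag.name, repr(tag.string))
```

So a tag that wraps the job description in nested elements would print nothing here even if it were in the HTML; `get_text()` would be the safer way to dump all text.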

If you look at the site for a second, you'll see that the soup seems to contain only the elements from the left rail, and nothing from the job-description container. I thought it might be a urllib request problem, but I tried just downloading the HTML file and reading it that way, and the result was similar. Output:

Jobs
Company Reviews
Company Reviews
Companies near you
 Best Buy Reviews in Boston
 Target Reviews in Boston
 IBM Reviews in Boston
 AT&T Reviews in Boston
 The Home Depot Reviews in Boston
 Walmart Reviews in Boston

 Macy's Reviews in Boston
 Microsoft Reviews in Boston
 Deloitte Reviews in Boston
 Amazon Reviews in Boston
 Bank of America Reviews in Boston
 Wells Fargo Reviews in Boston
Company Culture
 Best Places to Work
 12 Companies That Will Pay You to Travel the World
 7 Types of Companies You Should Never Work For
 20 Companies Hiring for the Best Jobs In America
 How to Become the Candidate Recruiters Can’t Resist
 13 Companies With Enviable Work From Home Options
 New On Glassdoor
Salaries
Interviews
Salary Calculator
Account Settings
Account Settings
Account Settings
Account Settings
empty notification btn
My Profile
Saved Jobs
Email & Alerts
Contributions
My Resumes
Company Follows
Account
Help / Contact Us
Account Settings
Account Settings
Account Settings
empty notification btn
For Employers
For Employers
Unlock Employer Account
Unlock Employer Account
Post a Job
Post a Job
Employer Branding
Job Advertising
Employer Blog
Talk to Sales
 Post Jobs Free
Full Stack Engineer Jobs in Boston, MA
Jobs
Companies
Salaries
Interviews
Full Stack Engineer
EASY APPLY
EASY APPLY
Full Stack Engineer | Noodle.com
EASY APPLY
EASY APPLY
Full Stack Engineer
Hot
Software Engineer
EASY APPLY
EASY APPLY
Senior Software Engineer
EASY APPLY
EASY APPLY
We're Hiring
We're Hiring
Full Stack Engineer
Hot
Software Engineer
Hot
Hot
Full Stack Engineer
We're Hiring
Full Stack Software Engineer
EASY APPLY
EASY APPLY
We're Hiring
We're Hiring
Software Engineer
New
New
Full Stack Engineer
EASY APPLY
EASY APPLY
We're Hiring
We're Hiring
Pre-Sales Engineer / Full-Stack Developer
Top Company
Top Company
Full Stack Software Engineer
Software Engineer
Top Company
Top Company
Associate Software Engineer
Full Stack Software Engineer
Software Engineer
New
New
Mid-level Full Stack Software Engineer (Java/React
EASY APPLY
EASY APPLY
Junior Software Engineer - Infrastructure
Software Engineer
Software Engineer
New
New
Associate Software Engineer
C# Engineer - Full Stack
EASY APPLY
EASY APPLY
Software Engineer, Platform
Software Engineer
EASY APPLY
EASY APPLY
Software Engineer
Associate Software Engineer
Software Engineer
Software Engineer
Software Engineer - Features
EASY APPLY
EASY APPLY
 Page 1 of 81
Previous
1
2
3
4
5
Next
 People Also Searched
 Top Cities for Full Stack Engineer:  
 Top Companies for full stack engineer in Boston, MA:  
 Help / Contact Us
 Terms of Use
 Privacy & Cookies (New)
Copyright © 2008–2019, Glassdoor, Inc. "Glassdoor" and logo are proprietary trademarks of Glassdoor, Inc.
 Email me jobs for:
Create a Job Alert
Your job alert has been created.
Create more job alerts for related jobs with one click:

Best answer

You can extract some IDs from the page and concatenate them into a URL that the page itself uses to retrieve the JSON that populates the cards on the right as you scroll. Process that JSON to extract whatever information you want.

Finding the URL - as you scroll down on the left, the content on the right updates, so I watched the network tab for activity associated with those updates. When I saw the new URLs generated during scrolling, they appeared to share a common string with varying parts, i.e. likely a query-string format. I guessed the varying parts came from the page (some looked like generated IDs that could be kept static/ignored - an experience-based assumption I tested). I then searched the HTML for what I expected to be the important identifiers distinguishing jobs to the server, i.e. the two sets of IDs. Take either of the two IDs that are concatenated into the URL string from the network tab and press Ctrl+F to search for them in the page HTML; you will see where those values come from.
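To make the two-ID idea concrete, here is an offline sketch that applies the same two extraction patterns to a tiny hand-written fragment. The markup and the numbers are invented for illustration; only the regexes mirror the ones used in the answer below:

```python
import re

# Hypothetical fragment mimicking the page source: one id lives in a
# data-ad-order-id attribute, the other in an embedded 'jobIds' JS array.
sample_page = """
<div data-ad-order-id="599212"></div>
<div data-ad-order-id="599213"></div>
<script>var o = {'jobIds':[3188635682, 3188635683], 'segmentType': 'x'};</script>
"""

# First set of ids: pulled from the attribute.
ad_ids = re.findall(r'data-ad-order-id="(\d+)"', sample_page)

# Second set: the text between jobIds':[ and 'segmentType', then every
# 10-digit run inside it.
init = re.compile(r"jobIds':\[(.*)'segmentType'", re.DOTALL).findall(sample_page)[0]
job_ids = re.findall(r"(\d{10})", init)

pairs = list(zip(ad_ids, job_ids))
print(pairs)  # [('599212', '3188635682'), ('599213', '3188635683')]
```

Each pair is then enough to fill the two placeholders in the JSON endpoint's URL template.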

from bs4 import BeautifulSoup as bs
import requests
import re

results = []
with requests.Session() as s:
    # Template for the JSON endpoint; the two {} slots take the
    # data-ad-order-id and the job listing id.
    url = 'https://www.glassdoor.co.uk/Job/json/details.htm?pos=&ao={}&s=58&guid=0000016a88f962649d396c5b606d567b&src=GD_JOB_AD&t=SR&extid=1&exst=OL&ist=&ast=OL&vt=w&slr=true&cs=1_1d8f42ad&cb=1557076206569&jobListingId={}&gdToken=uo8hehXn6nNuwhjMyBW14w:3RBFWgOD-0e7hK8o-Fgo0bUtD6jw5wJ3UujVq6L-v0ux9mlLjMxjW8-KF9xsDk41j7I11QHOHgcj9LBoWYaCxg:wAFOqHzOjgAxIGQVmbyibsaECrQO-HWfxb8Ugq-x_tU'
    headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://www.glassdoor.co.uk/Job/boston-full-stack-engineer-jobs-SRCH_IL.0,6_IC1154532_KO7,26.htm?jl=3188635682&s=58&pos=102&src=GD_JOB_AD&srs=MY_JOBS&guid=0000016a8432102e99e9b5232325d3d5&ao=599212&countryRedirect=true', headers=headers)
    soup = bs(r.content, 'lxml')
    # First set of ids: the data-ad-order-id attributes in the listing markup.
    ids = [item['data-ad-order-id'] for item in soup.select('[data-ad-order-id]')]
    # Second set: the jobIds array embedded in the page's JavaScript.
    p1 = re.compile(r"jobIds':\[(.*)'segmentType'", re.DOTALL)
    init = p1.findall(r.text)[0]
    p2 = re.compile(r"(\d{10})")
    job_ids = p2.findall(init)
    loop_var = list(zip(ids, job_ids))

    # Fill the two placeholders and fetch one JSON payload per job.
    for x, y in loop_var:
        data = s.get(url.format(x, y), headers=headers).json()
        results.append(data)
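The answer does not show the structure of the returned JSON, so any post-processing step depends on inspecting one payload first. Purely as an illustration, assuming a hypothetical shape with a top-level "job" object (the key names below are made up, not Glassdoor's actual schema):

```python
# Hypothetical payload shape, for illustration only - inspect a real
# response from `results` before relying on any key names.
sample = {
    "job": {
        "jobTitleText": "Full Stack Engineer",
        "description": "<p>Build APIs</p>",
    }
}

def extract(data):
    """Pull a few fields from one payload, tolerating missing keys."""
    job = data.get("job", {})
    return {
        "title": job.get("jobTitleText"),
        "description": job.get("description"),
    }

print(extract(sample))
```

Running `extract` over each element of `results` would then give a flat list of records ready for a DataFrame or CSV, once the real key names are substituted in.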

Regarding python - site scraping with beautifulsoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55994255/
