python - Unable to scrape

Tags: python html beautifulsoup scrape


I am trying to get the list of companies from AngelList, https://angel.co/companies

I tried this code:

from bs4 import BeautifulSoup
import urllib2

headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('https://angel.co/companies', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, "html.parser")
p1 = soup.find_all('div' , {"class"," dc59 frw44 _a _jm"})
print p1

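But this returns an empty string.

I have run into similar questions where some people say to update beautifulsoup and others say to change the parser. Nothing has worked for me.

Best answer

By getting the params from https://angel.co/company_filters/search_data, you can get the HTML for all the company info without selenium: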

import requests
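from bs4 import BeautifulSoup

# post url
js = "https://angel.co/company_filters/search_data"

# X-Requested-With is important
headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

# get url; the placeholders are, in order: ids, total, page, hexdigest
u = "https://angel.co/companies/startups?ids%5B%5D={}&total={}&page={}&sort=signal&new=false&hexdigest={}"
with requests.Session() as s:
    # the POST returns JSON holding the ids, total, page and hexdigest used by the GET
    params = s.post(js, data={"sort": "signal"}, headers=headers).json()
    # pass the arguments in the same order as the placeholders in u
    companies = s.get(u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"], params["hexdigest"]), headers=headers)
    # the "html" field of the JSON response holds the company markup
    soup = BeautifulSoup(companies.json()["html"], "html.parser")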
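
The query string above is assembled by hand, joining the ids with "&ids%5B%5D=". As a possible alternative, here is a minimal sketch that lets the standard library's urlencode do the encoding. The helper name build_startups_url is hypothetical, and the values in example are made up just to show the shape of the result; real values would come from the JSON returned by the POST.

```python
from urllib.parse import urlencode


def build_startups_url(params, base="https://angel.co/companies/startups"):
    """Build the GET URL from the JSON returned by the search_data POST."""
    # One ("ids[]", value) pair per company id, in the order received.
    query = [("ids[]", company_id) for company_id in params["ids"]]
    query += [
        ("total", params["total"]),
        ("page", params["page"]),
        ("sort", "signal"),
        ("new", "false"),
        ("hexdigest", params["hexdigest"]),
    ]
    # urlencode percent-encodes "ids[]" as "ids%5B%5D" and repeats it per id.
    return base + "?" + urlencode(query)


# Made-up values, only to illustrate the resulting URL.
example = {"ids": [275696, 275832], "total": 2, "page": 1, "hexdigest": "abc123"}
print(build_startups_url(example))
```

Naming each parameter next to its value also makes it harder to pass total and page in the wrong order.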

You can pass the page number as you iterate to simulate "load more":

import requests
from bs4 import BeautifulSoup
import time

# post url
js = "https://angel.co/company_filters/search_data"

# X-Requested-With is important
headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}


# get url
u = "https://angel.co/companies/startups?ids%5B%5D={}&total={}&page={}&sort=signal&new=false&hexdigest={}"


def get_next_pages(js, u, start_page=1):
    with requests.Session() as s:
        params = s.post(js, data={"sort": "signal","page":start_page}, headers=headers).json()
        companies = s.get(
            # argument order matches the placeholders in u: ids, total, page, hexdigest
            u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"], params["hexdigest"]),
            headers=headers)
        soup = BeautifulSoup(companies.json()["html"], "html.parser")
        comps = soup.select("div.company.column")
        yield comps
        while True:
            # increment page count from previous.
            page = params["page"] + 1
            params = s.post(js, data={"sort": "signal", "page": page}, headers=headers).json()
            # keep going until we have reached the maximum queries
            if "ids" not in params:
                break
            # argument order matches the placeholders in u: ids, total, page, hexdigest
            companies = s.get(u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"],
                                       params["hexdigest"]),
                              headers=headers)
            soup = BeautifulSoup(companies.json()["html"], "html.parser")
            comps = soup.select("div.company.column")
            # don't hammer with requests
            time.sleep(.3)
            yield comps

for comps in get_next_pages(js, u):
    print(comps)
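
The generator above fetches the first page and the later pages with two copies of the same logic. One possible restructuring, sketched here, folds both into a single loop. It relies on the same stop condition the code above uses (no "ids" key in the response). fetch_params and fake_fetch are hypothetical names: the first stands in for s.post(js, data={"sort": "signal", "page": page}, headers=headers).json(), and the second is an offline stub so the sketch runs without network access. Unlike the code above, which derives the next page from params["page"] + 1, this sketch simply counts pages itself.

```python
from itertools import count


def iter_pages(fetch_params, start_page=1, max_pages=None):
    """Yield the params dict for each page until the server stops returning ids.

    fetch_params is a hypothetical callable mapping a page number to a dict.
    """
    for n, page in enumerate(count(start_page)):
        # Optional safety cap so a misbehaving endpoint cannot loop forever.
        if max_pages is not None and n >= max_pages:
            break
        params = fetch_params(page)
        # Same stop condition as the generator above: no "ids" means no more results.
        if "ids" not in params:
            break
        yield params


def fake_fetch(page):
    """Offline stand-in for the endpoint: three pages of results, then none."""
    if page > 3:
        return {"page": page}
    return {"ids": [page * 10, page * 10 + 1], "page": page}


pages = list(iter_pages(fake_fetch))
print([p["page"] for p in pages])
```

In real use, the body of the loop would also issue the GET, parse the returned HTML, and sleep briefly between requests, as the code above does.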

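If we look at the network output in the developer tools, we can see the POST data sent each time we load more. This continues until we reach the limit.

A snippet of the output from running the code above: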

[<div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies" title="Dunwello"><img alt="Dunwello" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/275696-99335faecd2fb01467c98d5032f23cf6-thumb_jpg.jpg?buster=1393099676"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies">Dunwello</a>
</div>
<div class="pitch">
Trustworthy recommendations of individual professionals.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="275832" data-type="Startup" href="https://angel.co/groupahead?utm_source=companies" title="GroupAhead"><img alt="GroupAhead" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/275832-3541a563250008bd3f7f9b4d7fe9c33c-thumb_jpg.jpg?buster=1423077576"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="275832" data-type="Startup" href="https://angel.co/groupahead?utm_source=companies">GroupAhead</a>
</div>
<div class="pitch">
Dedicated apps for groups
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="431492" data-type="Startup" href="https://angel.co/workpop?utm_source=companies" title="Workpop"><img alt="Workpop" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/431492-c1b857e30254da60f3847d5358db5c82-thumb_jpg.jpg?buster=1404420060"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="431492" data-type="Startup" href="https://angel.co/workpop?utm_source=companies">Workpop</a>
</div>
<div class="pitch">
When can you start?
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="446358" data-type="Startup" href="https://angel.co/late-stage-pre-ipo-syndicate?utm_source=companies" title="Late Stage Pre-IPO @ Flight.vc"><img alt="Late Stage Pre-IPO @ Flight.vc" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/446358-3511ab7edb5192dad97cbccf2b67ddd7-thumb_jpg.jpg?buster=1428089778"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="446358" data-type="Startup" href="https://angel.co/late-stage-pre-ipo-syndicate?utm_source=companies">Late Stage Pre-IPO @ Flight.vc</a>
</div>
<div class="pitch">
Syndicated:  Beepi, Zirx, Boost Media, Rent the Runway, Life 360, Scripted
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="450451" data-type="Startup" href="https://angel.co/complex-polygon?utm_source=companies" title="Complex Polygon"><img alt="Complex Polygon" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/450451-4f00fd11b2d54533a5bac3cfa72acb1e-thumb_jpg.jpg?buster=1407937645"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="450451" data-type="Startup" href="https://angel.co/complex-polygon?utm_source=companies">Complex Polygon</a>
</div>
<div class="pitch">
Product studio based in San Francisco, California. 
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="457068" data-type="Startup" href="https://angel.co/21?utm_source=companies" title="21"><img alt="21" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/457068-2e7b8c417c3a70aab3026f5f0ca3d8e9-thumb_jpg.jpg?buster=1425975133"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="457068" data-type="Startup" href="https://angel.co/21?utm_source=companies">21</a>
</div>
<div class="pitch">
A bitcoin miner in every device and in every hand.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="460720" data-type="Startup" href="https://angel.co/parenthoods?utm_source=companies" title="Parenthoods"><img alt="Parenthoods" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/460720-25bc7ca7afd4f7bf0fd7842cafa1bdd1-thumb_jpg.jpg?buster=1425426951"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="460720" data-type="Startup" href="https://angel.co/parenthoods?utm_source=companies">Parenthoods</a>
</div>
<div class="pitch">
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="462906" data-type="Startup" href="https://angel.co/seed-8?utm_source=companies" title="Seed"><img alt="Seed" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/462906-f6b439e20a9d36b9e2d3792da92d160d-thumb_jpg.jpg?buster=1462318689"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="462906" data-type="Startup" href="https://angel.co/seed-8?utm_source=companies">Seed</a>
</div>
<div class="pitch">
Online Business Banking
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="470102" data-type="Startup" href="https://angel.co/zen99?utm_source=companies" title="Zen99"><img alt="Zen99" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/470102-67da791cec4374a1046c53fe99b6f05f-thumb_jpg.jpg?buster=1410560341"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="470102" data-type="Startup" href="https://angel.co/zen99?utm_source=companies">Zen99</a>
</div>
<div class="pitch">
Finance and insurance tools for freelancers
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="488240" data-type="Startup" href="https://angel.co/maven-ventures-growth-labs?utm_source=companies" title="Maven Ventures Growth Labs"><img alt="Maven Ventures Growth Labs" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/488240-d467860829cac8b1a9fbfa2d14e05789-thumb_jpg.jpg?buster=1411577330"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="488240" data-type="Startup" href="https://angel.co/maven-ventures-growth-labs?utm_source=companies">Maven Ventures Growth Labs</a>
</div>
<div class="pitch">
Get a option to invest up to $500k in the best Maven grads
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="507975" data-type="Startup" href="https://angel.co/skydio?utm_source=companies" title="Skydio"><img alt="Skydio" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/507975-aac9786d6c4cba99be634b7bc1969cf3-thumb_jpg.jpg?buster=1420952326"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="507975" data-type="Startup" href="https://angel.co/skydio?utm_source=companies">Skydio</a>
</div>
<div class="pitch">
MIT, Google[x]ers with deep prior experience doing intelligent navigation for drones
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="517240" data-type="Startup" href="https://angel.co/fin-tech-syndicate?utm_source=companies" title="Fin Tech by Flight.vc"><img alt="Fin Tech by Flight.vc" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/517240-5bc50eb42d1e40a8ad437c6bd164a5a8-thumb_jpg.jpg?buster=1414004533"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="517240" data-type="Startup" href="https://angel.co/fin-tech-syndicate?utm_source=companies">Fin Tech by Flight.vc</a>
</div>
<div class="pitch">
Investing in Financial Services and Fin-Tech that has proprietary advantages
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="521452" data-type="Startup" href="https://angel.co/channel-app?utm_source=companies" title="Channel"><img alt="Channel" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/521452-b6bc15ef040fdf37d885aea71ecad3bb-thumb_jpg.jpg?buster=1446676191"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="521452" data-type="Startup" href="https://angel.co/channel-app?utm_source=companies">Channel</a>
</div>
<div class="pitch">
Watch the world.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="443932" data-type="Startup" href="https://angel.co/healthsherpa?utm_source=companies" title="HealthSherpa"><img alt="HealthSherpa" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/443932-63c6bcbbf9ba36a7fa3e532177222c9b-thumb_jpg.jpg?buster=1462374897"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="443932" data-type="Startup" href="https://angel.co/healthsherpa?utm_source=companies">HealthSherpa</a>
</div>
<div class="pitch">
Next-generation Healthcare.gov
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="558206" data-type="Startup" href="https://angel.co/sidewire?utm_source=companies" title="Sidewire"><img alt="Sidewire" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/558206-b416bf8347c7f766b5ea1cf79123c4d2-thumb_jpg.jpg?buster=1444189112"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="558206" data-type="Startup" href="https://angel.co/sidewire?utm_source=companies">Sidewire</a>
</div>
<div class="pitch">
Where Experts Chat in Public
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="570055" data-type="Startup" href="https://angel.co/brainchild-1?utm_source=companies" title="Brainchild &amp;amp; Co."><img alt="Brainchild &amp; Co." class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/570055-cc2c2309fefa21e3ebda6229d6a0b890-thumb_jpg.jpg?buster=1420474118"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="570055" data-type="Startup" href="https://angel.co/brainchild-1?utm_source=companies">Brainchild &amp; Co.</a>
</div>
<div class="pitch">
Building services and products for consumers
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="571060" data-type="Startup" href="https://angel.co/signatures-capital?utm_source=companies" title="Signatures Capital"><img alt="Signatures Capital" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/571060-8a077d7cbac9cc7e2d81859adb8cd1c6-thumb_jpg.jpg?buster=1420664121"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="571060" data-type="Startup" href="https://angel.co/signatures-capital?utm_source=companies">Signatures Capital</a>
</div>
<div class="pitch">
Supporting founders committed to inventing the future.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="623000" data-type="Startup" href="https://angel.co/airtable?utm_source=companies" title="Airtable"><img alt="Airtable" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/623000-9d210a39051abc7accec1dc686888dcc-thumb_jpg.jpg?buster=1449952044"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="623000" data-type="Startup" href="https://angel.co/airtable?utm_source=companies">Airtable</a>
</div>
<div class="pitch">
Organize anything you can imagine
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="630861" data-type="Startup" href="https://angel.co/meerkat?utm_source=companies" title="Meerkat"><img alt="Meerkat" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/630861-820b9d4af09e110b150c9affe418d860-thumb_jpg.jpg?buster=1425688408"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="630861" data-type="Startup" href="https://angel.co/meerkat?utm_source=companies">Meerkat</a>
</div>
<div class="pitch">
Live Stream Video.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="658877" data-type="Startup" href="https://angel.co/flight-vc-syndicate?utm_source=companies" title="Flight Ventures"><img alt="Flight Ventures" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/658877-89ccd88502db9d964a651ecba6f86d9d-thumb_jpg.jpg?buster=1457552637"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="658877" data-type="Startup" href="https://angel.co/flight-vc-syndicate?utm_source=companies">Flight Ventures</a>
</div>
<div class="pitch">
Investing in the Top Companies and Entrepreneurs
</div>
</div>
</div>
</div>]
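
Each item in that output is still a raw tag. As a minimal sketch of how the fields could be pulled out, assuming the markup keeps the layout shown above (div.name holding an a.startup-link, followed by a div.pitch), something like the following might work. The helper name extract_companies is hypothetical, and the sample HTML is one block copied from the output with the photo div omitted.

```python
from bs4 import BeautifulSoup

# One company block copied from the output above (photo div omitted).
html = """
<div class="company column">
<div class="g-lockup">
<div class="text">
<div class="name">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies">Dunwello</a>
</div>
<div class="pitch">
Trustworthy recommendations of individual professionals.
</div>
</div>
</div>
</div>
"""


def extract_companies(soup):
    """Return a list of dicts with id, name, url and pitch for each company div."""
    rows = []
    for comp in soup.select("div.company.column"):
        link = comp.select_one("div.name a.startup-link")
        pitch = comp.select_one("div.pitch")
        if link is None:
            continue  # skip blocks that do not match the expected layout
        rows.append({
            "id": link.get("data-id"),
            "name": link.get_text(strip=True),
            "url": link.get("href"),
            # Some companies have an empty pitch div (Parenthoods in the output above).
            "pitch": pitch.get_text(strip=True) if pitch else "",
        })
    return rows


companies = extract_companies(BeautifulSoup(html, "html.parser"))
print(companies)
```

The same function could be applied to the soup built from companies.json()["html"] in the code above.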

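There are many more filters and other options you can set. To see how they are sent, select them in your browser and watch the requests under the XHR tab of the network panel in Firebug or the developer tools.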

Regarding python - Unable to scrape, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37269187/
