python - Unable to scrape a GraphQL page using requests

Tags: python python-3.x web-scraping beautifulsoup graphql

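I'm trying to scrape company names and their corresponding links from a web page using the requests module.

Although the content is highly dynamic, I noticed that it is available inside the braces next to window.props.

So I thought I'd dig that part out and process it with json, but I see \u0022 sequences instead of quote characters ("). This is what I mean: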

{\u0022firms\u0022: [{\u0022index\u0022: 1, \u0022slug\u0022: \u0022zjjz\u002Datelier\u0022, \u0022name\u0022:

I've tried:

import re
import json
import requests
from bs4 import BeautifulSoup

link = 'https://architizer.com/firms/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    r = s.get(link)
    items = re.findall(r'window.props[^"]+(.*?);',r.text)[0].strip('"').replace('\u0022', '\'')
    print(items)

How can I scrape the names and links of the different firms, across multiple pages of that site, using requests?

Best answer

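Well, that was fun.

You're dealing with a page powered by GraphQL, so you have to mimic its requests correctly.

The site also expects you to send a Referer header along with a CSRF token. The token can easily be dug out of the initial HTML and reused in subsequent requests.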
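The token step can be tried on its own, without touching the network. Below is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup. The sample markup and token value are made up, and the names CsrfTokenParser, extract_csrf_token and build_headers are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser
from typing import Optional


class CsrfTokenParser(HTMLParser):
    """Remembers the value of the first <input name="csrfmiddlewaretoken">."""

    def __init__(self) -> None:
        super().__init__()
        self.token: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only look at <input> tags, and stop once a token has been found.
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "csrfmiddlewaretoken":
            self.token = attributes.get("value")


def extract_csrf_token(html: str) -> Optional[str]:
    """Return the token, or None when the page has no such input."""
    parser = CsrfTokenParser()
    parser.feed(html)
    return parser.token


def build_headers(token: str, referer: str) -> dict:
    """Headers to merge into the session before calling the GraphQL endpoint."""
    return {"referer": referer, "x-csrftoken": token}


# Made-up markup standing in for the real page (assumption).
sample_html = (
    '<form><input type="hidden" name="csrfmiddlewaretoken" value="abc123"></form>'
)
token = extract_csrf_token(sample_html)
headers = build_headers(token, "https://architizer.com/firms/")
print(headers)
```

Returning None when the input is missing makes it easier to notice a changed page layout than an exception raised from indexing a missing tag.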

Here's my take:

import time

import requests
from bs4 import BeautifulSoup

link = 'https://architizer.com/firms/'
query = """{ allFirmsWithProjects( first: 6, after: "6", firmType: "Architecture / Design Firm", firmName: "All Firm Names", projectType: "All Project Types", projectLocation: "All Project Locations", firmLocation: "All Firm Locations", orderBy: "recently-featured", affiliationSlug: "", ) { firms: edges { cursor node { index id: firmId slug: firmSlug name: firmName projectsCount: firmProjectsCount lastProjectDate: firmLastProjectDate media: firmLogoUrl projects { edges { node { slug: slug media: heroUrl mediaId: heroId isHiddenFromListings } } } } } pageInfo { hasNextPage endCursor } totalCount } }"""


def query_graphql(page_number: int = 6) -> dict:
    # Swap the placeholder cursor in the query for the requested one.
    q = query.replace('after: "6"', f'after: "{page_number}"')
    return s.post(
        "https://architizer.com/api/v3.0/graphql",
        json={"query": q},
    ).json()


def has_next_page(graphql_response: dict) -> bool:
    return graphql_response["data"]["allFirmsWithProjects"]["pageInfo"]["hasNextPage"]


def get_next_page(graphql_response: dict) -> int:
    return graphql_response["data"]["allFirmsWithProjects"]["pageInfo"]["endCursor"]


def get_firms_data(graphql_response: dict) -> list:
    return graphql_response["data"]["allFirmsWithProjects"]["firms"]


def parse_firms_data(firms: list) -> str:
    return "\n".join(firm["node"]["name"] for firm in firms)


def wait_a_bit(wait_for: float = 1.5):
    time.sleep(wait_for)


with requests.Session() as s:
    s.headers["user-agent"] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
    s.headers["referer"] = "https://architizer.com/firms/"

    csrf_token = BeautifulSoup(
        s.get(link).text, "html.parser"
    ).find("input", {"name": "csrfmiddlewaretoken"})["value"]

    s.headers.update({"x-csrftoken": csrf_token})

    response = query_graphql()
    while True:
        # Print the current page first so the final page is not skipped.
        print(parse_firms_data(get_firms_data(response)))
        if not has_next_page(response):
            break
        wait_a_bit()
        response = query_graphql(get_next_page(response))
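
The cursor-following logic can also be checked offline. The sketch below is only an illustration: iter_pages and fake_fetch are hypothetical names, and FAKE_PAGES is made-up data shaped like the response fields used in this answer. It yields every page, including the last one, and collects each firm's name together with its slug, which is the piece a link would be built from (the exact URL pattern is not confirmed here).

```python
from typing import Callable, Iterator, Optional


def iter_pages(
    fetch: Callable[[Optional[str]], dict], start: Optional[str] = None
) -> Iterator[dict]:
    """Yield each response, following endCursor until hasNextPage is false."""
    cursor = start
    while True:
        response = fetch(cursor)
        yield response  # yield before checking, so the last page is included
        page_info = response["data"]["allFirmsWithProjects"]["pageInfo"]
        if not page_info["hasNextPage"]:
            break
        cursor = page_info["endCursor"]


# Made-up two-page "API" standing in for the real endpoint (assumption).
FAKE_PAGES = {
    None: {"data": {"allFirmsWithProjects": {
        "firms": [{"node": {"name": "Firm A", "slug": "firm-a"}}],
        "pageInfo": {"hasNextPage": True, "endCursor": "1"},
    }}},
    "1": {"data": {"allFirmsWithProjects": {
        "firms": [{"node": {"name": "Firm B", "slug": "firm-b"}}],
        "pageInfo": {"hasNextPage": False, "endCursor": "2"},
    }}},
}


def fake_fetch(cursor: Optional[str]) -> dict:
    """Look up a canned response instead of posting to the network."""
    return FAKE_PAGES[cursor]


firms = [
    (firm["node"]["name"], firm["node"]["slug"])
    for page in iter_pages(fake_fetch)
    for firm in page["data"]["allFirmsWithProjects"]["firms"]
]
print(firms)
```

Passing the fetch function in as a parameter keeps the loop independent of the session, so the same generator could wrap a real request function with a delay between calls.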

As an example, the full script should output the firm names:

Brooks + Scarpa Architects
Studio Saxe
NiMa Design
Best Practice Architecture
Gensler
Inca Hernandez
kaa studio
Taller Sintesis
Coryn Kempster and Julia Jamrozik
Franklin Azzi Architecture
Wittman Estes
Masfernandez Arquitectos
MATIAS LOPEZ LLOVET
SRG Partnership, Inc.
GANA Arquitectura
Meyer & Associates Architects, Urban Designers
Steyn Studio
BGLA architecture | urban design

and so on ...
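
On the \u0022 sequences from the question: in Python source, '\u0022' is just the quote character itself, so the replace call there never matches the six-character escape text in the page. If you still want to read the embedded data, one option is to treat the captured text as the body of a JSON string, unescape it, and then parse the result. This is a minimal sketch under the assumption that the captured text is a complete object. The sample string is a shortened, made-up fragment modeled on the one in the question, and decode_escaped_json is a hypothetical helper name.

```python
import json


def decode_escaped_json(raw: str):
    """Unescape \\uXXXX sequences, then parse the result as JSON."""
    # First pass: wrap the text in quotes so json treats it as a string
    # literal and resolves escapes such as \u0022 and \u002D.
    text = json.loads(f'"{raw}"')
    # Second pass: the unescaped text is ordinary JSON.
    return json.loads(text)


# Shortened, made-up sample (assumption), kept raw so the backslashes survive.
sample = (
    r"{\u0022firms\u0022: [{\u0022index\u0022: 1, "
    r"\u0022slug\u0022: \u0022zjjz\u002Datelier\u0022}]}"
)
data = decode_escaped_json(sample)
print(data["firms"][0]["slug"])
```

This only works when the captured text contains no unescaped double quotes, which holds for the fragment shown in the question but is not guaranteed for every page.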

Regarding python - Unable to scrape a GraphQL page using requests, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67294232/
