python - How do I extract descriptions from Google search results with Python?

Tags: python html google-search

I want to extract the descriptions from a Google search. Right now I have this code:

from urllib.parse import urlparse, parse_qs

from lxml.html import fromstring
from requests import get


url = 'https://www.google.com/search?q=Gotham'
raw = get(url).text
pg = fromstring(raw)
for result in pg.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        # Google wraps result links as /url?q=<target>; pull out the target.
        url = parse_qs(urlparse(url).query)['q'][0]
    print(url)

This extracts the URLs related to the search. How can I extract the description that appears below each URL?

Best answer

You can scrape the Google search descriptions with the BeautifulSoup web scraping library.

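To collect information from all pages, you can paginate with a while True loop. The loop is infinite by itself; in our case it keeps going as long as the page contains a button that switches to the next page, and exits once that button is gone, i.e. once the CSS selector ".d6cvqb a[id=pnnext]" matches nothing: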

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
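
The exit condition can be tried out without touching the network. This is a minimal sketch, not the answer's scraper: fake_fetch_page is a hypothetical stand-in for one search request, and its has_next flag plays the role of the "next page" button check.

```python
def fake_fetch_page(start, total=25, page_size=10):
    """Hypothetical stand-in for one search request.

    Returns the items on this page and whether a "next" button would be present.
    """
    items = list(range(start, min(start + page_size, total)))
    has_next = start + page_size < total
    return items, has_next


def collect_all(fetch_page, page_size=10):
    """Keep requesting pages until the fetcher reports no next page."""
    params = {"start": 0}
    collected = []
    while True:
        items, has_next = fetch_page(params["start"])
        collected.extend(items)
        if has_next:
            params["start"] += page_size  # move the offset to the next page
        else:
            break
    return collected


all_items = collect_all(fake_fetch_page)
print(len(all_items))  # 25
```

Keeping the offset in params["start"] mirrors how the real request parameters are advanced by 10 per page.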

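You can find all the information you need (description, title, and so on) with CSS selectors, which are easy to identify on the page using the SelectorGadget Chrome extension (it does not always work perfectly if the site is rendered via JavaScript).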

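Make sure you send a user-agent request header so that the request looks like a visit from a "real" user. The default requests user-agent is python-requests, so the website can tell that a script is most likely sending the request. You can check what your user-agent is.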
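
To see what the query parameters and the custom header turn into, here is a minimal sketch that uses the standard library's urllib instead of requests, so that it can build and inspect a request without sending anything. The parameter values are the ones used in this answer; treating urllib as a stand-in for requests is an assumption made only for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "gotham", "hl": "en", "gl": "us", "start": 0}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

# Build the request object only; nothing is sent over the network here.
request = Request("https://www.google.com/search?" + urlencode(params), headers=headers)

print(request.full_url)
# urllib stores header names capitalized, so the key is "User-agent".
print(request.get_header("User-agent"))
```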

Check the code in the online IDE.

from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "gotham",       # query
    "hl": "en",          # language
    "gl": "us",          # country of the search, US -> USA
    "start": 0,          # result offset: 0 is the first page, 10 the second, and so on
    #"num": 100          # maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

page_num = 0

website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")
        
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')
    
    for result in soup.select(".tF2Cxc"):
        website_name = result.select_one(".yuRUbf a")["href"]
        # Some results have no snippet; fall back to None in that case.
        description_tag = result.select_one(".lEBKkf")
        description = description_tag.text if description_tag else None

        website_data.append({
            "website_name": website_name,
            "description": description
        })

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "website_name": "https://www.imdb.com/title/tt3749900/",
    "description": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
  },
  {
    "website_name": "https://www.netflix.com/watch/80023082",
    "description": "When the key witness in a homicide ends up dead while being held for questioning, Gordon suspects an inside job and seeks details from an old friend."
  },
  {
    "website_name": "https://www.gothamknightsgame.com/",
    "description": "Gotham Knights is an open-world, action RPG set in the most dynamic and interactive Gotham City yet. In either solo-play or with one other hero, ..."
  },
  # ...
]

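Alternatively, you can use the Google Search Engine Results API from SerpApi. It is a paid API with a free plan. The difference is that it bypasses blocks from Google (including CAPTCHA), so you do not need to build and maintain a parser.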

Code example:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
  "api_key": os.getenv("API_KEY"), # serpapi key
  "engine": "google",              # serpapi parser engine
  "q": "gotham",                   # search query
  "num": "100"                     # number of results per page (100 per page in this case)
  # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)      # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()    # JSON -> Python dictionary
    
    page_num += 1
    
    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet")   
        })
    
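    # Follow SerpApi's next_link, if any, by copying its query parameters into the search.
    pagination = results.get("serpapi_pagination", {})
    if "next_link" in pagination:
        next_query = urlsplit(pagination["next_link"]).query
        search.params_dict.update(dict(parse_qsl(next_query)))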
    else:
        break
    
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Output:

[
  {
    "title": "Gotham (TV Series 2014–2019) - IMDb",
    "snippet": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
  },
  {
    "title": "Gotham (TV series) - Wikipedia",
    "snippet": "Gotham is an American superhero crime drama television series developed by Bruno Heller, produced by Warner Bros. Television and based on characters from ..."
  },
  # ...
]
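
The pagination step in the SerpApi example turns the query string of next_link back into a parameter dictionary. A minimal sketch of just that conversion, using a made-up next_link value for illustration (a real response will contain a different URL):

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical next_link value, for illustration only.
next_link = "https://serpapi.com/search.json?engine=google&num=100&q=gotham&start=100"

# Split off the query string and turn its key=value pairs into a dict.
next_params = dict(parse_qsl(urlsplit(next_link).query))
print(next_params)
# {'engine': 'google', 'num': '100', 'q': 'gotham', 'start': '100'}
```

Note that parse_qsl returns every value as a string, which is why "start" comes back as '100' rather than the integer 100.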

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/46641941/
