我想从谷歌搜索中提取描述, 现在我有这段代码:
from urlparse import urlparse, parse_qs
import urllib
from lxml.html import fromstring
from requests import get
url='https://www.google.com/search?q=Gotham'
raw = get(url).text
pg = fromstring(raw)
v=[]
for result in pg.cssselect(".r a"):
url = result.get("href")
if url.startswith("/url?"):
url = parse_qs(urlparse(url).query)['q']
print url[0]
提取与搜索相关的url,如何提取出现在url下的描述?
最佳答案
您可以使用 BeautifulSoup
抓取 Google 搜索描述网站网络抓取库。
要从所有页面收集信息,您可以使用带有 while True
循环的“分页”。 while 循环是一个无限循环,在我们的例子中,退出是出现一个切换到下一页的按钮,即 CSS 选择器“.d6cvqb a[id=pnnext]”:
if soup.select_one('.d6cvqb a[id=pnnext]'):
params["start"] += 10
else:
break
您可以使用 CSS 选择器搜索来查找您需要的所有信息(描述、标题等),这些信息可以使用 SelectorGadget 在页面上轻松识别。 Chrome 扩展程序(如果网站是通过 JavaScript 呈现的,则不一定能完美运行)。
确保您使用的是 request headers user-agent
充当“真实”用户访问。因为默认的 requests
user-agent
是 python-requests
并且网站知道它很可能是发送请求的脚本。 Check what's your user-agent
.
检查 online IDE 中的代码.
from bs4 import BeautifulSoup
import requests, json, lxml
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": "gotham", # query
"hl": "en", # language
"gl": "us", # country of the search, US -> USA
"start": 0, # number page by default up to 0
#"num": 100 # parameter defines the maximum number of results to return.
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
page_num = 0
website_data = []
while True:
page_num += 1
print(f"page: {page_num}")
html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select(".tF2Cxc"):
website_name = result.select_one(".yuRUbf a")["href"]
try:
description = result.select_one(".lEBKkf").text
except:
description = None
website_data.append({
"website_name": website_name,
"description": description
})
if soup.select_one('.d6cvqb a[id=pnnext]'):
params["start"] += 10
else:
break
print(json.dumps(website_data, indent=2, ensure_ascii=False))
示例输出:
[
{
"website_name": "https://www.imdb.com/title/tt3749900/",
"description": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
},
{
"website_name": "https://www.netflix.com/watch/80023082",
"description": "When the key witness in a homicide ends up dead while being held for questioning, Gordon suspects an inside job and seeks details from an old friend."
},
{
"website_name": "https://www.gothamknightsgame.com/",
"description": "Gotham Knights is an open-world, action RPG set in the most dynamic and interactive Gotham City yet. In either solo-play or with one other hero, ..."
},
# ...
]
或者你也可以使用Google Search Engine Results API来自 SerpApi。它是带有免费计划的付费 API。 不同之处在于它将绕过来自 Google 的 block (包括 CAPTCHA),无需创建解析器和维护它。
代码示例:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os
params = {
"api_key": os.getenv("API_KEY"), # serpapi key
"engine": "google", # serpapi parser engine
"q": "gotham", # search query
"num": "100" # number of results per page (100 per page in this case)
# other search parameters: https://serpapi.com/search-api#api-parameters
}
search = GoogleSearch(params) # where data extraction happens
organic_results_data = []
page_num = 0
while True:
results = search.get_dict() # JSON -> Python dictionary
page_num += 1
for result in results["organic_results"]:
organic_results_data.append({
"title": result.get("title"),
"snippet": result.get("snippet")
})
if "next_link" in results.get("serpapi_pagination", []):
search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
else:
break
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
输出:
[
{
"title": "Gotham (TV Series 2014–2019) - IMDb",
"snippet": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
},
{
"title": "Gotham (TV series) - Wikipedia",
"snippet": "Gotham is an American superhero crime drama television series developed by Bruno Heller, produced by Warner Bros. Television and based on characters from ..."
},
# ...
]
关于python - 如何使用 python 在谷歌搜索中提取描述?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/46641941/