python - Need to accept the privacy policy to access a page

Tags: python web-scraping python-requests

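I am trying to scrape information from AllRecipes.co.uk, but when the code runs I am not taken to the expected page. Instead, I am sent to an interstitial page that asks me to accept the privacy policy first. This means I cannot scrape the pages I want, because every page I request returns this privacy-policy interstitial.

The website is AllRecipes.co.uk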

import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver
import numpy as np
import os


userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
headers = {
        'user-agent': userAgent
    }

dishType = "main-recipes"
url = 'http://allrecipes.co.uk/recipes/' + dishType + '.aspx?page='
#endPage = 1259
endPage = 3
for i in range(2, endPage):
    delays = [5, 7, 9, 11, 13, 15]
    delay = np.random.choice(delays)
    time.sleep(delay)
    print("Getting request " + str(i))
    r = requests.get(url + str(i))
    soup = BeautifulSoup(r.content, "html.parser")
    print(soup)
    #names = soup.findAll('div', attrs = {'class' : "col-sm-7"})
    #for name in names:
    #    print(name)

Best answer

You only need to set the euConsentId cookie:

In [1]: import requests

In [2]: from bs4 import BeautifulSoup

In [3]: url = "http://allrecipes.co.uk/recipes/main-recipes.aspx?page=2"

In [4]: BeautifulSoup(requests.get(url).content, "html.parser").title.get_text()
Out[4]: 'About your privacy on this site'

In [5]: import uuid

In [6]: BeautifulSoup(requests.get(url, cookies={'euConsentId': str(uuid.uuid4())}).content, "html.parser").title.get_text()
Out[6]: 'Main course recipes - All recipes UK '
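The title check used in the session above can be wrapped in a small helper, so a scraper can notice when it has been served the consent page instead of real content. This is only a sketch: the function name `is_consent_page` and the constant `CONSENT_TITLE` are hypothetical, and the title string is the one observed in `Out[4]`, which the site may change at any time.

```python
from bs4 import BeautifulSoup

# Title of the interstitial as observed in the session above.
# Assumption: the site keeps using this exact title.
CONSENT_TITLE = "About your privacy on this site"


def is_consent_page(html):
    """Return True if the HTML looks like the privacy-policy interstitial."""
    title = BeautifulSoup(html, "html.parser").title
    # Pages without a <title> element are treated as "not the consent page".
    return title is not None and title.get_text().strip() == CONSENT_TITLE
```

Passing `r.content` from a `requests` response to this helper should tell you whether the cookie was accepted for that request.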

To adapt this in your code, I would instantiate a "session" and set the cookie there:

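import uuid

import requests

consent_id = str(uuid.uuid4())
with requests.Session() as session:
    # Set the cookie on the session's existing cookie jar; replacing the
    # jar with a plain dict breaks the session's cookie handling.
    session.cookies.set('euConsentId', consent_id)

    response = session.get(...)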
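Putting the session approach together with the paging loop from the question might look like the sketch below. The names `make_session`, `fetch_pages`, and `BASE_URL` are hypothetical. It uses the standard library's `random.choice` in place of `np.random.choice`, and it treats `last_page` as inclusive, unlike the `range(2, endPage)` in the question. It assumes that any random UUID is accepted as the `euConsentId` value, as the answer above suggests.

```python
import random
import time
import uuid

import requests

# URL pattern taken from the question.
BASE_URL = "http://allrecipes.co.uk/recipes/{dish_type}.aspx?page={page}"


def make_session(user_agent=None):
    """Create a session that carries a random euConsentId cookie."""
    session = requests.Session()
    # Assumption: the site only checks that the cookie exists.
    session.cookies.set("euConsentId", str(uuid.uuid4()))
    if user_agent:
        session.headers["User-Agent"] = user_agent
    return session


def fetch_pages(session, dish_type, first_page, last_page,
                delays=(5, 7, 9, 11, 13, 15)):
    """Yield (page_number, html_bytes) for each listing page, inclusive."""
    for page in range(first_page, last_page + 1):
        # Pause for a randomly chosen delay between requests, as in the question.
        time.sleep(random.choice(delays))
        response = session.get(BASE_URL.format(dish_type=dish_type, page=page))
        response.raise_for_status()
        yield page, response.content
```

Iterating over `fetch_pages(make_session(), "main-recipes", 2, 3)` would then request pages 2 and 3 through the same session, and each yielded body can be passed to `BeautifulSoup(..., "html.parser")` as before.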

Regarding "python - Need to accept the privacy policy to access a page", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53935093/
