python - Scraping data from a web page with Python 3 when a login is required

I checked this question, but it has only one answer and it is a bit over my head (I've just started using Python). I'm using Python 3.

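I'm trying to scrape data from this page, but the page is very different (and more useful) if you have a BP account. I need my program to log me in before BeautifulSoup fetches the data for me.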

Here is what I have so far:

from bs4 import BeautifulSoup
import urllib.request 
import requests

username = 'myUsername'
password = 'myPassword'

from requests import session

payload = {'action': 'Log in',
       'Username: ': username,
       'Password: ': password}

# the next 3 lines are pretty much copied from a different StackOverflow
# question. I don't really understand what they're doing, and obviously these 
# are where the problem is.

with session() as c:
    c.post('https://www.baseballprospectus.com/manageprofile.php', data=payload)
    response = c.get('http://www.baseballprospectus.com/sortable/index.php?cid=1820315')

soup = BeautifulSoup(response.content, "lxml")

for row in soup.find_all('tr')[7:]:
    cells = row.find_all('td')
    name = cells[1].text
    print(name)

The script does run; it just pulls the data from the site as it appears before logging in, so it isn't the data I want.

Best answer

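Conceptually, there is nothing wrong with your code. You use a session object to send the login request, then use the same session to request the page you want. That means any cookies set by the login request should be kept and sent along with the second request. If you want to know more about how the Session object works, see the relevant Requests documentation.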
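To make the cookie behaviour concrete, here is a minimal sketch that runs entirely against a toy local server, so it needs no account anywhere. Everything in it (`DemoHandler`, `run_demo`, the `/login` and `/data` paths, the `sessionid` cookie) is made up for illustration and says nothing about how the real site works. It only shows that a cookie set by a POST is sent again on a later GET when both go through the same `requests.Session`.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: POST sets a cookie, GET checks whether it came back."""

    def do_POST(self):
        # Drain the form body before answering.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self._reply(b"logged in", cookie="sessionid=abc123")

    def do_GET(self):
        cookie = self.headers.get("Cookie", "")
        if "sessionid=abc123" in cookie:
            self._reply(b"members-only page")
        else:
            self._reply(b"public page")

    def _reply(self, body, cookie=None):
        self.send_response(200)
        if cookie:
            self.send_header("Set-Cookie", cookie)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def run_demo():
    """Return (page fetched by an unrelated session, page fetched by the login session)."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    form = {"username": "demo", "password": "demo"}
    try:
        # Two unrelated sessions: the login cookie stays in the first one.
        with requests.Session() as a, requests.Session() as b:
            a.trust_env = b.trust_env = False  # ignore proxy settings for localhost
            a.post(base + "/login", data=form)
            separate = b.get(base + "/data").text

        # One session for both requests: the cookie is reused automatically.
        with requests.Session() as s:
            s.trust_env = False
            s.post(base + "/login", data=form)
            shared = s.get(base + "/data").text
    finally:
        server.shutdown()
        server.server_close()
    return separate, shared


if __name__ == "__main__":
    print(run_demo())
```

The second value is the one that matters: it is the page the server hands out only when the cookie from the login response is present, which is the situation your `with session() as c:` block sets up.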

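Since I don't have a valid Baseball Prospectus login, I can only guess that the problem lies in the data you send to the login page. A quick check with the Network tab in Chrome's developer tools shows that the login page, manageprofile.php, accepts four POST parameters: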

username: myUsername
password: myPassword
action: muffinklezmer
nocache: some long number, e.g. 2417395155

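However, you are sending a different set of parameters, and a different value for the "action" parameter. Note that the parameter names must match the original request exactly, otherwise manageprofile.php will not accept the login.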
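One way to see the mismatch without sending anything is to let Requests build the request and then look at the form fields it would put in the body. This is only a sketch: `form_fields` is a helper name I made up, and the `expected` set is simply the four names listed above from the Network tab.

```python
from urllib.parse import parse_qs

import requests

LOGIN_URL = "https://www.baseballprospectus.com/manageprofile.php"


def form_fields(payload):
    """Return the form field names Requests would send, without sending anything."""
    prepared = requests.Request("POST", LOGIN_URL, data=payload).prepare()
    # For a dict payload, prepared.body is the URL-encoded form string.
    return set(parse_qs(prepared.body, keep_blank_values=True))


# The payload from the question, with its original keys.
original = {
    "action": "Log in",
    "Username: ": "myUsername",
    "Password: ": "myPassword",
}
# Field names seen in the browser's Network tab.
expected = {"username", "password", "action", "nocache"}

sent = form_fields(original)
print("sent but not expected:", sorted(sent - expected))
print("expected but not sent:", sorted(expected - sent))
```

The keys `'Username: '` and `'Password: '` show up as their own field names, colon and trailing space included, so the server never sees a `username` or `password` field at all.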

Try replacing the payload dictionary with this version:

payload = {
       'action': 'muffinklezmer',
       'username': username,
       'password': password}

If that doesn't work, try adding the "nocache" parameter, for example:

'nocache': '1437955145'
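Putting the pieces together, the whole flow might look roughly like the sketch below. I can't test it against the real site, so treat it as an assumption-laden outline: `build_payload` and `fetch_after_login` are names I invented, the `timeout` value is arbitrary, and the "action" value and URLs are just the ones mentioned in this post.

```python
import requests

LOGIN_URL = "https://www.baseballprospectus.com/manageprofile.php"
DATA_URL = "http://www.baseballprospectus.com/sortable/index.php?cid=1820315"


def build_payload(username, password, nocache=None):
    """Build the login form using the field names seen in the Network tab."""
    payload = {
        "action": "muffinklezmer",
        "username": username,
        "password": password,
    }
    if nocache is not None:
        # Optional extra field; only add it if the plain payload is rejected.
        payload["nocache"] = str(nocache)
    return payload


def fetch_after_login(username, password, nocache=None, timeout=30):
    """Log in and fetch the data page through one session; return the raw HTML."""
    with requests.Session() as s:
        login = s.post(LOGIN_URL, data=build_payload(username, password, nocache),
                       timeout=timeout)
        login.raise_for_status()  # catches HTTP errors only, not a rejected login
        page = s.get(DATA_URL, timeout=timeout)
        page.raise_for_status()
        return page.content


if __name__ == "__main__":
    html = fetch_after_login("myUsername", "myPassword")
    print(len(html), "bytes received")
```

The returned bytes can be passed to `BeautifulSoup(html, "lxml")` exactly as in your script. A 200 status on the login request does not prove the login worked, so it is still worth checking that the page you get back contains the logged-in content.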

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/31642416/
