python - Submitting form data and authenticating with Python and Mechanize

Tags: python networking screen-scraping mechanize

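I want to log in to Reddit.com, navigate to a particular area of a page, and then submit a comment. I can't see what is wrong with this code, but it isn't working: no change shows up on the Reddit site.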

import mechanize
import cookielib


def main():

    # Browser
    br = mechanize.Browser()

    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)

    # Follows refresh 0 but not hangs on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    # Opens the site to be navigated
    r = br.open('http://www.reddit.com')
    html = r.read()

    # Select the second (index one) form
    br.select_form(nr=1)

    # User credentials
    br.form['user'] = 'DUMMYUSERNAME'
    br.form['passwd'] = 'DUMMYPASSWORD'

    # Login
    br.submit()

    # Open up comment page
    r = br.open('http://www.reddit.com/r/PoopSandwiches/comments/f47f8/testing/')
    html = r.read()

    # Text box is the 8th form on the page (which, I believe, is the text area)
    br.select_form(nr=7)

    # Change 'text' value to a testing string
    br.form['text'] = "this is an automated test"

    # Submit the information
    br.submit()

What is going on here?

Best answer

I would definitely suggest using the API if possible, but this works for me (not for your example post, which has been deleted, but for any active one):

#!/usr/bin/env python

import mechanize
import cookielib
import urllib

def main():

    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)

    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    r= br.open('http://www.reddit.com')

    # Select the second (index one) form
    br.select_form(nr=1)

    # User credentials
    br.form['user'] = 'user'
    br.form['passwd'] = 'passwd'

    # Login
    br.submit()

    # Open up comment page
    posting = 'http://www.reddit.com/r/PoopSandwiches/comments/f47f8/testing/'
    rval = 'PoopSandwiches'
    # you can get the rval in other ways, but this will work for testing

    r = br.open(posting)

    # You need the 'uh' value from the first form
    br.select_form(nr=0)
    uh = br.form['uh']

    br.select_form(nr=7)
    thing_id = br.form['thing_id']
    id = '#' + br.form.attrs['id']
    # The id that gets posted is the form id with a '#' prepended.

    data = {'uh':uh, 'thing_id':thing_id, 'id':id, 'renderstyle':'html', 'r':rval, 'text':"Your text here!"}
    new_data_dict = dict((k, urllib.quote(v).replace('%20', '+')) for k, v in data.iteritems())

    # not sure if the replace needs to happen, I did it anyway
    new_data = 'thing_id=%(thing_id)s&text=%(text)s&id=%(id)s&r=%(r)s&uh=%(uh)s&renderstyle=%(renderstyle)s' %(new_data_dict)

    # not sure which of these headers are really needed, but it works with all
    # of them, so why not just include them.
    req = mechanize.Request('http://www.reddit.com/api/comment', new_data)
    req.add_header('Referer', posting)
    req.add_header('Accept', 'application/json, text/javascript, */*')
    req.add_header('Content-Type', 'application/x-www-form-urlencoded; charset=UTF-8')
    req.add_header('X-Requested-With', 'XMLHttpRequest')
    cj.add_cookie_header(req)
    res = mechanize.urlopen(req)

main()
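
As a side note, the hand-assembled query string in the script above can probably be left to the standard library's form encoder. The following is only a minimal sketch, assuming Python 3 (where `urllib.parse.urlencode` takes the place of the Python 2 `urllib.quote` calls). `build_comment_body` is a hypothetical helper name, and every field value below is a placeholder rather than a real reddit token.

```python
from urllib.parse import urlencode


def build_comment_body(uh, thing_id, form_id, subreddit, text):
    """Encode the comment fields as an application/x-www-form-urlencoded body.

    urlencode percent-escapes each value and turns spaces into '+',
    covering the same ground as the manual quote()/replace() step.
    """
    fields = [
        ('thing_id', thing_id),
        ('text', text),
        ('id', '#' + form_id),  # the posted id is the form id with '#' prepended
        ('r', subreddit),
        ('uh', uh),
        ('renderstyle', 'html'),
    ]
    return urlencode(fields)


# Placeholder values for illustration only (assumptions, not real tokens).
body = build_comment_body(
    uh='abc123',
    thing_id='t3_f47f8',
    form_id='form-t3_f47f8',
    subreddit='PoopSandwiches',
    text='Your text here!',
)
print(body)
```

Passing a list of pairs rather than a dict keeps the field order explicit, which makes the result easier to compare against what the browser sends.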

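It would be interesting to turn off JavaScript and see how reddit handles a comment then. Right now a bunch of magic happens in the onsubmit function that is called when you post. That is where the uh and id values get added.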
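
To see which values the onsubmit handler has available, it can help to list each form's id and hidden inputs from the page HTML. Here is a rough sketch using only the standard library's `html.parser`, assuming Python 3. `FormFieldCollector` is a hypothetical class name, and the sample markup is invented for illustration; it is not reddit's actual page.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect, per <form>, its id attribute and its hidden input values."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per form, in document order
        self._current = None  # form being parsed, or None outside a form

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self._current = {'id': attrs.get('id'), 'hidden': {}}
            self.forms.append(self._current)
        elif tag == 'input' and self._current is not None:
            if attrs.get('type') == 'hidden' and 'name' in attrs:
                self._current['hidden'][attrs['name']] = attrs.get('value') or ''

    def handle_endtag(self, tag):
        if tag == 'form':
            self._current = None


# Invented sample markup (an assumption, not the real reddit page).
SAMPLE_HTML = """
<form id="search"><input type="hidden" name="uh" value="abc123"></form>
<form id="form-t3_f47f8">
  <input type="hidden" name="thing_id" value="t3_f47f8">
  <textarea name="text"></textarea>
</form>
"""

collector = FormFieldCollector()
collector.feed(SAMPLE_HTML)
for index, form in enumerate(collector.forms):
    print(index, form['id'], form['hidden'])
```

The printed index is meant to line up with the `nr=` argument of `select_form`, though that is worth checking against `br.forms()` on the real page.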

For "python - Submitting form data and authenticating with Python and Mechanize", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4720470/
