c# - Screen scraping a page that makes an AJAX request

Tags: c# screen-scraping

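I am trying to get the HTML of a page by screen scraping with HttpWebRequest in C#.

Scraping a normal page works fine. But when I try to get the HTML of a page whose content is loaded through an AJAX request, I run into trouble. These are the two requests the browser sends when I open that page: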

Normal request to get past the login

POST (http)://example/user/login?destination=/events/Sports HTTP/1.1
Host: example
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Referer: (http)://example/user/login?destination=/events/Sports
Cookie: has_js=1; SESSee201d4242c83ea2671330cdceee4623=qdco8gukcm2pk9offdof1uv3a0
Content-Type: application/x-www-form-urlencoded
Content-Length: 121

name=Username&pass=Password&remember_me=1&form_build_id=form11cb87efa605eb9fb384eb9d2a2c686e&form_id=user_login&op=Go

AJAX request that fetches the data

GET (http)://example/views/ajax?name=Sports&view_name=Events&view_display_id=page_1&view_args=Sports&view_path=events%2FSports&view_base_path=events&view_dom_id=1&pager_element=0 HTTP/1.1
Host: example
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0
Accept: application/json, text/javascript, */*
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
X-Requested-With: XMLHttpRequest
Referer: (http)://example//events/Sports
Cookie: has_js=1; SESSee201d4242c83ea2671330cdceee4623=vd36esbpe8065snbfo39ubhmk3
If-Modified-Since: Wed, 23 May 2012 08:13:51 GMT

I tried writing the code this way, but it does not work:

string sid = String.Empty;
string uri = "http://example/user/login?destination=/events/Sports";
string postData = string.Format("name=UserName&pass=Password&remember_me=1&form_build_id=form-11cb87efa605eb9fb384eb9d2a2c686e&form_id=user_login&op=Go");
byte[] postBytes = Encoding.UTF8.GetBytes(postData);

//web request

HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(uri);
req.UserAgent = "Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0";

req.KeepAlive = true;

////set the cookie
Cookie cookie = new Cookie();
cookie.Name = "Sports";

cookie.Domain = "SESSee201d4242c83ea2671330cdceee4623";
req.CookieContainer = new CookieContainer();
req.CookieContainer.Add(cookie);

req.Headers.Add("Accept-Encoding", "gzip, deflate");
req.Headers.Add("Accept-Language", "en-us,en;q=0.5");
req.Method = "POST";
req.Host = "example";
req.Referer = "http://example/user/login?destination=/events/Sports";
req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";

req.ContentType = "application/x-www-form-urlencoded";
req.ContentLength = postBytes.Length;

//getting the request stream and posting data
StreamWriter requestwriter = new StreamWriter(req.GetRequestStream(), System.Text.Encoding.ASCII);
requestwriter.Write(postData);
requestwriter.Close();
string url = "http://example/views/ajax?name=Sports&view_name=Events&view_display_id=page_1&view_args=Sports&view_path=events%2FSports&view_base_path=events&view_dom_id=1&pager_element=0";

HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
request.UserAgent = "Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0";
request.KeepAlive = true;
request.Headers.Add("X-Requested-With", "XMLHttpRequest");
request.Headers.Add("Accept-Encoding", "gzip, deflate");
request.Headers.Add("Accept-Language", "en-us,en;q=0.5");
request.Host = "example";
request.Method = "GET";
request.Referer = "http://example//events/Sports";
request.Accept = "application/json, text/javascript, */*";

request.CookieContainer.Add(cookie);
request.ContentType = "text/javascript; charset=utf-8";

try
{
    HttpWebResponse res = (HttpWebResponse)request.GetResponse();
    StreamReader sr = new StreamReader(res.GetResponseStream());
    sid = sr.ReadToEnd().Trim();
}
catch {}

In the string sid I get { "status": false, "display": "", "messages": ""}. Instead, it should return status true and some value in display.

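Best answer

You are creating a new CookieContainer for each request, so assuming your login request returns a cookie containing your session ID, your next request will not have that session ID in its cookie container.

Use the same CookieContainer for both requests, i.e.: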

    // Login request. uri, postBytes and sid are declared as in the question.
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
    req.UserAgent = "Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0";
    req.KeepAlive = true;

    // One container shared by both requests, so the session cookie set by
    // the login response is sent again with the AJAX request.
    CookieContainer cookies = new CookieContainer();
    req.CookieContainer = cookies;

    // Sends Accept-Encoding and decompresses the response automatically;
    // adding the header by hand would leave the body compressed.
    req.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    req.Headers.Add("Accept-Language", "en-us,en;q=0.5");
    req.Method = "POST";
    req.Host = "example";
    req.Referer = "http://example/user/login?destination=/events/Sports";
    req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    req.ContentType = "application/x-www-form-urlencoded";
    req.ContentLength = postBytes.Length;

    // Post the form data; write the same bytes that ContentLength was computed from.
    using (Stream requestStream = req.GetRequestStream())
    {
        requestStream.Write(postBytes, 0, postBytes.Length);
    }

    // Read the login response to the end so its cookies land in the container.
    using (var firstResponse = req.GetResponse())
    using (var sr = new StreamReader(firstResponse.GetResponseStream()))
    {
        sr.ReadToEnd();
    }

    string url = "http://example/views/ajax?name=Sports&view_name=Events&view_display_id=page_1&view_args=Sports&view_path=events%2FSports&view_base_path=events&view_dom_id=1&pager_element=0";

    // AJAX request, reusing the cookies collected during login.
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.UserAgent = "Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0";
    request.KeepAlive = true;
    request.Headers.Add("X-Requested-With", "XMLHttpRequest");
    request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    request.Headers.Add("Accept-Language", "en-us,en;q=0.5");
    request.Host = "example";
    request.Method = "GET";
    request.Referer = "http://example//events/Sports";
    request.Accept = "application/json, text/javascript, */*";
    request.CookieContainer = cookies;
    request.ContentType = "text/javascript; charset=utf-8";

    try
    {
        using (HttpWebResponse res = (HttpWebResponse)request.GetResponse())
        using (StreamReader sr = new StreamReader(res.GetResponseStream()))
        {
            sid = sr.ReadToEnd().Trim();
        }
    }
    catch (WebException ex)
    {
        // Report HTTP and network failures instead of silently ignoring them.
        Console.WriteLine(ex.Message);
    }
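
To see the effect of sharing the container without touching the network, here is a minimal sketch. The host http://example.com/ and the session cookie name and value are placeholders, not values from the question. SetCookies stands in for the Set-Cookie header that a login response would send, and CookieHeaderFor is a hypothetical helper introduced only for this illustration. It also shows a cookie added by hand with its name and value in the right fields, which the code in the question appears to mix up (the session name is assigned to Domain there).

```csharp
using System;
using System.Net;

class SharedCookieContainerDemo
{
    // Hypothetical helper: returns the Cookie header the container
    // would attach to a request for the given URL.
    static string CookieHeaderFor(CookieContainer jar, string url)
    {
        return jar.GetCookieHeader(new Uri(url));
    }

    static void Main()
    {
        // Placeholder host and cookie values, assumptions for illustration only.
        var site = new Uri("http://example.com/");
        const string ajaxUrl = "http://example.com/views/ajax?name=Sports";

        var shared = new CookieContainer();

        // Stand-in for the Set-Cookie header of the login response.
        shared.SetCookies(site, "SESSabc123=placeholder-session-id; path=/");

        // A cookie added by hand: name and value go into the Cookie,
        // and the Uri supplies the domain.
        shared.Add(site, new Cookie("has_js", "1"));

        // The shared container replays both cookies for the AJAX URL.
        Console.WriteLine("shared: " + CookieHeaderFor(shared, ajaxUrl));

        // A fresh container has no session cookie to send.
        Console.WriteLine("fresh:  " + CookieHeaderFor(new CookieContainer(), ajaxUrl));
    }
}
```

In the real flow the container would be filled by the login response itself, provided the same CookieContainer instance is assigned to both HttpWebRequest objects before GetResponse is called.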

Source: the Stack Overflow question "c# - Screen scraping a page that makes an AJAX request": https://stackoverflow.com/questions/10717079/
