c# - Scrape data from a website that requires a login, using HttpClient

Tags: c# httpclient html-agility-pack

I want to scrape some data from the following website:

http://wttv.click-tt.de/cgi-bin/WebObjects/nuLigaTTDE.woa/wa/teamPortrait?teamtable=1673669&pageState=rueckrunde&championship=SK+Bez.+BB+13%2F14&group=204559# .

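The site contains table tennis data. The current season can be accessed without logging in, while previous seasons are only available after a login. For the current season I have already written code that fetches the data, and it works fine. I'm using HttpClient together with HtmlAgilityPack. The code looks like this: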

            // website holds the URL of the page to scrape
            HttpClient http = new HttpClient();
            var response = await http.GetByteArrayAsync(website);
            // decode the complete response body as UTF-8
            String source = Encoding.UTF8.GetString(response);
            source = WebUtility.HtmlDecode(source);
            HtmlDocument resultat = new HtmlDocument();
            resultat.LoadHtml(source);

            // ... extract the relevant data by walking the nodes under resultat.DocumentNode ...

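Now I want to get data from the pages that require a login. Does anyone know how to log in to the website and fetch the data? Logging in is done by clicking "Ergebnishistorie freischalten ..." and then entering a username and password.

Best answer

There are many ways to log in to a website, depending on the authentication method the site uses (forms authentication, basic authentication, Windows authentication, etc.). Most websites use forms authentication.

To log in to a standard forms-authentication website with HttpClient, you need to set a CookieContainer, because the authentication data is stored in cookies.

In your specific case, the login form POSTs over HTTPS to whichever page you are currently on. I used https://wttv.click-tt.de/cgi-bin/WebObjects/nuLigaTTDE.woa/wa/teamPortrait?teamtable=1673669&pageState=rueckrunde&championship=SK+Bez.+BB+13%2F14&group=204559 as an example. This is the code that makes the request with HttpClient: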
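To illustrate why the CookieContainer matters, here is a minimal offline sketch of how it stores a cookie for one site and hands it back for later requests to that site. The host names, the cookie name "auth" and its value are made up for illustration; a real login response would set whatever cookies the server chooses.

```csharp
using System;
using System.Net;

public static class CookieContainerSketch
{
    // Returns the Cookie header the container would attach to a request for requestUri.
    public static string CookieHeaderFor(CookieContainer container, Uri requestUri)
    {
        return container.GetCookieHeader(requestUri);
    }

    public static void Main()
    {
        var container = new CookieContainer();
        var site = new Uri("https://example.test/");

        // Simulate what a login response does: store an authentication cookie for the site.
        // "auth" and "abc123" are hypothetical values, not taken from any real site.
        container.Add(site, new Cookie("auth", "abc123"));

        // A later request to the same site gets the cookie back...
        Console.WriteLine(CookieHeaderFor(container, new Uri("https://example.test/protected")));

        // ...while a request to a different host gets an empty header.
        Console.WriteLine(CookieHeaderFor(container, new Uri("https://other.test/")).Length);
    }
}
```

When the container is assigned to an HttpClientHandler, this storing and replaying happens automatically for every request made through the client.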

using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;

var baseAddress = new Uri("https://wttv.click-tt.de/");
var cookieContainer = new CookieContainer();
using (var handler = new HttpClientHandler() { CookieContainer = cookieContainer })
using (var client = new HttpClient(handler) { BaseAddress = baseAddress })
{
    // I usually make a standard request without authentication first, e.g. to the home page.
    // This request stores some initial cookie values that the server may check in the subsequent login request.
    var homePageResult = await client.GetAsync("/");
    homePageResult.EnsureSuccessStatusCode();

    var content = new FormUrlEncodedContent(new[]
    {
        // The form value names must match the names of the login form's <input /> tags; here the tag is <input type="text" name="username">.
        new KeyValuePair<string, string>("username", "username"),
        new KeyValuePair<string, string>("password", "password"),
    });
    var loginResult = await client.PostAsync("/cgi-bin/WebObjects/nuLigaTTDE.woa/wa/teamPortrait?teamtable=1673669&pageState=rueckrunde&championship=SK+Bez.+BB+13%2F14&group=204559", content);
    loginResult.EnsureSuccessStatusCode();

    // Make the subsequent web requests using the same HttpClient object.
}
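Note that EnsureSuccessStatusCode only checks the HTTP status. Sites using forms authentication often answer a failed login with a normal 200 page that shows the login form again, so it can help to inspect the returned HTML as well. The following is a rough heuristic sketch, not something this site documents: the helper name LooksLoggedIn and both sample pages are hypothetical, and the check assumes a failed login re-renders a password field.

```csharp
using System;

public static class LoginCheckSketch
{
    // Heuristic (an assumption): if the page returned after the login POST
    // still contains a password input, the login most likely failed.
    public static bool LooksLoggedIn(string html)
    {
        return html.IndexOf("type=\"password\"", StringComparison.OrdinalIgnoreCase) < 0;
    }

    public static void Main()
    {
        // Made-up sample pages for illustration only.
        string failedLoginPage =
            "<form><input type=\"text\" name=\"username\">" +
            "<input type=\"password\" name=\"password\"></form>";
        string protectedPage = "<table><tr><td>Results 2013/14</td></tr></table>";

        Console.WriteLine(LooksLoggedIn(failedLoginPage)); // False
        Console.WriteLine(LooksLoggedIn(protectedPage));   // True
    }
}
```

In the code above, the HTML to test would come from reading loginResult.Content as a string. A marker that only appears on the protected page is a more reliable signal if one is available.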

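However, many websites use form values that are loaded by JavaScript, or even CAPTCHA controls, and in those cases this solution obviously won't work. Such logins can be handled with something like the WebBrowser control (by automatically filling in the form fields and then clicking the login button; this link has an example: https://social.msdn.microsoft.com/Forums/vstudio/en-US/0b77ca8c-48ce-4fa8-9367-c7491aa359b0/yahoo-login-via-systemnetsockets-namespace?forum=vbgeneral).

As a general rule, to check how the login works on the website you are interested in, use Fiddler: http://www.telerik.com/fiddler. When you click the login button on the site, watch Fiddler and find the login request (usually the first request after you click the "Login" button, and usually a POST request).

Then inspect the request data (select the request and go to the "Inspectors" - "TextView" tab) and try to replicate the request in your code.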

In Fiddler, the left pane lists all the intercepted requests, and the right pane holds the request and response inspectors (the request inspector on top, the response inspector below).
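Once the field names are known from the captured request, one way to replicate the body is to let FormUrlEncodedContent do the encoding and compare its output with what Fiddler shows in TextView. This is a small offline sketch; the field "extraField" and all values are hypothetical placeholders for whatever the real login request contains.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;

public static class CapturedFormSketch
{
    // Builds the request body that FormUrlEncodedContent would send for the given fields.
    public static string BuildBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        using (var content = new FormUrlEncodedContent(fields))
        {
            return content.ReadAsStringAsync().GetAwaiter().GetResult();
        }
    }

    public static void Main()
    {
        var fields = new[]
        {
            new KeyValuePair<string, string>("username", "my user"),
            new KeyValuePair<string, string>("password", "p&ss"),
            // Hypothetical extra field: include every field the captured request contains.
            new KeyValuePair<string, string>("extraField", "1"),
        };

        // Prints: username=my+user&password=p%26ss&extraField=1
        Console.WriteLine(BuildBody(fields));
    }
}
```

If the generated body differs from the captured one only in the credential values, the POST in your code should be equivalent to the one the browser sends.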

Edit

The same code using the old WebRequest class: http://rextester.com/LLP86817

using System;
using System.IO;
using System.Net;

var cookieContainer = new CookieContainer();

HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("https://wttv.click-tt.de/");
request.CookieContainer = cookieContainer;
//set the user agent and accept header values, to simulate a real web browser
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";


//SET AUTOMATIC DECOMPRESSION
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

Console.WriteLine("FIRST RESPONSE");
Console.WriteLine();
using (WebResponse response = request.GetResponse())
{
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        Console.WriteLine(sr.ReadToEnd());
    }
}

request = (HttpWebRequest)HttpWebRequest.Create("https://wttv.click-tt.de/cgi-bin/WebObjects/nuLigaTTDE.woa/wa/teamPortrait?teamtable=1673669&pageState=rueckrunde&championship=SK+Bez.+BB+13%2F14&group=204559");
//set the cookie container object
request.CookieContainer = cookieContainer;
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";

//set method POST and content type application/x-www-form-urlencoded
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";

//SET AUTOMATIC DECOMPRESSION
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

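//insert your username and password; the values are URL-encoded so that characters such as & or = do not break the form body
string data = string.Format("username={0}&password={1}", System.Net.WebUtility.UrlEncode("username"), System.Net.WebUtility.UrlEncode("password"));
byte[] bytes = System.Text.Encoding.UTF8.GetBytes(data);

request.ContentLength = bytes.Length;

//the using block closes the stream, so no explicit Close() call is needed
using (Stream dataStream = request.GetRequestStream())
{
    dataStream.Write(bytes, 0, bytes.Length);
}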

Console.WriteLine("LOGIN RESPONSE");
Console.WriteLine();
using (WebResponse response = request.GetResponse())
{
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        Console.WriteLine(sr.ReadToEnd());
    }
}

//request = (HttpWebRequest)HttpWebRequest.Create("INTERNAL PROTECTED PAGE ADDRESS");
//After a successful login, you must use the same cookie container for all requests
//request.CookieContainer = cookieContainer;

//....

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/32860666/
