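I'm trying to start building a web crawler. It was going well until I ran into something I can't make sense of. I wrote the code below, and I pass http://www.google.com as the string URL:
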
    public void crawlURL(string URL, string depth)
    {
        if (!checkPageHasBeenCrawled(URL))
        {
            PageContent = getURLContent(URL);
            MatchCollection matches = Regex.Matches(PageContent, "href=\"", RegexOptions.IgnoreCase);
            int count = matches.Count;
        }
    }

    private string getURLContent(string URL)
    {
        string content;
        HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(URL);
        request.UserAgent = "Fetching contents Data";
        WebResponse response = request.GetResponse();
        Stream stream = response.GetResponseStream();
        StreamReader reader = new StreamReader(stream);
        content = reader.ReadToEnd();
        reader.Close();
        stream.Close();
        return content;
    }

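Problem: I'm trying to get all the links on a page (http://www.google.com or any other site), but the regular expression matches fewer links than I expect. It reports 19 links, whereas when I manually search the page source for "href=" I find 41 occurrences. I don't understand why my code gives the lower count.
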
Best answer

I fixed and tested your regex pattern. The following should work better. It gets 11 matches from google.ca:

    using System;
    using System.IO;
    using System.Net;
    using System.Text.RegularExpressions;

    public void crawlURL(string URL)
    {
        string pageContent = getURLContent(URL);
        // Matches only double-quoted, absolute http/https URLs; relative links are skipped.
        MatchCollection matches = Regex.Matches(pageContent, "(href=\"https?://[a-z0-9\\-._~:/?#\\[\\]@!$&'()*+,;=]+(?=\"|$))", RegexOptions.IgnoreCase);
        foreach (Match match in matches)
            Console.WriteLine(match.Value);
        int count = matches.Count;
    }

    private string getURLContent(string URL)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
        request.UserAgent = "Fetching contents Data";
        // using blocks close the response, stream, and reader even if reading throws.
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
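
One possible reason for the gap between 19 and 41, which the answer does not confirm, is that the question's pattern `href="` only counts double-quoted attributes, while a manual search for `href=` also counts single-quoted and unquoted ones (and any occurrences inside scripts). The sketch below illustrates the difference on a made-up HTML string; the class name `HrefCounter`, its methods, and the sample markup are assumptions for illustration, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefCounter
{
    // Counts every "href=" occurrence, whatever follows the equals sign.
    public static int CountAll(string html) =>
        Regex.Matches(html, "href=", RegexOptions.IgnoreCase).Count;

    // Counts only double-quoted attributes, as the question's pattern does.
    public static int CountDoubleQuoted(string html) =>
        Regex.Matches(html, "href=\"", RegexOptions.IgnoreCase).Count;

    // Extracts the attribute value for double-quoted, single-quoted,
    // and unquoted forms. .NET allows the group name "v" to be reused
    // across alternatives.
    public static string[] ExtractValues(string html)
    {
        string pattern =
            "href\\s*=\\s*(?:\"(?<v>[^\"]*)\"|'(?<v>[^']*)'|(?<v>[^\\s>\"']+))";
        MatchCollection matches = Regex.Matches(html, pattern, RegexOptions.IgnoreCase);
        var values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            values[i] = matches[i].Groups["v"].Value;
        return values;
    }

    public static void Main()
    {
        // Hypothetical sample markup mixing the three quoting styles.
        string html = "<a href=\"http://a.example/\">A</a>" +
                      "<a href='/b'>B</a>" +
                      "<a href=/c>C</a>";

        Console.WriteLine(CountAll(html));          // prints 3
        Console.WriteLine(CountDoubleQuoted(html)); // prints 1
        foreach (string value in ExtractValues(html))
            Console.WriteLine(value);               // http://a.example/, /b, /c
    }
}
```

Regular expressions remain a rough tool for HTML; if the counts still disagree, it may also be worth comparing the string returned by `getURLContent` with the source shown in the browser, since a server can return different markup for a different user agent.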
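
Regarding "c# - Need help understanding a confusing result when getting the total link count in a web crawler", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/13478565/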