I want to create a PHP function that starts at a site's home page, finds all the links on that page, follows each link it finds, and keeps going until every link on the site has been visited. I really need to build something like this so I can crawl my network of sites and provide a "one-stop" search.
Here's what I have so far -
function spider($urltospider, $current_array = array(), $ignore_array = array('')) {
    if (empty($current_array)) {
        // Make the request to the original URL
        $session = curl_init($urltospider);
        curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($session);
        curl_close($session);
        if ($html != '') {
            $dom = new DOMDocument();
            @$dom->loadHTML($html);
            $xpath = new DOMXPath($dom);
            $hrefs = $xpath->evaluate("/html/body//a");
            for ($i = 0; $i < $hrefs->length; $i++) {
                $href = $hrefs->item($i);
                $url = $href->getAttribute('href');
                if (!in_array($url, $ignore_array) && !in_array($url, $current_array)) {
                    // Add this URL to the current spider array
                    $current_array[] = $url;
                }
            }
        } else {
            die('Failed connection to the URL');
        }
    } else {
        // There are already URLs in the current array
        foreach ($current_array as $url) {
            // Connect to this URL
            // Find all the links in this URL
            // Go through each URL and get more links
        }
    }
}
The only problem is that I can't figure out how to proceed from here. Can anyone help me? Basically, this function needs to repeat itself until everything has been found.
Best answer
I'm not a PHP expert, but it seems like you're over-complicating it.
function spider($urltospider, $current_array = array(), $ignore_array = array('')) {
    if (empty($current_array)) {
        $current_array[] = $urltospider;
    }
    $cur_crawl = 0;
    while ($cur_crawl < count($current_array)) { // don't use foreach, because that can get messed up if you change the array while inside the loop
        $links_found = crawl($current_array[$cur_crawl]); // crawl() should return all links found on the given page
        // Only add links that aren't already in $current_array, so no page is crawled more than once
        $current_array = array_merge($current_array, array_diff($links_found, $current_array));
        $cur_crawl += 1;
    }
    return $current_array;
}
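The answer leaves crawl() undefined. A minimal sketch of what it could look like, reusing the cURL + DOMDocument approach from the question; the helper extract_links() and its simplified relative-URL handling are assumptions for illustration, not a full RFC 3986 URL resolver:

```php
<?php
// Extract the href of every <a> tag in $html, resolving relative URLs against
// $base_url so that duplicates compare equal. Simplified assumption: relative
// paths are just appended to the base, which is enough for a single-site crawl.
function extract_links($html, $base_url) {
    $links = array();
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from malformed HTML, as in the question
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        // Skip empty hrefs, in-page fragments, and non-HTTP schemes
        if ($href === '' || $href[0] === '#') continue;
        if (preg_match('#^(mailto|javascript|tel):#i', $href)) continue;
        // Resolve relative URLs against the base (simplified)
        if (!preg_match('#^https?://#i', $href)) {
            $href = rtrim($base_url, '/') . '/' . ltrim($href, '/');
        }
        $links[] = $href;
    }
    return array_values(array_unique($links));
}

// Fetch one page and return the links found on it. Returning an empty array on
// a failed fetch keeps the spider loop running instead of dying mid-crawl.
function crawl($url) {
    $session = curl_init($url);
    curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($session, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($session);
    curl_close($session);
    if ($html === false || $html === '') {
        return array();
    }
    return extract_links($html, $url);
}
```

One design note: keeping the link extraction in its own function makes it easy to test against a static HTML string without hitting the network, and to later restrict the crawl to your own domains before a link is added to the queue.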
Regarding "php - How can I create a function that repeats itself until it finds everything?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/3274504/