php - Extracting site data with a web crawler produces wrong output because of mismatched array indexes

Tags: php web-crawler

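I have been trying to use a web crawler to pull the text and links from a given table (located on site1.com) into my PHP page.

Unfortunately, the array indexes in my PHP code are wrong, so the output is wrong.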

site1.com

<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="65%" valign="top" class="Title2">Subject</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="14%" valign="top" align="Center" class="Title2">Last Update</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="8%" valign="top" align="Center" class="Title2">Replies</td>
    <td width="1%" valign="top" class="Title2">&nbsp;</td>
    <td width="9%" valign="top" align="Center" class="Title2">Views</td>
</tr>
<tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837110.php" target="_top" class="Links2">Serious dedicated study partner for U World</a> - step12013</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="8%" height="25" align="Center" class="FootNotes2">10</td>
    <td width="1%" height="25">&nbsp;</td>
    <td width="9%" height="25" align="Center" class="FootNotes2">318</td>
</tr>
</tbody>
</table>

The PHP web crawler:

<?php
    function get_data($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL,$url);
    $result=curl_exec($ch);
    curl_close($ch);
    return $result;
    }
    $returned_content = get_data('http://www.usmleforum.com/forum/index.php?forum=1');
    $first_step = explode( '<table class="Table2">' , $returned_content );
    $second_step = explode('</table>', $first_step[0]);
    $third_step = explode('<tr>', $second_step[1]);
    // print_r($third_step);
    foreach ($third_step as $key=>$element) {
    $child_first = explode( '<td class="FootNotes2"' , $element );
    $child_second = explode( '</td>' , $child_first[1] );
    $child_third = explode( '<a href=' , $child_second[0] );
    $child_fourth = explode( '</a>' , $child_third[0] );
    $final = "<a href=".$child_fourth[0]."</a></br>";
?>

<li target="_blank" class="itemtitle">
    <?php echo $final?>
</li>

<?php
    if($key==10){
       break;
        }
    }
?>

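The array indexes in the PHP code above are probably the culprit (my guess). If so, could someone explain how to make this work?

My end goal for this code is to get the text shown above together with its associated link in one pass.

Any help is appreciated.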

Best answer

Instead of writing your own parser, you can use an existing one such as Symfony's DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html

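use Symfony\Component\DomCrawler\Crawler;

// $returned_content holds the HTML returned by get_data() in the question
$crawler = new Crawler($returned_content);
// collect the text of every <a> element in the document
$linkTexts = $crawler->filterXPath('//a')->each(function (Crawler $node, $i) {
    return $node->text();
});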

Alternatively, if you want to traverse the DOM tree yourself, you can use DOMDocument::loadHTML: http://php.net/manual/en/domdocument.loadhtml.php

$document = new DOMDocument();
$document->loadHTML($returned_content);
foreach ($document->getElementsByTagName('a') as $link) {
    $text = $link->nodeValue;
}
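
As a side note on why the string-splitting version in the question misbehaves: this is my reading of the posted code and the sample markup, not something checked against the live site. explode() only splits where the delimiter matches the markup exactly, and when it never matches it returns a one-element array containing the whole input. The sketch below uses a trimmed copy of the sample row (the $html variable is an assumption for illustration) to show that the indexes the question reads do not exist.

```php
<?php
// Trimmed copy of the markup shown in the question (illustrative only).
$html = '<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">'
      . '<tr><td width="64%" height="25" class="FootNotes2">'
      . '<a href="/files/forum/2017/1/837110.php" target="_top" class="Links2">'
      . 'Serious dedicated study partner for U World</a> - step12013'
      . '</td></tr></table>';

// The real <table> tag carries border/cellpadding/... before class,
// so this exact delimiter never occurs in the markup.
$first_step = explode('<table class="Table2">', $html);

var_dump(count($first_step));        // int(1): only index 0 exists
var_dump($first_step[0] === $html);  // bool(true): the input came back unsplit

// Same problem one level down: the <td> has width/height before class.
$child_first = explode('<td class="FootNotes2"', $html);
var_dump(isset($child_first[1]));    // bool(false): reading [1] is an undefined index
```

Because attribute order and whitespace can change at any time, matching on literal tag strings is fragile, which is the main reason to prefer a DOM parser here.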

Edit:

To get the links you want, the code below assumes you have a $returned_content variable containing the HTML you want to parse.

// creating a new instance of DOMDocument (DOM = Document Object Model)
$domDocument = new DOMDocument();
// save previous libxml error reporting and set error reporting to internal
// to be able to parse not well formed HTML doc
$previousErrorReporting = libxml_use_internal_errors(true);
$domDocument->loadHTML($returned_content);
libxml_use_internal_errors($previousErrorReporting);
$links = [];
/** @var DOMElement $node */
// getting all <a> element from the HTML
foreach ($domDocument->getElementsByTagName('a') as $node) {
    $parentNode = $node->parentNode;
    // checking if the <a> is under a <td> that has class="FootNotes2"
    $isChildOfAFootNotesTd = $parentNode->nodeName === 'td' && $parentNode->getAttribute('class') === 'FootNotes2';
    // checking if the <a> has class="Links2"
    $isLinkOfLink2Class = $node->getAttribute('class') == 'Links2';
    // as I assumed you wanted links from the <td> this check makes sure that both of the above conditions are fulfilled
    if ($isChildOfAFootNotesTd && $isLinkOfLink2Class) {
        $links[] = [
            'href' => $node->getAttribute('href'),
            'text' => $parentNode->textContent,
        ];
    }
}

print_r($links);

This creates an array like the following:

Array
(
    [0] => Array
    (
        [href] => /files/forum/2017/1/837242.php
        [text] => Q@Q Drill Time ① - cardio69
    ) 
    [1] => Array
    (
        [href] => /files/forum/2017/1/837356.php
        [text] => study partner in Houston - lacy
    )
    [2] => Array
    (
        [href] => /files/forum/2017/1/837110.php
        [text] => Serious dedicated study partner for U World - step12013
    )
    ...
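
The same filter can also be written as a single XPath query with DOMXPath, which ships with PHP's DOM extension. This is only a sketch of an alternative, not the approach used above: the $html sample is copied from the row in the question so the snippet runs on its own, and the exact-match test on the class attribute assumes each element carries just that one class.

```php
<?php
// Sample row copied from the question so the sketch is self-contained.
$html = <<<'HTML'
<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
<tbody><tr>
    <td width="1%" height="25">&nbsp;</td>
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837110.php" target="_top" class="Links2">Serious dedicated study partner for U World</a> - step12013</td>
    <td width="14%" height="25" class="FootNotes2" align="center">02/11/17 01:50</td>
</tr>
</tbody>
</table>
HTML;

$doc = new DOMDocument();
// Collect parse warnings internally so imperfect HTML does not emit notices.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$links = [];
// <a class="Links2"> elements that are direct children of <td class="FootNotes2">
foreach ($xpath->query('//td[@class="FootNotes2"]/a[@class="Links2"]') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        // full cell text, i.e. the link text plus the author name after it
        'text' => trim($a->parentNode->textContent),
    ];
}

print_r($links);
```

For real input, replace the $html sample with the $returned_content string fetched by get_data().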

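Regarding "php - Extracting site data with a web crawler produces wrong output because of mismatched array indexes", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42137646/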
