php - DOMDocument for parsing HTML (instead of regex)

I am trying to learn how to use DOMDocument to parse HTML.

I am only doing some simple work. I liked Gordon's answer on "scrap data using regex and simplehtmldom" and based my code on his work.

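I find the documentation on PHP.net not very helpful: the information is limited, there are almost no examples, and most of the details are written with XML parsing in mind.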

<?php
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://www.nu.nl/internet/1106541/taalunie-keurt-open-sourcewoordenlijst-goed.html');
libxml_clear_errors();

$recipe = array();
$xpath = new DOMXPath($dom);
$contentDiv = $dom->getElementById('page'); // would have preferred getContentbyClass('content') (unique) in this case.

# title
print_r($xpath->evaluate('string(div/div/div/div/div/h1)', $contentDiv));

# content (this is not working)
#print_r($xpath->evaluate('string(div/div/div/div['content'])', $contentDiv)); // if only this worked
print_r($xpath->evaluate('string(div/div/div/div)', $contentDiv));
?>

For testing purposes, I am trying to get the title (the text between the h1 tags) and the content (as HTML) of a news article on nu.nl.

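As you can see, I can get the title, although I am not very happy with that evaluation string: it only works because the h1 happens to be the only h1 tag at that div level.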

Best answer

Here is how to do it with DOM and XPath:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://www.nu.nl/…');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
echo $xpath->evaluate('string(id("leadarticle")/div/h1)');
echo $dom->saveHTML(
    $xpath->evaluate('id("leadarticle")/div[@class="content"]')->item(0)
);

The XPath string(id("leadarticle")/div/h1) returns the textContent of the h1 that is a child of a div, which in turn is a child of the element with the id leadarticle.
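To make the behaviour of string() concrete, here is a minimal sketch that runs the same expression against a small inline document. The markup in $html is hypothetical and only assumes the structure described above; it is not the real nu.nl page.

```php
<?php
// Hypothetical markup standing in for the real page; only the
// id/div/h1 nesting described in the answer is assumed.
$html = '<html><body><div id="leadarticle">'
      . '<div class="header"><h1>Example title</h1></div>'
      . '<div class="content"><p>First paragraph.</p></div>'
      . '</div></body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// string() converts the first matching node to its text content.
$title = $xpath->evaluate('string(id("leadarticle")/div/h1)');

// When nothing matches, string() yields an empty string rather than an error.
$missing = $xpath->evaluate('string(id("leadarticle")/div/h2)');

echo $title, PHP_EOL;
var_dump($missing === '');
```

Because string() returns an empty string for a non-matching path, checking the result against '' is one way to detect that the page structure differs from what the expression expects.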

The XPath id("leadarticle")/div[@class="content"] returns the div whose class attribute is content and which is a child of the element with the id leadarticle.
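Note that [@class="content"] compares the whole attribute value, so it would not match an element carrying several classes. The sketch below contrasts it with a commonly used token-matching idiom; that idiom is not part of the original answer, and the markup (including the extra class wide) is hypothetical.

```php
<?php
// Hypothetical markup: the content div carries a second class, "wide".
$html = '<div id="leadarticle"><div class="content wide"><p>Body</p></div></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Exact comparison of the whole attribute value: "content wide" != "content".
$exact = $xpath->query('id("leadarticle")/div[@class="content"]');

// Token match: pad the normalized class list with spaces and look for " content ".
$token = $xpath->query(
    'id("leadarticle")/div[contains(concat(" ", normalize-space(@class), " "), " content ")]'
);

echo $exact->length, PHP_EOL;
echo $token->length, PHP_EOL;
```

If the target element is known to have exactly one class, as the answer assumes, the shorter [@class="content"] form is sufficient.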

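Because you want the outer HTML of the content div, you have to fetch the entire node and not just its text content, which is why there is no string() function in that XPath. Passing the node to the DOMDocument::saveHTML() method (which is only possible as of PHP 5.3.6) then serializes that node back to HTML.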
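The difference between the serialized node and its text content can be sketched as follows. The markup is hypothetical, and the loop that builds $inner is an illustrative addition for the case where only the children's HTML is wanted; the original answer only shows the outer HTML.

```php
<?php
// Hypothetical markup with the structure assumed in the answer.
$html = '<div id="leadarticle"><div class="content"><p>Hello <b>world</b></p></div></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$content = $xpath->evaluate('id("leadarticle")/div[@class="content"]')->item(0);

// Outer HTML: the node itself, including its own <div> tags (PHP >= 5.3.6).
$outer = $dom->saveHTML($content);

// Inner HTML (illustrative): serialize each child node and concatenate.
$inner = '';
foreach ($content->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

// Text only: string() drops all markup.
$text = $xpath->evaluate('string(id("leadarticle")/div[@class="content"])');

echo $outer, PHP_EOL, $inner, PHP_EOL, $text, PHP_EOL;
```

On a real page it is worth checking that item(0) did not return null before passing it to saveHTML(), since the query may match nothing.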

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/7324620/
