PHP DOM processing: code review of a little parser programme

Many thanks for running this board. I love this site and it has helped me many times. You are all great. What I did today was develop a small PHP parser.

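I need to get all the data out of this website. The target is: www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder. I am trying to scrape the page, and I need all the data reachable from that link. I want to store the data in a MySQL database so that it is easier to retrieve.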

Here is an example of the data:

Bürgerstiftung Lebensraum Aachen
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Hubert Schramm
    Alexanderstr. 69/ 71
    52062 Aachen
    Telefon: 0241 - 4500130
    Telefax: 0241 - 4500131
    Email: info@buergerstiftung-aachen.de
    www.buergerstiftung-aachen.de
    >> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Helga Kühn
    Rotkehlchenstr. 72
    28832 Achim
    Telefon: 04202-84981
    Telefax: 04202-955210
    Email: info@buergerstiftung-achim.de
    www.buergerstiftung-achim.de
    >> Weitere Details zu dieser Stiftung

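I need the data "behind" the links. Is there a way to do this with a simple, easy-to-understand parser, one that a newbie can understand and write? I could do it with XPath, in PHP or in Perl (using Mechanize).

I started with the PHP approach, but when I run the code (see below) I get this result: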

PHP Fatal error:  Call to undefined function file_get_html() in /home/martin/perl/foundations/arbie_finder_de.php on line 5
martin@suse-linux:~/perl/foundations> cd foundations

It is caused by this code:

<?php

// Create DOM from URL or file
$html = file_get_html('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

// split it via body, so you only get to the contents inside body tag
$split = split('<body>', $html);
// it is usually in the top of the array but just check to be sure
$body = $split[1];
// split again with, say,<p class="divider">A</p>
$split = split('<p class="divider">A</p>', $body);
// now this should contain just the data table you want to process
$data = $split[1];

// Find all links from original html
foreach($html->find('a') as $element) {
       $link = $element->href;

       // check if this link is in our data table
       if(substr_count($data, $link) > 0) {
           // link is in our data table, follow the link
           $html = file_get_html($link);
          // do what you have to do
       }
}


?>

Some thoughts on my approach:

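The standard approach to scraping a page is:

1. Read the page into a string (file_get_html or whatever is being used these days).
2. Split the string, depending on the page structure. First split it on <body>, so that one element of the array contains the body, and so on until we reach the target. I guess the final split would be on <p class="divider">A</p>, since it has the links described above.
3. To follow a link, simply repeat the same process with that link.

Alternatively, we can look for a PHP snippet that gets all the links in a page. That works even better once steps 1 and 2 are done and only the string inside the target tag is left; it makes things much simpler.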
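
The steps above can also be sketched with PHP's built-in DOM extension instead of string splitting. This is only a minimal sketch under assumptions: the inline sample markup, the `extract_links` helper name and the `//div[@id="content"]//a` XPath are made up for illustration and may not match the real structure of the target page.

```php
<?php
// Hypothetical sample markup; the real page structure may differ.
$sampleHtml = <<<HTML
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<div id="content">
  <dl>
    <dt>Buergerstiftung Achim</dt>
    <dd><a href="/details/achim">Weitere Details</a></dd>
  </dl>
</div>
</body></html>
HTML;

/**
 * Return the href attribute of every node matched by $xpathQuery.
 */
function extract_links(string $html, string $xpathQuery): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is rarely valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query($xpathQuery) as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Step 1: the page is already in a string (file_get_contents() would fetch a URL).
// Step 2: narrow down to the content area instead of splitting on <body>.
// Step 3: collect only the links inside that area, ready to be followed.
$detailLinks = extract_links($sampleHtml, '//div[@id="content"]//a');
print_r($detailLinks);
```

Selecting by XPath keeps the navigation link out of the result without any manual string splitting, so a change in the surrounding markup is less likely to break the scraper.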

So my question is: what causes this error? I haven't a clue. If you have an idea, that would be great.

Update: I could try this.

Admittedly, it is no simpler than using simple_html_dom.

$records = array();
foreach($html->find('#content dl') as $contact) {
   $record = array();
   $record["name"] = $contact->find("dt", 0)->plaintext;
   foreach($contact->find("dd") as $field) {
       /* parse each $field->plaintext in order to obtain $fieldname */
       $record[$fieldname] = $field->plaintext;
   }
   $records[] = $record;
}
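
The comment in the loop above leaves open how `$fieldname` is obtained from each line. One possible way, shown here as a hedged sketch: the `parse_field` helper and the `other` fallback label are hypothetical names, and the pattern only assumes that labelled lines look like "Telefon: 0241 - 4500130" as in the sample data.

```php
<?php
/**
 * Split a line such as "Telefon: 0241 - 4500130" into [label, value].
 * Lines without a "Label:" prefix get the hypothetical fallback label.
 */
function parse_field(string $text, string $fallback = 'other'): array
{
    $text = trim($text);
    // Label = letters only, followed by a colon; value = the rest of the line.
    if (preg_match('/^(\p{L}+):\s*(.+)$/u', $text, $m)) {
        return [strtolower($m[1]), $m[2]];
    }
    return [$fallback, $text];
}

[$fieldname, $value] = parse_field('Telefon: 0241 - 4500130');
echo $fieldname, ' => ', $value, PHP_EOL;
```

Inside the loop this would become `[$fieldname, $value] = parse_field($field->plaintext);`. Unlabelled lines such as the street and the postcode all map to the same fallback key, so they would need to be appended to a list rather than assigned if none of them should be overwritten.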

I will try to work from here. Maybe I should use a recent version of PHP to get jQuery-like syntax...

Any ideas?

Best answer

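I do want to point out that, before you consider scraping any website, you need to think about the legal and ethical implications of doing so. If it is not your site, or you do not have the owner's permission, you probably should not scrape it. You especially should not scrape it if the result is not for personal use. Please be careful...

First, you need to add a semicolon (;) after $data = $split[1], which gets rid of the PHP syntax error. I am a bit puzzled by the first error, which refers to a *, since there is no * anywhere in the code.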

Once the syntax error is gone, it looks like you are ready to write your MySQL queries and insert what you have found.
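
As a rough illustration of that insert step, here is a minimal sketch using PDO with a prepared statement. The `save_records` helper, the `foundations` table, its `name` and `email` columns, and the connection details in the usage comment are all assumptions for illustration, not something taken from the question.

```php
<?php
/**
 * Insert scraped records with a prepared statement and return how many
 * rows were written. Table and column names are hypothetical.
 */
function save_records(PDO $pdo, array $records): int
{
    $stmt = $pdo->prepare(
        'INSERT INTO foundations (name, email) VALUES (:name, :email)'
    );
    $count = 0;
    foreach ($records as $record) {
        $stmt->execute([
            ':name'  => $record['name'],
            ':email' => $record['email'] ?? null, // a missing email is stored as NULL
        ]);
        $count += $stmt->rowCount();
    }
    return $count;
}

// Usage (placeholder credentials; adjust to your own database):
// $pdo = new PDO('mysql:host=localhost;dbname=scraper;charset=utf8mb4', 'user', 'secret');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// echo save_records($pdo, $records), " rows inserted\n";
```

Binding the values as parameters, rather than concatenating them into the SQL string, keeps scraped text containing quotes from breaking the query.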

You might also consider the following:

foreach($html->find('a') as $element) 
   echo $element->href;
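
The fatal error quoted in the question says that file_get_html() is undefined, which is what PHP reports when the Simple HTML DOM library file has not been loaded. A hedged sketch of the loop above with that include in place follows. It assumes that simple_html_dom.php has been downloaded next to the script; the `list_links` helper name and the `http://` scheme added to the URL are also assumptions.

```php
<?php
// Assumption: simple_html_dom.php was downloaded into the script's directory.
$library = __DIR__ . '/simple_html_dom.php';
if (is_file($library)) {
    require_once $library;
}

/**
 * Return every href on the page, or null if the library or page is missing.
 */
function list_links(string $url): ?array
{
    if (!function_exists('file_get_html')) {
        return null; // library not loaded: the situation behind the fatal error
    }
    $html = file_get_html($url);
    if (!$html) {
        return null; // page could not be fetched or parsed
    }
    $links = [];
    foreach ($html->find('a') as $element) {
        $links[] = $element->href;
    }
    return $links;
}

// The URL needs its scheme; without one PHP treats it as a local file path.
$links = list_links('http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');
var_dump($links === null ? 'simple_html_dom not available or page not readable' : $links);
```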

Hope this helps.

Regarding "PHP DOM processing: code review of a little parser programme", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/6235562/
