php - How do I 'scrape' content from a page's source?

Tags: php scrape

I have this code that fetches the HTML source of a page:

$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);

I want to scrape some content from it. For example, suppose the page source contains the following:

<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br /><br />Pinging <strong>newsgator.com</strong><br />
Done<br /><br />Pinging <strong>blo.gs</strong><br />
Done<br /><br />Pinging <strong>feedburner.com</strong><br />
Done<br /><br />Pinging <strong>blogstreet.com</strong><br />
Done<br /><br />Pinging <strong>my.yahoo.com</strong><br />
Connection failed<br /><br />Pinging <strong>moreover.com</strong><br />
Connection failed<br /><br />Pinging <strong>newsisfree.com</strong><br />
Done<br />

Is there a way to scrape this out of the source and store it in a variable, so that it looks like this:

technorati.com Connection failed
icerocket.com Connection failed
weblogs.com Done
Etc.

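Of course, the page is dynamic, which is why I'm having trouble. Could I search the source for each site name? But then how would I get the result that follows it (Connection failed / Done)?
Thanks a lot for your help!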

Best answer

I have scraped several sites using the Simple HTML DOM PHP library, which is available here: http://simplehtmldom.sourceforge.net/

You can then use code like this:

<?php
include_once 'simple_html_dom.php';

$url = "http://slashdot.org/";
$html = file_get_html($url);

//remove additional spaces
$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

foreach ($html->find('h2') as $heading) { // for each heading
    // find the first <a> inside a <span>; skip headings that have none
    $link = $heading->find('span a', 0);
    if ($link) {
        // echo the link text with extra whitespace removed
        echo preg_replace($pat, $rep, $link->plaintext) . "\n";
    }
}
?>

The result looks like this:

5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents
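
For the ping-status markup in the question, a similar result can be sketched without any external library. This is only an illustration, not the approach from the answer above: it swaps Simple HTML DOM for a regular expression with PHP's built-in preg_match_all, and it assumes every site name sits in a <strong> tag followed by a <br /> and then the status text. The function name parse_ping_results is hypothetical, and the typed signature assumes PHP 7 or later. It has to run on the raw HTML, before htmlentities() is applied, because htmlentities() turns the angle brackets into &lt; and &gt;.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Returns an array mapping each site name to the status text that follows it.
function parse_ping_results(string $html): array
{
    // Group 1: text inside <strong>...</strong> (the site name).
    // Group 2: text between the following <br /> and the next <br (the status).
    $pattern = '~<strong>([^<]+)</strong>\s*<br\s*/?>\s*([^<]+?)\s*<br~i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $results = [];
    foreach ($matches as $m) {
        $results[trim($m[1])] = trim($m[2]);
    }
    return $results;
}

// Sample input copied from the first lines of the markup in the question.
$html = <<<'HTML'
<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br />
HTML;

$results = parse_ping_results($html);

// Print one "site status" pair per line.
foreach ($results as $site => $status) {
    echo $site . ' ' . $status . "\n";
}
```

Because the pattern depends on the exact tag layout, it will stop matching if the page changes its markup; a DOM parser is usually more tolerant of such changes.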

Regarding "php - How do I 'scrape' content from a page's source?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/7321474/
