php - How do I scrape dynamic content from a website and save it?

Tags: php javascript mysql server-side screen-scraping

For example, I need to scrape the amount of free storage from http://gmail.com/:

Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage.

I then want to store that number in a MySQL database. As you can see, the number changes dynamically.

Is there a way to set up a server-side script that fetches the number every time it changes and saves it to the database?

Thanks.

Best answer

Since Gmail does not provide any API for getting this information, it sounds like you want to do some web scraping.

Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites

There are many ways to do this, as described in the Wikipedia article on web scraping quoted above:

Human copy-and-paste: Sometimes even the best Web-scraping technology can not replace human’s manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites for scraping explicitly setup barriers to prevent machine automation.

Text grepping and regular expression matching: A simple yet powerful approach to extract information from Web pages can be based on the UNIX grep command or regular expression matching facilities of programming languages (for instance Perl or Python).

HTTP programming: Static and dynamic Web pages can be retrieved by posting HTTP requests to the remote Web server using socket programming.

DOM parsing: By embedding a full-fledged Web browser, such as the Internet Explorer or the Mozilla Web browser control, programs can retrieve the dynamic contents generated by client side scripts. These Web browser controls also parse Web pages into a DOM tree, based on which programs can retrieve parts of the Web pages.

HTML parsers: Some semi-structured data query languages, such as the XML query language (XQL) and the hyper-text query language (HTQL), can be used to parse HTML pages and to retrieve and transform Web content.

Web-scraping software: There are many Web-scraping software available that can be used to customize Web-scraping solutions. These software may provide a Web recording interface that removes the necessity to manually write Web-scraping codes, or some scripting functions that can be used to extract and transform Web content, and database interfaces that can store the scraped data in local databases.

Semantic annotation recognizing: The Web pages may embrace metadata or semantic markups/annotations which can be made use of to locate specific data snippets. If the annotations are embedded in the pages, as Microformat does, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, are stored and managed separated to the Web pages, so the Web scrapers can retrieve data schema and instructions from this layer before scraping the pages.
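
Of the techniques quoted above, text grepping and HTML parsing are the easiest to try in PHP. The following is a minimal sketch, assuming the page's HTML has already been downloaded into a string and that the number appears in the static markup exactly as in the question's snippet. The function names extractQuotaRegex and extractQuotaDom are made up for illustration, and the DOM variant assumes PHP's dom extension is enabled.

```php
<?php
// Sample markup copied from the question; a real script would fetch the page first.
$html = 'Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage.';

/** Text grepping: pull the number out with a regular expression. */
function extractQuotaRegex(string $html): ?float
{
    // Accepts <span id=quota>, <span id="quota"> and <span id='quota'>.
    $pattern = '~<span\s+id=["\']?quota["\']?\s*>\s*([0-9]+(?:\.[0-9]+)?)\s*</span>~i';

    return preg_match($pattern, $html, $matches) === 1 ? (float) $matches[1] : null;
}

/** HTML parsing: build a DOM tree and look the element up by its id attribute. */
function extractQuotaDom(string $html): ?float
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//span[@id="quota"]');
    if ($nodes === false || $nodes->length === 0) {
        return null; // element not present in the markup
    }

    $text = trim($nodes->item(0)->textContent);

    return is_numeric($text) ? (float) $text : null;
}

var_dump(extractQuotaRegex($html));
var_dump(extractQuotaDom($html));
```

Both functions only see what is in the HTML they are given. If the value is filled in or updated by client-side JavaScript, neither will observe the updates, which is the case the embedded-browser approach in the list is meant for.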

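Before I go on, please keep in mind the legal implications of all this. I don't know whether it complies with Gmail's terms, and I recommend checking them before you proceed. You could also end up blacklisted or run into other similar problems.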

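All that said, in your case you need some kind of spider and a DOM parser to log in to Gmail and find the data you want. The choice of tool will depend on your technology stack.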

As a Ruby developer, I like using Mechanize and Nokogiri. With PHP, you could look at a solution like Sphider.
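
A server-side script cannot be notified when a number on someone else's page changes, so one common approach is to poll on a schedule (for example from cron) and write a row only when the value differs from the last one stored. The sketch below shows just that storage step with PDO. The table quota_history, its column quota_mb, and the function storeIfChanged are hypothetical names, and fetching and parsing the page is assumed to happen elsewhere.

```php
<?php
// Hypothetical MySQL schema assumed by this sketch:
//   CREATE TABLE quota_history (
//     id INT AUTO_INCREMENT PRIMARY KEY,
//     quota_mb DECIMAL(16,6) NOT NULL,
//     recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
//   );

/**
 * Store $quotaMb only if it differs from the most recently stored value.
 * Returns true when a new row was inserted, false when nothing changed.
 */
function storeIfChanged(PDO $pdo, float $quotaMb): bool
{
    $last = $pdo->query(
        'SELECT quota_mb FROM quota_history ORDER BY id DESC LIMIT 1'
    )->fetchColumn();

    // fetchColumn() returns false when the table is still empty.
    if ($last !== false && abs((float) $last - $quotaMb) < 0.0000005) {
        return false; // unchanged at the six-decimal precision of the column
    }

    $insert = $pdo->prepare('INSERT INTO quota_history (quota_mb) VALUES (:quota)');
    $insert->execute([':quota' => $quotaMb]);

    return true;
}

// Example wiring, with placeholder credentials:
// $pdo = new PDO('mysql:host=localhost;dbname=scraper;charset=utf8mb4', 'user', 'secret',
//     [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// storeIfChanged($pdo, 2757.272164);
```

Run every few minutes, a script built around this function records each distinct value once, at the granularity of the polling interval. Changes that happen and revert between two runs are not captured.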

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/2645828/
