javascript - Can JavaScript read the source code of any web page?

Tags: javascript html

I am working on screen scraping and want to retrieve the source code of a particular page.

How can I do this with JavaScript? Please help.

Best answer

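A simple way to get started is jQuery's .load(), which fetches a page and inserts the fragment that matches a selector: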

$("#links").load("/Main_Page #jq-p-Getting-Started li");

More information is in the jQuery docs.
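The string passed to .load() above packs two things together: the URL to fetch and, after the first space, a selector for the fragment to keep. The following is a minimal sketch of that split in plain JavaScript. `splitLoadArgument` is a hypothetical helper written for illustration; it is not part of jQuery's API.

```javascript
// Hypothetical helper: divides a .load()-style argument into a URL and
// an optional fragment selector at the first space.
function splitLoadArgument(arg) {
  const spaceIndex = arg.indexOf(" ");
  if (spaceIndex === -1) {
    // No space: the whole string is the URL and no fragment is selected.
    return { url: arg, selector: null };
  }
  return {
    url: arg.slice(0, spaceIndex),
    selector: arg.slice(spaceIndex + 1).trim(),
  };
}

const parts = splitLoadArgument("/Main_Page #jq-p-Getting-Started li");
console.log(parts.url);      // "/Main_Page"
console.log(parts.selector); // "#jq-p-Getting-Started li"
```

Because the URL here is relative, the request goes to the same site that served the page.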

Another, more structured way to do screen scraping is to use YQL (Yahoo Query Language). It returns the scraped data structured as JSON or XML.
For example,
let's scrape stackoverflow.com:

select * from html where url="http://stackoverflow.com"

will give you a JSON structure like this (I chose the JSON option):

 "results": {
   "body": {
    "noscript": [
     {
      "div": {
       "id": "noscript-padding"
      }
     },
     {
      "div": {
       "id": "noscript-warning",
       "p": "Stack Overflow works best with JavaScript enabled"
      }
     }
    ],
    "div": [
     {
      "id": "notify-container"
     },
     {
      "div": [
       {
        "id": "header",
        "div": [
         {
          "id": "hlogo",
          "a": {
           "href": "/",
           "img": {
            "alt": "logo homepage",
            "height": "70",
            "src": "http://i.stackoverflow.com/Content/Img/stackoverflow-logo-250.png",
            "width": "250"
           }
……..

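The beauty of this is that you can use projections and where clauses, so you end up with structured scraped data containing only what you need (and far less data going over the wire).
For example,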

select * from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

will get you

 "results": {
   "a": [
    {
     "href": "/questions/414690/iphone-simulator-port-for-windows-closed",
     "title": "Duplicate: Is any Windows simulator available to test iPhone application? as a hobbyist who cannot afford a mac, i set up a toolchain kit locally on cygwin to compile objecti … ",
     "content": "iphone\n                simulator port for windows [closed]"
    },
    {
     "href": "/questions/680867/how-to-redirect-the-web-page-in-flex-application",
     "title": "I have a button control ....i need another web page to be redirected while clicking that button .... how to do that ? Thanks ",
     "content": "How\n                to redirect the web page in flex application ?"
    },
…..

Now, to get only the titles, we do

select title from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

Note the title in the projection:

 "results": {
   "a": [
    {
     "title": "I don't want the function to be entered simultaneously by multiple threads, neither do I want it to be entered again when it has not returned yet. Is there any approach to achieve … "
    },
    {
     "title": "I'm certain I'm doing something really obviously stupid, but I've been trying to figure it out for a few hours now and nothing is jumping out at me. I'm using a ModelForm so I can … "
    },
    {
     "title": "when i am going through my project in IE only its showing errors A runtime error has occurred Do you wish to debug? Line 768 Error:Expected')' Is this is regarding any script er … "
    },
    {
     "title": "I have a java batch file consisting of 4 execution steps written for analyzing any Java application. In one of the steps, I'm adding few libs in classpath that are needed for my co … "
    },
    {
……

Once you have written your query, it generates a URL for you:

http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fdiv%2Fh3%2Fa'%0A%20%20%20%20&format=json&callback=cbfunc

in our case.
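The generated URL is the query text percent-encoded into the `q` parameter, followed by the `format` and `callback` parameters. As a rough sketch of how such a URL could be assembled by hand, assuming the endpoint and parameter names seen in the URL above (`buildYqlUrl` is a hypothetical helper, not something YQL provides):

```javascript
// Endpoint copied from the generated URL above.
const YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

// Hypothetical helper: percent-encodes a YQL statement into a request URL.
function buildYqlUrl(query, callbackName) {
  let url = YQL_ENDPOINT + "?q=" + encodeURIComponent(query) + "&format=json";
  if (callbackName) {
    // The generated URL above carries the callback name in this parameter.
    url += "&callback=" + encodeURIComponent(callbackName);
  }
  return url;
}

const query =
  "select title from html where url=\"http://stackoverflow.com\" and xpath='//div/h3/a'";
console.log(buildYqlUrl(query, "cbfunc"));
```

The query here is written on one line, so the result lacks the encoded newline and indentation (`%0A%20%20...`) that appear in the generated URL.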

So in the end you do something like this

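// $.getJSON is asynchronous: the parsed response arrives in the callback, not in the return value.
$.getJSON(theAboveUrl, function (titleList) { /* use titleList here */ });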

and play with it.
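Playing with the data mostly means walking the "results" object. The sketch below pulls the title strings out of an object shaped like the projection output shown earlier. `extractTitles` is a hypothetical helper, and the sample titles are placeholders, not real output. Where "results" sits inside the full response is not shown in this answer, so the function takes the "results" object directly.

```javascript
// Hypothetical helper: collects the "title" strings from a "results"
// object shaped like the projection output shown earlier.
function extractTitles(results) {
  const anchors = (results && results.a) || [];
  // Accept a single object as well as an array (a defensive assumption).
  const list = Array.isArray(anchors) ? anchors : [anchors];
  return list.map(function (anchor) {
    return anchor.title;
  });
}

// Placeholder data for illustration only.
const sampleResults = {
  a: [
    { title: "first question summary" },
    { title: "second question summary" },
  ],
};
console.log(extractTitles(sampleResults));
```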

Pretty neat, isn't it?

This question, "Can JavaScript read the source code of any web page?", comes from Stack Overflow: https://stackoverflow.com/questions/680562/
