node.js - Scraping unstructured HTML with Node.js

I need to scrape a static, unstructured HTML page. I am trying to fetch the content with Node.js code, but my attempts with Cheerio and with xpath have both failed.

http://static.puertos.es/pred_simplificada/Predolas/Tablas/Cnt/PAS.html

The XPath of the first element I want is /html/body/center/center/table/tbody/tr[3], and then I need the text of each TD inside that TR.
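To make the goal concrete, here is a minimal, dependency-free sketch of "take the third TR and read the text of each TD". It is not what the attempts below do: `rowCells` and the sample markup are hypothetical, and the regex approach assumes that every `<tr>` and `<td>` is explicitly closed and that tables are not nested, which may not hold for the real page.

```javascript
// Hypothetical helper: return the text of each <td> in the Nth <tr> of an HTML string.
// Naive regex scan; assumes closed, non-nested <tr>/<td> tags.
function rowCells(html, rowIndex) {
  var rows = html.match(/<tr\b[\s\S]*?<\/tr>/gi) || [];
  var row = rows[rowIndex - 1]; // XPath positions are 1-based
  if (!row) {
    return [];
  }
  var cells = [];
  var cellRe = /<td\b[^>]*>([\s\S]*?)<\/td>/gi;
  var match;
  while ((match = cellRe.exec(row)) !== null) {
    // Drop inner tags, collapse whitespace, trim the ends.
    cells.push(match[1].replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim());
  }
  return cells;
}

// Made-up sample markup, only for illustration.
var sampleHtml =
  '<table>' +
  '<tr><td>header</td></tr>' +
  '<tr><td>units</td></tr>' +
  '<tr><td> 1.5 </td><td><b>NW</b></td></tr>' +
  '</table>';

console.log(rowCells(sampleHtml, 3));
```

For markup with unclosed or nested tags, a real HTML parser is the safer choice; the sketch only pins down what output is expected from row 3.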

When I try to get the tbody node:

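      var parse5 = require('parse5');
      var xmlser = require('xmlserializer');
      var dom = require('xmldom').DOMParser;
      var xpath = require('xpath');

      // Parse the HTML, re-serialize it as XHTML, then query it with a namespaced XPath.
      var parser = new parse5.Parser();
      var document = parser.parse(response.toString());
      var xhtml = xmlser.serializeToString(document);
      var doc = new dom().parseFromString(xhtml);
      var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
      var nodes = select("//x:tbody", doc);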

I always get an empty node list, [].

With cheerio, I tried iterating over the TR elements, but as I mentioned above, without success.

      var $ = cheerio.load(response);
      $('tr').each(function(i, e) {
          console.log("Content %j", $(e));
      });

Best answer

It appears that Cheerio does not work properly on unstructured HTML that has no CSS classes or ids to select on. So I tried another workaround with YQL, following that tutorial:

select * from html where url='http://static.puertos.es/pred_simplificada/Predolas/Tablas/Cnt/PAS.html' and xpath='//html/body/center/center/table/tbody'

With YQL I got what I needed, so I will integrate it using node-yql.
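As a rough illustration of what that integration sends, the sketch below builds the request URL for the query above using only Node.js built-ins. It does not use node-yql's own API; `buildYqlUrl` is a hypothetical helper, and the endpoint is the public YQL address as it was documented at the time. Yahoo has since retired the public YQL service, so the sketch only constructs the URL and does not send a request.

```javascript
// Public YQL endpoint as historically documented (assumption: no longer served).
var YQL_ENDPOINT = 'https://query.yahooapis.com/v1/public/yql';

// Hypothetical helper: build the YQL request URL for a page URL and an XPath.
function buildYqlUrl(pageUrl, xpathExpr) {
  var query = "select * from html where url='" + pageUrl +
    "' and xpath='" + xpathExpr + "'";
  // The whole query travels URL-encoded in the q parameter; format=json asks for JSON.
  return YQL_ENDPOINT + '?q=' + encodeURIComponent(query) + '&format=json';
}

var yqlUrl = buildYqlUrl(
  'http://static.puertos.es/pred_simplificada/Predolas/Tablas/Cnt/PAS.html',
  '//html/body/center/center/table/tbody'
);

console.log(yqlUrl);
```

Keeping the query construction in one function makes it easy to swap the XPath, for example to target a single row instead of the whole tbody.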

Source: a similar question on Stack Overflow about scraping unstructured HTML with Node.js: https://stackoverflow.com/questions/33466777/
