java - How to get the image source and description from HTML data using Jsoup

Tags: java html jsoup

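I am trying to parse an Atom feed, using the ROME API to extract the entries. Each entry in the feed has a content element that holds the article's image and description. This is the feed URL: https://news.google.com/news/section?output=atom&ned=in&q=narendra%20modi . Now I want to extract the image and the description from the content section.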

 <entry>
<id>tag:news.google.com,2005:cluster=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222</id>
<title type="html">'Not Just GST Stuck In Parliament. Matter of Sorrow': PM Narendra Modi - NDTV</title>
<updated>2015-12-10T06:03:54Z</updated>
<link rel="alternate" type="text/html" href="http://news.google.com/news/url?sa=t&amp;fd=R&amp;ct2=in&amp;usg=AFQjCNE53SQd2skoJLxBTVlYWHdgDBCl7Q&amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;cid=52779006372283&amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;url=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222" hreflang="en"/>
<content type="html">&lt;table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;">&lt;tr>&lt;td width="80" align="center" valign="top">&lt;font style="font-size:85%;font-family:arial,sans-serif">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNE53SQd2skoJLxBTVlYWHdgDBCl7Q&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222">&lt;img src="//t3.gstatic.com/images?q=tbn:ANd9GcSNi4SJFo9q9PXKPOjJkiUlfk2GFRzRoBlwK6UsiSQ8np66JDvgQiYTdN4Fknntb7bVjdR-NuM" alt="" border="1" width="80" height="80">&lt;br>&lt;font size="-2">NDTV&lt;/font>&lt;/a>&lt;/font>&lt;/td>&lt;td valign="top" class="j">&lt;font style="font-size:85%;font-family:arial,sans-serif">&lt;br>&lt;div style="padding-top:0.8em;">&lt;img alt="" height="1" width="1">&lt;/div>&lt;div class="lh">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNE53SQd2skoJLxBTVlYWHdgDBCl7Q&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222">&lt;b>&amp;#39;Not Just GST Stuck In Parliament. Matter of Sorrow&amp;#39;: PM &lt;b>Narendra Modi&lt;/b>&lt;/b>&lt;/a>&lt;br>&lt;font size="-1">&lt;b>&lt;font color="#6f6f6f">NDTV&lt;/font>&lt;/b>&lt;/font>&lt;br>&lt;font size="-1">With repeated disruptions stalling legislation including the GST or Goods and Services Tax, Prime Minister &lt;b>Narendra Modi&lt;/b> today said it was a &amp;quot;matter of sorrow&amp;quot; that Parliament was not running. 
&amp;quot;It is not only GST, but many pro-poor steps are stuck in&amp;nbsp;...&lt;/font>&lt;br>&lt;font size="-1">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNEVhO7UtISsITzRIFwxTVFwK8BTDQ&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.india.com/news/india/narendra-modis-stern-message-to-congress-democracy-cannot-run-on-whims-of-some-773082/">&lt;b>Narendra Modi&amp;#39;s&lt;/b> stern message to Congress: Democracy cannot run on whims of some&lt;/a>&lt;font size="-1" color="#6f6f6f">&lt;nobr>India.com&lt;/nobr>&lt;/font>&lt;/font>&lt;br>&lt;font size="-1">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNGkBqqpn2OhEI6w68lLCIXMDppu-Q&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.mid-day.com/articles/jagran-forum-catch-pm-narendra-modi-other-leaders-live/16757192">Jagran Forum: Catch PM &lt;b>Narendra Modi&lt;/b>, other leaders live&lt;/a>&lt;font size="-1" color="#6f6f6f">&lt;nobr>Mid-Day&lt;/nobr>&lt;/font>&lt;/font>&lt;br>&lt;font size="-1">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNHPkB8Wy_-cDqqZrdfcn1cVUKP-Kg&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.oneindia.com/india/democracy-cant-be-restricted-to-elections-only-narendra-modi-1951641.html">Democracy can&amp;#39;t be restricted to elections only, says &lt;b>Narendra Modi&lt;/b>&lt;/a>&lt;font size="-1" color="#6f6f6f">&lt;nobr>Oneindia&lt;/nobr>&lt;/font>&lt;/font>&lt;br>&lt;font size="-1" class="p">&lt;a 
href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNFhxDKEsImpQqu0GccMt4MCiPydVw&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.abplive.in/india-news/everyone-must-feel-he-or-she-is-working-for-indias-progress-says-narendra-modi-258229">&lt;nobr>ABP Live&lt;/nobr>&lt;/a>&lt;/font>&lt;br>&lt;font class="p" size="-1">&lt;a class="p" href="http://news.google.com/news/more?ncl=dac7xEJd70rfdkM8gcjOwSJn8BK9M&amp;amp;authuser=0&amp;amp;ned=in">&lt;nobr>&lt;b>all 29 news articles&amp;nbsp;&amp;raquo;&lt;/b>&lt;/nobr>&lt;/a>&lt;/font>&lt;/div>&lt;/font>&lt;/td>&lt;/tr>&lt;/table></content>
</entry>

For the image, I tried the following Jsoup code:

Elements img = doc.getElementsByTag("img");
for (Element el : img) {
    System.out.println("Image Found!");
    System.out.println("src attribute is : " + el.attr("src"));
}

But it returns nothing. I also don't know how to go about extracting the description:

&lt;br>&lt;font size="-1">NEW DELHI: Putting the Ufa process back on track India and Pakistan on Wednesday signaled process of reducing tensions by announcing Comprehensive Bilateral Dialogue to be led by Foreign Secretaries and prepared the ground for a visit by Prime&amp;nbsp;...&lt;/font>

Please help me solve this.

Best answer

Try this code. Note that the Atom feed is fetched directly with Jsoup.

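import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch the Atom feed and parse it with Jsoup.
Document news = Jsoup.connect("http://news.google.com/news/section?output=atom&ned=in&q=narendra%20modi").get();

int i = 0;
for (Element entryContent : news.select("entry > content")) {
    System.out.format("\n## ENTRY %d\n", ++i);
    // text() returns the unescaped content, which is itself HTML, so parse it again.
    // The selector matches images that have a src, plus the second <font> among its
    // siblings inside td.j, which holds the description in this feed's layout.
    for (Element el : Jsoup.parse(entryContent.text()).select("img[src], tr td.j font[size]:nth-of-type(2)")) {
        String elementTagName = el.tagName();

        if (elementTagName.equalsIgnoreCase("img")) {
            System.out.println("src attribute is : " + el.attr("src"));
        } else if (elementTagName.equalsIgnoreCase("font")) {
            System.out.println("description is : " + el.text());
        } else {
            System.out.println("Unexpected element >> " + el.html());
        }
    }
}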

Sample output

## ENTRY 1
src attribute is : //t0.gstatic.com/images?q=tbn:ANd9GcSLee4ulBtCEOMSuDuLHCAjDZwmlaVaXJVdC09133QbK3X1OpZH3s1RBplznEadxqV5memM0dh3
description is : With repeated disruptions stalling legislation including the GST or Goods and Services Tax, Prime Minister Narendra Modi today said it was a "matter of sorrow" that Parliament was not running. "It is not only GST, but many pro-poor steps are stuck in ...

## ENTRY 2
src attribute is : //t1.gstatic.com/images?q=tbn:ANd9GcQdJPtLOBi9F2Ktov11_x5kqHC4inID47xKD3we_ZC5rHP1Lps96sYHs_N0pBO9WkDj5KKuEa8
description is : Prime Minister Narendra Modi topped the charts of Facebook under the most-viewed

(...)
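The src values in the output start with `//`, so they are protocol-relative and need a scheme before they can be downloaded. A minimal sketch using only `java.net.URI` from the standard library follows; the class name `ThumbnailUrl`, the helper `toAbsolute`, and the choice of an HTTPS base address are assumptions made for illustration, not part of the original answer.

```java
import java.net.URI;

public class ThumbnailUrl {

    // Assumption: the feed was loaded over HTTPS from this address, so a
    // "//host/path" reference inherits the "https" scheme from it.
    static final URI FEED_BASE = URI.create("https://news.google.com/news/section");

    // Resolves a src value against the feed address. A "//host/path" reference
    // keeps its own host, path and query and only borrows the base's scheme;
    // an already absolute URL is returned unchanged.
    static String toAbsolute(String src) {
        return FEED_BASE.resolve(src).toString();
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute(
            "//t3.gstatic.com/images?q=tbn:ANd9GcSNi4SJFo9q9PXKPOjJkiUlfk2GFRzRoBlwK6UsiSQ8np66JDvgQiYTdN4Fknntb7bVjdR-NuM"));
    }
}
```

If the feed was fetched over plain HTTP instead, change the scheme in `FEED_BASE` accordingly.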

The Jsoup code above was tested on JSoup 1.8.3.
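A likely reason the `getElementsByTag("img")` attempt in the question found nothing: inside `<content type="html">` the markup is escaped (`&lt;img ...>`), so to the parser it is plain text and not a tree of elements. That is why the answer calls `text()` on the content and parses the result a second time. The sketch below illustrates the effect with the JDK's built-in XML parser instead of Jsoup, purely so it runs without extra libraries; the class name `EscapedContentDemo`, its helper methods, and the shortened sample entry are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EscapedContentDemo {

    // Hypothetical, heavily shortened stand-in for one feed entry. The markup
    // inside <content> is escaped the same way as in the entry quoted in the question.
    static final String ENTRY =
        "<entry><content type=\"html\">"
        + "&lt;table>&lt;tr>&lt;td>&lt;img src=\"//example.invalid/thumb.jpg\">&lt;/td>"
        + "&lt;td class=\"j\">&lt;font size=\"-1\">Short description&lt;/font>&lt;/td>"
        + "&lt;/tr>&lt;/table>"
        + "</content></entry>";

    // Parses ENTRY as XML and returns its <content> element.
    static Element contentElement() {
        try {
            Document xml = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(ENTRY.getBytes(StandardCharsets.UTF_8)));
            return (Element) xml.getElementsByTagName("content").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("sample entry could not be parsed", e);
        }
    }

    // Counts real <img> elements below <content> in the feed's own tree.
    static int imgElementCount() {
        return contentElement().getElementsByTagName("img").getLength();
    }

    // Returns the unescaped text of <content>, i.e. the HTML that needs a second parse.
    static String embeddedHtml() {
        return contentElement().getTextContent();
    }

    public static void main(String[] args) {
        System.out.println("img elements in the feed tree: " + imgElementCount());
        System.out.println("embedded HTML: " + embeddedHtml());
    }
}
```

The first line reports zero `img` elements, while the second shows that the unescaped text does contain an `<img src=...>` tag. Handing that string to an HTML parser, as the answer does with `Jsoup.parse(entryContent.text())`, is what makes the image and description selectable.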

Regarding java - How to get the image source and description from HTML data using Jsoup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34201254/
