java - How to extract separate parts of HTML using Jsoup?

Tags: java android html jsoup

I have used a few Jsoup methods to get a string containing part of a web page's HTML:

protected String doInBackground(String... arguments) {
    // The first argument is the URL of the news page
    String newsurl = arguments[0];

    Document doc = null;
    try {
        doc = Jsoup.connect(newsurl).get();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (NullPointerException e) {
        e.printStackTrace();
    }

    // string1 is presumably a String field of the enclosing class
    if (doc != null) {
        Elements myElements = doc.getElementsByClass("news_list");
        string1 = myElements.toString();
        Log.i("ELEMENTS HTML", string1);
    } else {
        string1 = "FAILED";
    }
    return string1;
}

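However, I can't find a method on the Elements class that would let me split this HTML further into parts I can turn into strings. I have a feeling my approach is wrong.

The part of the HTML I want to work with looks like this: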

<table class="news_list" cellspacing="0" cellpadding="0" border="0" id="ctl00_cphInnerPage_cntrlNewsList_gvNews" style="border-width:0px;width:100%;border-collapse:collapse;">
    <tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl02_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=462"><img src="/mc_newsdata/photos/635254712252165967_thumb.jpg" style="border-width:0px;" /></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl02_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=462">1/16/2014</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl02_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=462">Science Fair</a>
                                </div>
                                <div class="summary">
                                    The annual Science Fair of the American College of Sofia took place on Wednesday, January 15. You could see photos of some of the incredible projects and experiments in our photo gallery.
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl02_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=462">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl02_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl03_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=461"></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl03_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=461">1/10/2014</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl03_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=461">ACS Students’ Results from PISA 2012</a>
                                </div>
                                <div class="summary">
                                    ACS recently received the official results of our students&rsquo; performance at the Programme for International Student Assessment (PISA) 2012. PISA is a triennial international survey developed by the Organisation for Economic Co-operation and Development (OECD) that takes place since 2000. It evaluates education systems worldwide by testing the skills and knowledge of 15-16-year-old students in the key subjects: reading, mathematics and science, with a focus on one subject in each year of assessment. In 2012, the assessment focused on students&rsquo; knowledge in mathematics. 
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl03_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=461">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl03_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl04_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=458"></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl04_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=458">12/20/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl04_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=458">PHOTOS FROM THE CHRISTMAS CONCERT AND THE ALUMNI RECEPTION</a>
                                </div>
                                <div class="summary">
                                    You can see some great photos from the amazing Annual Christmas Concert taken by Konstantin Karchev from 11 Grade, as well as some photos from the Alumni Reception by visiting the photogallery of the website.
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl04_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=458">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl04_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl05_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=457"></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl05_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=457">12/19/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl05_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=457">THREE ACS MEDAL-WINNERS MEET PRESIDENT PLEVNELIEV</a>
                                </div>
                                <div class="summary">
                                    On December 16, the Third Olympic Meeting of the members of the national student science teams with Bulgarian President Rosen Plevneliev took place. Three ACS students, well-known in the ACS community for their successes in science, were among the invited: Viktor Kouzmanov 12/4, Konstantin Karchev 11/4, and Mihaela Zaharieva from the Class of 2013. 
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl05_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=457">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl05_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl06_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=456"><img src="/mc_newsdata/photos/635228847467352694_thumb.jpg" style="border-width:0px;" /></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl06_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=456">12/17/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl06_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=456">ACS Debaters with an Award from a National Debate Tournament</a>
                                </div>
                                <div class="summary">
                                    This past weekend ACSers from the Debate Club (with faculty advisors Adam Saligman, Milka Getsovska, and Michael Deegan) took part along with students from 14 other schools from all over the country in the first national Bulgarian Forensic League (&ldquo;BFL&rdquo;) tournament of the year. An ACS team consisting of students Adelina Ivanova (11/7), Veselin Nanov (10/2), and Mihail Georgiev (10/7) won the first prize in the &quot;Karl Popper Debate&quot; varsity category, a specific format involving a team of three debating another team of three, all in the age group of Grades 10 to 12. Congratulations to Adelina, Veselin, Mihail, and their faculty advisors for their great achievement!
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl06_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=456">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl06_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl07_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=455"></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl07_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=455">12/13/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl07_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=455">ACS Senior Victorious at an International Physics Olympiad</a>
                                </div>
                                <div class="summary">
                                    Last Saturday, the Bulgarian Physics Team featuring ACS senior Victor Kouzmanov returned with the special Grand Prix team prize, one silver, and two bronze medals from the International Experimental Physics Olympiad held in Moscow November 27 through December 6. Congratulations and lots of success for the future to Victor and his teammates!
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl07_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=455">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl07_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl08_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=453"></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl08_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=453">12/4/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl08_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=453">ACS Alumnus Won a Prestigious Trading Competition in the US</a>
                                </div>
                                <div class="summary">
                                    Congratulations to Kubrat Danailov of the ACS Class of 2011 on winning the prestigious Intercollegiate Trading Competition held in Boston, USA last month after competing with 100 other students from some of the best universities in the USA - MIT, Harvard, UPenn, Princeton, Yale, Columbia, Cornell, UChicago, Wellesley, Baruch, NYU, and Boston University.
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl08_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=453">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl08_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl09_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=452"><img src="/mc_newsdata/photos/635210613621441367_thumb.jpg" style="border-width:0px;" /></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl09_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=452">11/26/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl09_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=452">ACS OPEN VOLLEYBALL TOURNAMENT RESULTS</a>
                                </div>
                                <div class="summary">
                                    The ACS OPEN Volleyball Tournament 2013 took place between Nov 18 through 24. <br/><br/>Below you can see the final standings for boys and girls, as well as the MVP awards winners:
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl09_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=452">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl09_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl10_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=451"><img src="/mc_newsdata/photos/635209646534734186_thumb.jpg" style="border-width:0px;" /></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl10_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=451">11/22/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl10_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=451">ART EXHIBITION</a>
                                </div>
                                <div class="summary">
                                    The latest Art Exhibition is posted in the Art Gallery in Sanders Hall. It shows works of ACS students drawn in the elective Art classes. <br/>
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl10_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=451">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl10_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr><tr>
        <td>
                <table cellpadding="0" cellspacing="0" width="100%" border="0">
                    <tr>
                        <td>
                            <div class="news_list_image" style="float:left; " >
                                <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl11_lnkNewsImg" class="wborder" href="/NewsDetails.aspx?cat_id=1&amp;news_id=449"><img src="/mc_newsdata/photos/635204602267379593_thumb.jpg" style="border-width:0px;" /></a>                                
                            </div>
                            <div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl11_lnkDatePosted" class="home_date sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=449">11/19/2013</a>
                                </div>
                                <div>
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl11_lnkTitle" class="home_title sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=449">Day of Tolerance at ACS</a>
                                </div>
                                <div class="summary">
                                    Today, November 19, ACS Club Embrace is organizing a series of events to mark the International Day of Tolerance celebrated on November 16 since 1995. After the discussion held during advisory periods and the lunch happening at Ostrander Foyer (see photo) the event will be marked by a screening at 3:30 PM of short movies dedicated to the subject of tolerance. All members of the ACS community are welcome to see the thought-provoking short movies!
                                </div>
                                <div style="text-align:right;">
                                    <a id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl11_hplReadMore" class="more sitelink" href="/NewsDetails.aspx?cat_id=1&amp;news_id=449">more<img id="ctl00_cphInnerPage_cntrlNewsList_gvNews_ctl11_imgMore" src="/App_Themes/Default/images/more_new.gif" style="height:8px;width:8px;border-width:0px;" /></a>
                                </div>
                            </div>
                        </td>
                    </tr>                        
                </table>                                                        
            </td>
    </tr>

</table>

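I want to extract each news item's title, date, link, and content into arrays/strings, and also get the link to the image.

Thanks in advance for your help!

Edit: It just occurred to me that each of these information nodes has its own unique class name, so in theory I could search by it. But the Elements class has no method similar to getElementsByClass.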

Best answer

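Since you know what the child elements are, you can select them by tag. In this case you need to process all the child tables, which hold the values you want.

Note that getElementsByTag also matches the element it is called on, so calling it on the news_list table would return that outer table as well and the first entry would be processed twice. A descendant selector returns only the nested tables, so change your Elements to:

Elements myElements = doc.select("table.news_list table");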

Now iterate over the elements to get the individual values:

for (Element el : myElements) {
    Element title = el.getElementsByClass("home_title").first();
    Element date = el.getElementsByClass("home_date").first();
    // The div with class news_list_image wraps the <a> that carries the link
    Element link = el.getElementsByClass("news_list_image").first();

    System.out.println(title.text());
    System.out.println(date.text());
    System.out.println(link.child(0).attr("href"));
    System.out.println();
}

Output:

Science Fair
1/16/2014
/NewsDetails.aspx?cat_id=1&news_id=462

ACS Students’ Results from PISA 2012
1/10/2014
/NewsDetails.aspx?cat_id=1&news_id=461

PHOTOS FROM THE CHRISTMAS CONCERT AND THE ALUMNI RECEPTION
12/20/2013
/NewsDetails.aspx?cat_id=1&news_id=458
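
The question also asks for the summary text and the image link, which the loop above does not print. Here is a minimal sketch of one way to collect all five values per entry, assuming jsoup 1.11.1 or later (for selectFirst) is on the classpath. NewsParser, NewsItem, the https://example.com/ base URL, and the thumbnail path in main are hypothetical names introduced for illustration. The base URL is only there so that absUrl can turn the relative href and src values into absolute ones.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NewsParser {

    /** Plain holder for one news entry; the class and field names are illustrative. */
    public static final class NewsItem {
        public final String title;
        public final String date;
        public final String link;
        public final String summary;
        public final String imageUrl; // empty when the entry has no thumbnail

        NewsItem(String title, String date, String link, String summary, String imageUrl) {
            this.title = title;
            this.date = date;
            this.link = link;
            this.summary = summary;
            this.imageUrl = imageUrl;
        }
    }

    /** Parses the news table; baseUri is used to resolve relative href/src values. */
    public static List<NewsItem> parse(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<NewsItem> items = new ArrayList<>();
        // "table.news_list table" matches only the nested tables, one per news entry.
        for (Element entry : doc.select("table.news_list table")) {
            Element title = entry.selectFirst("a.home_title");
            Element date = entry.selectFirst("a.home_date");
            if (title == null || date == null) {
                continue; // skip anything that does not look like a news entry
            }
            Element summary = entry.selectFirst("div.summary");
            Element image = entry.selectFirst("div.news_list_image img");
            items.add(new NewsItem(
                    title.text(),
                    date.text(),
                    title.absUrl("href"),
                    summary == null ? "" : summary.text(),
                    image == null ? "" : image.absUrl("src")));
        }
        return items;
    }

    public static void main(String[] args) {
        // Tiny stand-in for the page in the question (the thumbnail path is made up).
        String html = "<table class=\"news_list\"><tr><td><table><tr><td>"
                + "<div class=\"news_list_image\"><a class=\"wborder\""
                + " href=\"/NewsDetails.aspx?cat_id=1&amp;news_id=462\">"
                + "<img src=\"/mc_newsdata/photos/thumb.jpg\"></a></div>"
                + "<div><a class=\"home_date sitelink\""
                + " href=\"/NewsDetails.aspx?cat_id=1&amp;news_id=462\">1/16/2014</a></div>"
                + "<div><a class=\"home_title sitelink\""
                + " href=\"/NewsDetails.aspx?cat_id=1&amp;news_id=462\">Science Fair</a></div>"
                + "<div class=\"summary\">The annual Science Fair took place.</div>"
                + "</td></tr></table></td></tr></table>";

        for (NewsItem item : parse(html, "https://example.com/")) {
            System.out.println(item.title);
            System.out.println(item.date);
            System.out.println(item.link);
            System.out.println(item.summary);
            System.out.println(item.imageUrl);
        }
    }
}
```

When the Document comes from Jsoup.connect(newsurl).get(), its base URI is already the fetched URL, so absUrl should work there without passing a base URI by hand. Entries without a thumbnail end up with an empty imageUrl rather than causing a NullPointerException.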

For more on java - How to extract separate parts of HTML using Jsoup?, see the similar question on Stack Overflow: https://stackoverflow.com/questions/21511004/
