java - Extracting data from HTML into Java objects

I have a message log from a messaging application, stored as HTML. In that file, a single message looks like this:

<div class="message">
  <div class="message_header">
    <span class="user">User Name</span>
    <span class="meta">10 february 2018 at 16:17 UTC+01</span>
  </div>
  <p>Message content</p>
</div>

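The messages in the file are not laid out neatly: a single line may contain several messages, and a line sometimes ends in the middle of a message.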

I would like to create an instance of a Message class, with fields such as userName, date and messageContent, for every message in the file. Is there an elegant way to do this?

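My plan is to iterate over the file, split each line wherever a new message starts, and then pull the data out of the resulting strings. If there is a less tedious way, I would rather avoid that.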
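For reference, the manual approach described above might look roughly like the sketch below, using only the JDK. It assumes every message has exactly the markup shown in the sample; `MessageLogParser`, its nested `Message` class and the `parse` method are illustrative names, not part of any library. Matching against the whole file as one string, instead of line by line, means it no longer matters where the line breaks fall.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts messages from the log with a regular expression.
final class MessageLogParser {

    // Hypothetical value class; the field names follow the question.
    static final class Message {
        final String userName;
        final String date;            // kept as raw text; date parsing is a separate step
        final String messageContent;

        Message(String userName, String date, String messageContent) {
            this.userName = userName;
            this.date = date;
            this.messageContent = messageContent;
        }
    }

    // \s* tolerates any whitespace (including line breaks) between tags.
    // DOTALL lets '.' match line breaks inside the captured text.
    private static final Pattern MESSAGE = Pattern.compile(
            "<div class=\"message\">\\s*"
          + "<div class=\"message_header\">\\s*"
          + "<span class=\"user\">(.*?)</span>\\s*"
          + "<span class=\"meta\">(.*?)</span>\\s*"
          + "</div>\\s*"
          + "<p>(.*?)</p>\\s*"
          + "</div>",
            Pattern.DOTALL);

    // Scans the whole document and builds one Message per match.
    static List<Message> parse(String html) {
        List<Message> messages = new ArrayList<>();
        Matcher m = MESSAGE.matcher(html);
        while (m.find()) {
            messages.add(new Message(
                    m.group(1).trim(), m.group(2).trim(), m.group(3).trim()));
        }
        return messages;
    }

    public static void main(String[] args) {
        String html = "<div class=\"message\">\n"
                + "  <div class=\"message_header\">\n"
                + "    <span class=\"user\">User Name</span>\n"
                + "    <span class=\"meta\">10 february 2018 at 16:17 UTC+01</span>\n"
                + "  </div>\n"
                + "  <p>Message content</p>\n"
                + "</div>";
        for (Message msg : parse(html)) {
            System.out.println(msg.userName + " | " + msg.date + " | " + msg.messageContent);
        }
    }
}
```

This stays fragile: a line break inside a tag, extra attributes, nested markup in the message body, or HTML entities would all need extra handling, which is why a real HTML parser is usually the more comfortable route.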

Best answer

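My answer probably won't be useful to the author of this question (I'm 5 months late, so I guess the timing isn't right), but I think it could be useful to the many other developers who may come across it.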

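Today I just released (on behalf of my company) a complete HTML-to-POJO framework that lets you map HTML to any POJO class with a few simple annotations. The library itself is quite handy and comes with many other features, while being very pluggable. You can have a look at it here: https://github.com/whimtrip/jwht-htmltopojo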

How to use: basics

Let's say we need to parse the following HTML page:

<html>
    <head>
        <title>A Simple HTML Document</title>
    </head>
    <body>
        <div class="restaurant">
            <h1>A la bonne Franquette</h1>
            <p>French cuisine restaurant for gourmet of fellow french people</p>
            <div class="location">
                <p>in <span>London</span></p>
            </div>
            <p>Restaurant n*18,190. Ranked 113 out of 1,550 restaurants</p>  
            <div class="meals">
                <div class="meal">
                    <p>Veal Cutlet</p>
                    <p rating-color="green">4.5/5 stars</p>
                    <p>Chef Mr. Frenchie</p>
                </div>

                <div class="meal">
                    <p>Ratatouille</p>
                    <p rating-color="orange">3.6/5 stars</p>
                    <p>Chef Mr. Frenchie and Mme. French-Cuisine</p>
                </div>

            </div> 
        </div>    
    </body>
</html>

Let's create the POJOs we want to map it to:

public class Restaurant {

    @Selector( value = "div.restaurant > h1")
    private String name;

    @Selector( value = "div.restaurant > p:nth-child(2)")
    private String description;

    @Selector( value = "div.restaurant > div:nth-child(3) > p > span")    
    private String location;    

    @Selector( 
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+)\\. Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        indexForRegexPattern = 1,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // strip the thousands separator so that values shown as 18,190 become valid numbers
    @ReplaceWith(value = ",", with = "")
    private Long id;

    @Selector( 
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+)\\. Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        // This time, we want the second regex group and not the first one anymore
        indexForRegexPattern = 2,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // strip the thousands separator so that values shown as 18,190 become valid numbers
    @ReplaceWith(value = ",", with = "")
    private Integer rank;

    @Selector(value = ".meal")    
    private List<Meal> meals;

    // getters and setters

}
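To make the `format` / `indexForRegexPattern` / `@ReplaceWith` combination a bit more concrete, here is roughly what it boils down to in plain Java. This is not the library's internal code, just an illustration with `java.util.regex`; the `RegexGroupDemo` class and `extractNumber` method are names made up for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: plain-Java equivalent of "pick one regex group, drop the commas".
final class RegexGroupDemo {

    // Same pattern as in the annotations; '*' and '.' are escaped so they match literally.
    private static final Pattern RANKING = Pattern.compile(
            "^Restaurant n\\*([0-9,]+)\\. Ranked ([0-9,]+) out of ([0-9,]+) restaurants$");

    // Returns the requested capture group as a number, with thousands separators removed.
    static long extractNumber(String text, int groupIndex) {
        Matcher m = RANKING.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected format: " + text);
        }
        return Long.parseLong(m.group(groupIndex).replace(",", ""));
    }

    public static void main(String[] args) {
        String text = "Restaurant n*18,190. Ranked 113 out of 1,550 restaurants";
        System.out.println(extractNumber(text, 1)); // prints 18190 (the id)
        System.out.println(extractNumber(text, 2)); // prints 113 (the rank)
    }
}
```

Group 1 corresponds to the `id` field and group 2 to the `rank` field above; the comma removal is the step that `@ReplaceWith(value = ",", with = "")` describes.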

Now the Meal class as well:

public class Meal {

    @Selector(value = "p:nth-child(1)")
    private String name;

    @Selector(
        value = "p:nth-child(2)",
        format = "^([0-9.]+)/5 stars$",
        indexForRegexPattern = 1
    )
    private Float stars;

    @Selector(
        value = "p:nth-child(2)",
        // rating-color custom attribute can be used as well
        attr = "rating-color"
    )
    private String ratingColor;

    @Selector(
        value = "p:nth-child(3)"
    )
    private String chefs;

    // getters and setters.
}

We provide more explanation of the code above on our GitHub page.

Now, let's see how to scrape it.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

private static final String MY_HTML_FILE = "my-html-file.html";

// main declares IOException because getHtmlBody() can throw it
public static void main(String[] args) throws IOException {

    HtmlToPojoEngine htmlToPojoEngine = HtmlToPojoEngine.create();

    HtmlAdapter<Restaurant> adapter = htmlToPojoEngine.adapter(Restaurant.class);

    // If there were several restaurants on the same page,
    // you would need to create a parent POJO containing
    // a list of Restaurants, as shown with the meals here
    Restaurant restaurant = adapter.fromHtml(getHtmlBody());

    // That's it, do some magic now!

}

// Reads the whole HTML file into a String, decoded as UTF-8
private static String getHtmlBody() throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(MY_HTML_FILE));
    return new String(encoded, StandardCharsets.UTF_8);
}

Another short example can be found here

Hope this helps someone out there!

Regarding java - Extracting data from HTML into Java objects, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48786494/
