I have a message log from a messaging application, stored as HTML. In the file, a message is rendered like this:
<div class="message">
<div class="message_header">
<span class="user">User Name</span>
<span class="meta">10 february 2018 at 16:17 UTC+01</span>
</div>
<p>Message content</p>
</div>
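The messages in the file are not laid out neatly: a single line may contain several messages, and lines sometimes end in the middle of a message.
I would like to create an instance of a Message class, with fields such as userName, date and messageContent, for each item in the file. Is there an elegant way to do this?
My plan is to iterate over the file, split each line wherever a new message starts, and then pull the data out of the resulting strings, but I would rather avoid that if there is a less tedious approach.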
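For comparison, here is a minimal sketch of the manual approach described above, using only the standard library. Rather than splitting line by line, it treats the whole log as one string and lets a regular expression span line breaks. The names MessageLogParser and parse are made up for illustration, and the pattern assumes every message follows exactly the structure shown above, with line breaks falling between tags or inside text, never inside a tag. Regular expressions are brittle on HTML in general, so treat this as a sketch rather than a robust parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class MessageLogParser {

    /** Simple value holder with the three fields named in the question. */
    static final class Message {
        final String userName;
        final String date;
        final String messageContent;

        Message(String userName, String date, String messageContent) {
            this.userName = userName;
            this.date = date;
            this.messageContent = messageContent;
        }
    }

    // DOTALL lets '.' match line breaks, so text split across lines still matches.
    // \s* tolerates any whitespace (including newlines) between tags.
    private static final Pattern MESSAGE = Pattern.compile(
            "<div class=\"message\">\\s*"
            + "<div class=\"message_header\">\\s*"
            + "<span class=\"user\">(.*?)</span>\\s*"
            + "<span class=\"meta\">(.*?)</span>\\s*"
            + "</div>\\s*"
            + "<p>(.*?)</p>\\s*"
            + "</div>",
            Pattern.DOTALL);

    /** Scans the whole document and builds one Message per match. */
    static List<Message> parse(String html) {
        List<Message> messages = new ArrayList<>();
        Matcher m = MESSAGE.matcher(html);
        while (m.find()) {
            messages.add(new Message(
                    m.group(1).trim(), m.group(2).trim(), m.group(3).trim()));
        }
        return messages;
    }

    public static void main(String[] args) {
        // Two messages on one "line", with a line break in the middle of the second.
        String html = "<div class=\"message\"><div class=\"message_header\">"
                + "<span class=\"user\">User Name</span>"
                + "<span class=\"meta\">10 february 2018 at 16:17 UTC+01</span>"
                + "</div><p>Message content</p></div>"
                + "<div class=\"message\"><div class=\"message_header\">\n"
                + "<span class=\"user\">Other User</span>"
                + "<span class=\"meta\">10 february 2018 at 16:18 UTC+01</span>"
                + "</div><p>Second message</p></div>";
        for (Message message : parse(html)) {
            System.out.println(message.userName + " | " + message.date
                    + " | " + message.messageContent);
        }
    }
}
```

In real use the input string would be the full file content read in one go, which sidesteps the problem of lines ending mid-message.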
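Best answer
My answer probably won't be useful to the author of this question (I'm 5 months late, so I guess the timing isn't right), but I think it may be useful to the many other developers who come across this answer.
Today I released (on behalf of my company) a complete HTML-to-POJO framework that lets you map HTML to any POJO class with a few simple annotations. The library itself is quite handy and offers many other features, while being very pluggable. You can have a look at it here: https://github.com/whimtrip/jwht-htmltopojo
How to use: basics
Suppose we need to parse the following HTML page: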
<html>
<head>
<title>A Simple HTML Document</title>
</head>
<body>
<div class="restaurant">
<h1>A la bonne Franquette</h1>
<p>French cuisine restaurant for gourmet of fellow french people</p>
<div class="location">
<p>in <span>London</span></p>
</div>
<p>Restaurant n*18,190. Ranked 113 out of 1,550 restaurants</p>
<div class="meals">
<div class="meal">
<p>Veal Cutlet</p>
<p rating-color="green">4.5/5 stars</p>
<p>Chef Mr. Frenchie</p>
</div>
<div class="meal">
<p>Ratatouille</p>
<p rating-color="orange">3.6/5 stars</p>
<p>Chef Mr. Frenchie and Mme. French-Cuisine</p>
</div>
</div>
</div>
</body>
</html>
Let's create the POJOs we want to map it to:
import java.util.List;
// @Selector, @ReplaceWith and ReplacerDeserializer are provided by jwht-htmltopojo.

public class Restaurant {

    @Selector(value = "div.restaurant > h1")
    private String name;

    @Selector(value = "div.restaurant > p:nth-child(2)")
    private String description;

    @Selector(value = "div.restaurant > div:nth-child(3) > p > span")
    private String location;

    @Selector(
        value = "div.restaurant > p:nth-child(4)",
        // The backslash is doubled so that the regex receives \* (a literal asterisk)
        format = "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        indexForRegexPattern = 1,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // Strip the thousands separator so that values shown as 18,190 parse as numbers
    @ReplaceWith(value = ",", with = "")
    private Long id;

    @Selector(
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        // This time, we want the second regex group and not the first one
        indexForRegexPattern = 2,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // Strip the thousands separator so that values shown as 18,190 parse as numbers
    @ReplaceWith(value = ",", with = "")
    private Integer rank;

    @Selector(value = ".meal")
    private List<Meal> meals;

    // getters and setters
}
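To make the format, indexForRegexPattern and @ReplaceWith settings less abstract, the sketch below performs the equivalent steps by hand with java.util.regex: match the text, pick a capture group, strip the commas, and parse the number. It does not show how the library works internally; RegexGroupDemo and extractNumber are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it only mirrors the annotation settings in plain Java.
class RegexGroupDemo {

    // Same pattern as in the format attribute; \\* in Java source is the regex \*.
    private static final Pattern RANKING = Pattern.compile(
            "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$");

    /** Returns the requested capture group as a number, with commas removed. */
    static long extractNumber(String text, int groupIndex) {
        Matcher m = RANKING.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected text: " + text);
        }
        // Equivalent of replacing "," with "" before conversion
        return Long.parseLong(m.group(groupIndex).replace(",", ""));
    }

    public static void main(String[] args) {
        String text = "Restaurant n*18,190. Ranked 113 out of 1,550 restaurants";
        System.out.println(extractNumber(text, 1)); // prints 18190 (the id)
        System.out.println(extractNumber(text, 2)); // prints 113 (the rank)
    }
}
```

Group 1 corresponds to the id field and group 2 to the rank field; the third group (the total number of restaurants) is captured by the pattern but not mapped to any field in the POJO.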
Now the Meal class as well:
public class Meal {

    @Selector(value = "p:nth-child(1)")
    private String name;

    @Selector(
        value = "p:nth-child(2)",
        // A forward slash needs no escaping; "\/" is not a valid Java string escape
        format = "^([0-9.]+)/5 stars$",
        indexForRegexPattern = 1
    )
    private Float stars;

    @Selector(
        value = "p:nth-child(2)",
        // The custom rating-color attribute can be read as well
        attr = "rating-color"
    )
    private String ratingColor;

    @Selector(value = "p:nth-child(3)")
    private String chefs;

    // getters and setters
}
We provide more explanations of the above code on our GitHub page.
Now, let's see how to scrape it.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
// HtmlToPojoEngine and HtmlAdapter are provided by jwht-htmltopojo.

public class Main { // class name is arbitrary

    private static final String MY_HTML_FILE = "my-html-file.html";

    // getHtmlBody() throws IOException, so main has to declare it too
    public static void main(String[] args) throws IOException {
        HtmlToPojoEngine htmlToPojoEngine = HtmlToPojoEngine.create();
        HtmlAdapter<Restaurant> adapter = htmlToPojoEngine.adapter(Restaurant.class);
        // If there were several restaurants on the same page,
        // you would need to create a parent POJO containing
        // a list of Restaurants, as shown with the meals here
        Restaurant restaurant = adapter.fromHtml(getHtmlBody());
        // That's it, do some magic now!
    }

    private static String getHtmlBody() throws IOException {
        byte[] encoded = Files.readAllBytes(Paths.get(MY_HTML_FILE));
        return new String(encoded, StandardCharsets.UTF_8);
    }
}
Another short example can be found here
Hope this helps someone out there!
Regarding "java - Extracting data from HTML into Java objects", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48786494/