java - 将 HTML 解析为对象

标签 java jsoup

我尝试使用 jsoup 将以下 html 解析为 Java 中的对象。

我正在尝试遍历元素并将所有“Class”提取为对象以生成时间表数据。每个“类(class)”都有时间、地点、讲师和描述等等,但这不是问题。 所有元素都属于tt_details类。每一天都没有特定的父子关系,但是我可以使用 Elements dayNames = content.getElementsByClass("tt_day"); 提取涉及的日期

每天可以有不同数量的“类(class)”,因为您可以看到星期一有 3 个“类(class)”而星期二有,因此正常的循环结构不起作用。我怎样才能实现这个目标?

<div class='tt_details'>
    <div class='tt_day'>Mon</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>11:00 - 13:00
        <div class='tt_day_small'> (Mon)</div>
    </div>
    <div class='tt_detail'>Internet of Things<br/>E1010 - MAC Lab <br/></div>
    <div class='tt_lecturer'>Loftus, M</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>13:00 - 14:00
        <div class='tt_day_small'> (Mon)</div>
    </div>
    <div class='tt_detail'>Computer Systems & Networking<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>
    <div class='tt_lecturer'>Lang, D</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>16:00 - 18:00
        <div class='tt_day_small'> (Mon)</div>
    </div>
    <div class='tt_detail'>Intro.to Programming L8<br/>D2005 - Computer Laboratory (32) <br/></div>
    <div class='tt_lecturer'>Kinsella,V</div>
</div>
<div class='tt_details'>
    <div class='tt_day'>Tue</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>09:00 - 10:00
        <div class='tt_day_small'> (Tue)</div>
    </div>
    <div class='tt_detail'>Mathematics 2<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>
    <div class='tt_lecturer'>O'Regan,D</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>10:00 - 11:00
        <div class='tt_day_small'> (Tue)</div>
    </div>
    <div class='tt_detail'>Mathematics 2<br/>E0017 - Tiered Classroom (106) <br/></div>
    <div class='tt_lecturer'>O'Regan,D</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>11:00 - 12:00
        <div class='tt_day_small'> (Tue)</div>
    </div>
    <div class='tt_detail'>Intro to Programming<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>
    <div class='tt_lecturer'>Kinsella,V</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>16:00 - 17:00
        <div class='tt_day_small'> (Tue)</div>
    </div>
    <div class='tt_detail'>Computer Systems & Networking<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>
    <div class='tt_lecturer'>Lang, D</div>
</div>

最佳答案

这样的事情可能会有所帮助:

String html = ""
        +"<div class='tt_details'>"
        +"    <div class='tt_day'>Mon</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>11:00 - 13:00"
        +"        <div class='tt_day_small'> (Mon)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Internet of Things<br/>E1010 - MAC Lab <br/></div>"
        +"    <div class='tt_lecturer'>Loftus, M</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>13:00 - 14:00"
        +"        <div class='tt_day_small'> (Mon)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Computer Systems & Networking<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>"
        +"    <div class='tt_lecturer'>Lang, D</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>16:00 - 18:00"
        +"        <div class='tt_day_small'> (Mon)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Intro.to Programming L8<br/>D2005 - Computer Laboratory (32) <br/></div>"
        +"    <div class='tt_lecturer'>Kinsella,V</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_day'>Tue</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>09:00 - 10:00"
        +"        <div class='tt_day_small'> (Tue)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Mathematics 2<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>"
        +"    <div class='tt_lecturer'>O'Regan,D</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>10:00 - 11:00"
        +"        <div class='tt_day_small'> (Tue)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Mathematics 2<br/>E0017 - Tiered Classroom (106) <br/></div>"
        +"    <div class='tt_lecturer'>O'Regan,D</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>11:00 - 12:00"
        +"        <div class='tt_day_small'> (Tue)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Intro to Programming<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>"
        +"    <div class='tt_lecturer'>Kinsella,V</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>16:00 - 17:00"
        +"        <div class='tt_day_small'> (Tue)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Computer Systems & Networking<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>"
        +"    <div class='tt_lecturer'>Lang, D</div>"
        +"</div>"
        ;
Document doc = Jsoup.parse(html);
Elements courseEls = doc.select("div.tt_details:not(:has(div.tt_day))");
class Course{
    public Course(String day, String time, String lecturer, String subject) {
        super();
        this.day = day;
        this.time = time;
        this.lecturer = lecturer;
        this.subject = subject;
    }
    public String day;
    public String time;
    public String lecturer;
    public String subject;

    public String toString(){
        return day + " : "+ time +" : "+ lecturer + " : "+ subject;
    }
}
Map<String,List<Course>> coursesByDay = new HashMap<>();
for (Element courseEl : courseEls){
    Element timeSlotEl = courseEl.select(".tt_timeslot").first();
    String timeSlotStr = timeSlotEl.ownText();
    String dayStr = timeSlotEl.select(".tt_day_small").first().text().trim().replace("(", "").replace(")", "");
    String detailStr = courseEl.select(".tt_detail").first().text();
    String lecturerStr = courseEl.select(".tt_lecturer").first().text();

    Course course = new Course(dayStr, timeSlotStr, lecturerStr, detailStr);
    List<Course> courses = coursesByDay.get(dayStr);
    if (courses == null){
        courses = new ArrayList<>();
        coursesByDay.put(dayStr, courses);
    }
    courses.add(course);
}

//get all courses on Tue
List<Course> courses = coursesByDay.get("Tue");
for (Course c : courses){
    System.out.println(c);
}

这将创建一张包含每日类(class)的 map 。因此, map 键是日期,它包含类(class)对象的列表。

对此的一些评论:

  • 我使用自定义对象来保存类(class)信息
  • 我使用选择器 div.tt_details:not(:has(div.tt_day)) 仅获取类(class) div,而忽略了当天的 div。这是可能的,因为有关当天的信息在类(class) div 中重复。
  • CSS 选择器用于获取详细信息。
  • 请注意 ownText() 和 text() 之间的区别。这用于仅获取没有日期的时间信息。
  • map 会动态填充其内容。

关于java - 将 HTML 解析为对象,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/35322869/

相关文章:

java - JSoup 的替代方案或如何清理空格

java - Jsoup 在列表项内嵌套值

java - 带有 JUnit 的 Selenium 测试用例脚本不适用于 Firefox 驱动程序的 Eclipse

java - Spring MVC Hello World 应用程序不工作

Java中断

java - 状态更改时日期将被删除

java - 使用 JSOUP 抓取网页上的链接

java - Jsoup链接选择

java - MongoOperations 保存方法给出 NoSuchMethodError 异常

java - 在 java 上从 url 解析 pdf。我可以使用 jsoup 吗?