java - HTML truncator in Java

Tags: java html parsing truncate

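Is there any utility (or sample source code) that truncates HTML in Java, for use in previews? I want to do the truncation on the server, not on the client.

I'm using HTMLUnit to parse the HTML.

I want to be able to preview the HTML, so the truncator should keep the HTML structure intact while stripping out elements that fall after the desired output length.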


I've written another Java version of truncateHTML. This function truncates a string to a given number of characters while preserving whole words and HTML tags.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String truncateHTML(String text, int length, String suffix) {
    // if the plain text is shorter than the maximum length, return the whole text
    if (text.replaceAll("<.*?>", "").length() <= length) {
        return text;
    }
    StringBuilder result = new StringBuilder();
    boolean trimmed = false;
    if (suffix == null) {
        suffix = "...";
    }

    /*
     * This pattern creates tokens, where each line starts with the tag.
     * For example, "One, <b>Two</b>, Three" produces the following:
     *     One,
     *     <b>Two
     *     </b>, Three
     */
    Pattern tagPattern = Pattern.compile("(<.+?>)?([^<>]*)");

    /*
     * Checks for an empty tag, for example img, br, etc.
     * The \b keeps longer names such as colgroup or frameset from matching.
     */
    Pattern emptyTagPattern = Pattern.compile("^<\\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)\\b.*>$");

    /*
     * Modified the pattern to also include H1-H6 tags
     * Checks for closing tags, allowing leading and ending space inside the brackets
     */
    Pattern closingTagPattern = Pattern.compile("^<\\s*/\\s*([a-zA-Z]+[1-6]?)\\s*>$");

    /*
     * Modified the pattern to also include H1-H6 tags
     * Checks for opening tags, allowing leading and ending space inside the brackets
     */
    Pattern openingTagPattern = Pattern.compile("^<\\s*([a-zA-Z]+[1-6]?).*?>$");

    /*
     * Find &nbsp; &gt; ...
     * The hexadecimal alternative needs the "&#x" prefix; without it, any
     * short run of hex digits followed by ';' would count as an entity.
     */
    Pattern entityPattern = Pattern.compile("(&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};)");

    // splits all html-tags to scanable lines
    Matcher tagMatcher = tagPattern.matcher(text);

    // the suffix counts against the length budget
    int totalLength = suffix.length();
    List<String> openTags = new ArrayList<String>();

    boolean proposingChop = false;
    while (tagMatcher.find()) {
        String tagText = tagMatcher.group(1);   // the tag, or null if the token has none
        String plainText = tagMatcher.group(2); // the text up to the next tag

        // a chop was proposed at the end of the previous token and more text follows
        if (proposingChop &&
                tagText != null && tagText.length() != 0 &&
                plainText != null && plainText.length() != 0) {
            trimmed = true;
            break;
        }

        // if there is any html-tag in this line, handle it and add it (uncounted) to the output
        if (tagText != null && tagText.length() > 0) {
            boolean foundMatch = false;

            // if it's an "empty element" with or without xhtml-conform closing slash
            Matcher matcher = emptyTagPattern.matcher(tagText);
            if (matcher.find()) {
                foundMatch = true;
                // do nothing
            }

            // closing tag?
            if (!foundMatch) {
                matcher = closingTagPattern.matcher(tagText);
                if (matcher.find()) {
                    foundMatch = true;
                    // delete tag from openTags list
                    String tagName = matcher.group(1);
                    openTags.remove(tagName.toLowerCase());
                }
            }

            // opening tag?
            if (!foundMatch) {
                matcher = openingTagPattern.matcher(tagText);
                if (matcher.find()) {
                    // add tag to the beginning of openTags list
                    String tagName = matcher.group(1);
                    openTags.add(0, tagName.toLowerCase());
                }
            }

            // add html-tag to result
            result.append(tagText);
        }

        // calculate the length of the plain text part of the line; handle entities (e.g. &nbsp;) as one character
        int contentLength = entityPattern.matcher(plainText).replaceAll(" ").length();
        if (totalLength + contentLength > length) {
            // the number of characters which are left (never negative, even if the suffix exceeds the limit)
            int numCharsRemaining = Math.max(0, length - totalLength);
            int entitiesLength = 0;
            Matcher entityMatcher = entityPattern.matcher(plainText);
            while (entityMatcher.find()) {
                String entity = entityMatcher.group(1);
                if (numCharsRemaining > 0) {
                    // the entity uses one visible character but entity.length() source characters
                    numCharsRemaining--;
                    entitiesLength += entity.length();
                } else {
                    // no more characters left
                    break;
                }
            }

            // keep us from chopping words in half
            int proposedChopPosition = numCharsRemaining + entitiesLength;
            int endOfWordPosition = plainText.indexOf(" ", proposedChopPosition - 1);
            if (endOfWordPosition == -1) {
                endOfWordPosition = plainText.length();
            }
            int endOfWordOffset = endOfWordPosition - proposedChopPosition;
            if (endOfWordOffset > 6) { // chop the word if it's extra long
                endOfWordOffset = 0;
            }

            proposedChopPosition = numCharsRemaining + entitiesLength + endOfWordOffset;
            if (plainText.length() >= proposedChopPosition) {
                result.append(plainText, 0, proposedChopPosition);
                proposingChop = true;
                if (proposedChopPosition < plainText.length()) {
                    trimmed = true;
                    break; // maximum length is reached, so get off the loop
                }
            } else {
                result.append(plainText);
            }
        } else {
            result.append(plainText);
            totalLength += contentLength;
        }
        // if the maximum length is reached, get off the loop
        if (totalLength >= length) {
            trimmed = true;
            break;
        }
    }

    // close every tag that is still open, innermost first
    for (String openTag : openTags) {
        result.append("</").append(openTag).append(">");
    }
    if (trimmed) {
        result.append(suffix);
    }
    return result.toString();
}
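
To make the two core ideas of the function easier to see in isolation, here is a minimal sketch of the token pattern and the open-tag bookkeeping. The class name `TagTokenDemo` and the helpers `tokenize` and `closingTagsFor` are hypothetical names chosen for illustration; the regular expressions are the ones used in the function. The sketch deliberately ignores empty elements such as `<br>` and does no length counting.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagTokenDemo {
    // An optional tag followed by the text up to the next tag.
    private static final Pattern TOKEN = Pattern.compile("(<.+?>)?([^<>]*)");
    private static final Pattern CLOSING = Pattern.compile("^<\\s*/\\s*([a-zA-Z]+[1-6]?)\\s*>$");
    private static final Pattern OPENING = Pattern.compile("^<\\s*([a-zA-Z]+[1-6]?).*?>$");

    /** Splits HTML into "tag|text" tokens, skipping the empty match at the end of input. */
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String tag = m.group(1) == null ? "" : m.group(1);
            String text = m.group(2);
            if (!tag.isEmpty() || !text.isEmpty()) {
                tokens.add(tag + "|" + text);
            }
        }
        return tokens;
    }

    /** Returns the closing tags for elements still open at the end of a fragment, innermost first. */
    static String closingTagsFor(String fragment) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TOKEN.matcher(fragment);
        while (m.find()) {
            String tag = m.group(1);
            if (tag == null) {
                continue; // token without a tag
            }
            Matcher closing = CLOSING.matcher(tag);
            if (closing.find()) {
                // a closing tag cancels the most recently opened tag of the same name
                open.removeFirstOccurrence(closing.group(1).toLowerCase());
                continue;
            }
            Matcher opening = OPENING.matcher(tag);
            if (opening.find()) {
                open.push(opening.group(1).toLowerCase());
            }
        }
        StringBuilder sb = new StringBuilder();
        for (String name : open) {
            sb.append("</").append(name).append(">");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String token : tokenize("One, <b>Two</b>, Three")) {
            System.out.println(token);
        }
        System.out.println(closingTagsFor("<p>Hello <b>wonder"));
    }
}
```

Keeping the most recently opened tag at the front of the collection means the closing tags come out innermost first, which is what keeps the truncated fragment well nested.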

Regarding the HTML truncator in Java, a similar question can be found on Stack Overflow.

