PHP HTML truncate and UTF-8

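I need to truncate a string to a specified length, ignoring HTML tags. I found a suitable function for this.

So I changed it slightly and added output buffering with ob_start();

The problem is with UTF-8. If the last character of the truncated string is one of [ą,č,ę,ė,į,š,ų,ū,ž], I get the replacement character U+FFFD � at the end of the string.

Here is my code. You can copy-paste it and try it yourself: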

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>String truncate</title>
</head>

<?php   

    $html = '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>';

    $html = html_truncate(27, $html);

    echo $html;

    /* Truncate HTML, close opened tags
    *
    * @param int, maxlength of the string
    * @param string, html       
    * @return $html
    */  
    function html_truncate($maxLength, $html){

        $printedLength = 0;
        $position = 0;
        $tags = array();

        ob_start();

        while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){

            list($tag, $tagPosition) = $match[0];

            // Print text leading up to the tag.
            $str = substr($html, $position, $tagPosition - $position);
            if ($printedLength + strlen($str) > $maxLength){
                print(substr($str, 0, $maxLength - $printedLength));
                $printedLength = $maxLength;
                break;
            }

            print($str);
            $printedLength += strlen($str);

            if ($tag[0] == '&'){
                // Handle the entity.
                print($tag);
                $printedLength++;
            }
            else{
                // Handle the tag.
                $tagName = $match[1][0];
                if ($tag[1] == '/'){
                    // This is a closing tag.

                    $openingTag = array_pop($tags);
                    assert($openingTag == $tagName); // check that tags are properly nested.

                    print($tag);
                }
                else if ($tag[strlen($tag) - 2] == '/'){
                    // Self-closing tag.
                    print($tag);
                }
                else{
                    // Opening tag.
                    print($tag);
                    $tags[] = $tagName;
                }
            }

            // Continue after the tag.
            $position = $tagPosition + strlen($tag);
        }

        // Print any remaining text.
        if ($printedLength < $maxLength && $position < strlen($html))
            print(substr($html, $position, $maxLength - $printedLength));

        // Close any open tags.
        while (!empty($tags))
             printf('</%s>', array_pop($tags));


        $bufferOuput = ob_get_contents();

        ob_end_clean();         

        $html = $bufferOuput;   

        return $html;   

    }

?>

<body>
</body>
</html>

The result of this function looks like this:

Koks nors tekstas.
Lietuvi�…

Any ideas why this function is messing up UTF-8?

Best answer

Any ideas why this function is messing up with UTF-8 ?

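The general problem is that the function does not handle UTF-8 strings. It handles strings in US-ASCII, Latin-1, or any other single-byte charset.

You are looking to make the function compatible with UTF-8. UTF-8 is a multi-byte encoding: a single character can take more than one byte.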
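To make the byte/character difference concrete, here is a small sketch using the word from the question. It assumes the mbstring extension is loaded and that the source file itself is saved as UTF-8. In the question's example, 19 characters have already been printed when the word starts, so 8 of the 27 remain, which is why the cut lands on "š".

```php
<?php
// "š" is encoded as two bytes (0xC5 0xA1) in UTF-8.
$word = 'Lietuviškas';

var_dump(strlen($word));              // int(12): counts bytes
var_dump(mb_strlen($word, 'UTF-8'));  // int(11): counts characters

// Cutting at byte 8 keeps only the first byte of "š".
$byteCut = substr($word, 0, 8);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false): invalid UTF-8

// Cutting at character 8 keeps "š" whole.
$charCut = mb_substr($word, 0, 8, 'UTF-8');
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```

The browser renders the orphaned first byte of "š" as U+FFFD, which is the symbol you see at the end of the output.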

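To do that, you must verify that every string function used inside the function handles UTF-8 correctly:

preg_match needs a pattern with the u modifier to handle UTF-8 strings.
substr needs to be replaced with mb_substr.
strlen needs to be replaced with mb_strlen.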
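The three points above could be applied roughly as follows. This is a sketch under some assumptions, not a drop-in fix: the name html_truncate_utf8 is made up for illustration, it returns a string instead of using output buffering, and it needs PHP 7.1+ with mbstring. One caveat when swapping the functions: with PREG_OFFSET_CAPTURE, preg_match reports offsets in bytes even with the u modifier, so the sketch keeps substr/strlen for moving through $html by offset and uses mb_strlen/mb_substr only for counting and cutting the visible text.

```php
<?php
// Hypothetical UTF-8 aware variant of html_truncate(); requires ext-mbstring.
function html_truncate_utf8(int $maxLength, string $html): string
{
    $printedLength = 0;  // visible characters emitted so far
    $position = 0;       // byte offset into $html
    $tags = [];          // stack of currently open tag names
    $out = '';
    $pattern = '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}u';

    while ($printedLength < $maxLength
        && preg_match($pattern, $html, $match, PREG_OFFSET_CAPTURE, $position)) {
        [$tag, $tagPosition] = $match[0];

        // Offsets from preg_match are byte offsets, so slice with substr().
        $str = substr($html, $position, $tagPosition - $position);
        $strLength = mb_strlen($str, 'UTF-8'); // length in characters
        if ($printedLength + $strLength > $maxLength) {
            // Cut on a character boundary, never inside a multi-byte sequence.
            $out .= mb_substr($str, 0, $maxLength - $printedLength, 'UTF-8');
            $printedLength = $maxLength;
            break;
        }
        $out .= $str;
        $printedLength += $strLength;

        if ($tag[0] === '&') {
            $out .= $tag;          // an entity counts as one character
            $printedLength++;
        } else {
            if ($tag[1] === '/') {
                array_pop($tags);  // closing tag
            } elseif (substr($tag, -2, 1) !== '/') {
                $tags[] = $match[1][0]; // opening tag (not self-closing)
            }
            $out .= $tag;
        }
        $position = $tagPosition + strlen($tag); // advance in bytes
    }

    // Remaining text after the last tag, again cut by characters.
    if ($printedLength < $maxLength && $position < strlen($html)) {
        $out .= mb_substr(substr($html, $position), 0, $maxLength - $printedLength, 'UTF-8');
    }
    // Close any tags that are still open.
    while (!empty($tags)) {
        $out .= sprintf('</%s>', array_pop($tags));
    }
    return $out;
}

echo html_truncate_utf8(27, '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>');
// prints: <b>Koks nors tekstas</b>. <p>Lietuviš</p>
```

Because $position always lands right after a tag or entity, it points at the start of a character, which the u modifier requires for the offset argument.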

As you are dealing with HTML, it might be safer to use DOMDocument to manipulate the HTML block. This is just a note: it is more flexible and works properly.
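A rough sketch of what the DOMDocument route might look like, assuming the dom and mbstring extensions are available. The helper name dom_truncate, the wrapper div, and the XML-declaration trick for forcing UTF-8 input are illustrative choices, not something from the original posts. Elements whose text is removed entirely are left in place as empty elements, and the exact serialization (raw characters versus entities) can differ between libxml versions.

```php
<?php
// Hypothetical helper: truncate the visible text of an HTML fragment to
// $maxLength characters by walking the DOM. Requires ext-dom and ext-mbstring.
function dom_truncate(int $maxLength, string $html): string
{
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc = new DOMDocument();
    // The leading XML declaration is a common workaround so that loadHTML()
    // reads the input as UTF-8; the wrapper <div> provides a single root.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($doc);
    $remaining = $maxLength; // character budget still available

    // Visit text nodes in document order and spend the budget.
    foreach (iterator_to_array($xpath->query('.//text()', $root)) as $text) {
        $length = mb_strlen($text->data, 'UTF-8');
        if ($remaining <= 0) {
            $text->parentNode->removeChild($text); // budget exhausted
        } elseif ($length > $remaining) {
            $text->data = mb_substr($text->data, 0, $remaining, 'UTF-8');
            $remaining = 0;
        } else {
            $remaining -= $length;
        }
    }

    // Serialize only the wrapper's children; the parser has already
    // balanced the tags, so nothing needs to be closed by hand.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = dom_truncate(27, '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>');
echo $result;
```

The upside of this approach is that tag nesting and entities are handled by the parser instead of a regular expression; the downside is that the parser may restructure malformed input.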

Source: Stack Overflow, https://stackoverflow.com/questions/8227040/
