javascript - Extract a substring by UTF-8 byte positions

Tags: javascript string utf-8 character-encoding utf-16

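I have a string, a start, and a length, and I want to extract a substring. Both positions (start and length) are byte offsets into the original UTF-8 string.

However, there is a problem:

The start and length are in bytes, so I cannot use "substring". The UTF-8 string contains several multibyte characters. Is there a hyper-efficient way to do this? (I don't need to decode the bytes...)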

Example: var orig = '你好吗?'

The s,e might be 3,3 to extract the second character (好). I'm looking for

var result = orig.substringBytes(3,3);

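Help!

Update #1: In C/C++ I would just cast it to a byte array, but I'm not sure there is an equivalent in JavaScript. By the way, yes, we could parse it into a byte array and parse it back to a string, but it seems there should be a quick way to cut it at the right place. Imagine 'orig' is 1000000 characters, s = 6 bytes and l = 3 bytes.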
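
For reference, the byte-array round trip mentioned in Update #1 might look like the sketch below. It assumes an environment that provides the standard TextEncoder and TextDecoder globals (modern browsers, recent Node.js), which the original post does not mention; the function name substrBytesViaEncoder is hypothetical. It encodes the whole string, so it is the simple approach rather than the fast one the question asks for.

```javascript
// Sketch only: slice by UTF-8 byte offsets via a byte-array round trip.
// substrBytesViaEncoder is a hypothetical name, not from the original post.
function substrBytesViaEncoder(str, startInBytes, lengthInBytes) {
  // Encode the entire string to a Uint8Array of UTF-8 bytes.
  var bytes = new TextEncoder().encode(str);
  // View the requested byte range without copying it.
  var slice = bytes.subarray(startInBytes, startInBytes + lengthInBytes);
  // Decode the byte range back into a JavaScript string.
  return new TextDecoder('utf-8').decode(slice);
}

console.log(substrBytesViaEncoder('你好吗?', 3, 3)); // 好
```

If the offsets fall in the middle of a multibyte character, TextDecoder substitutes U+FFFD replacement characters for the broken bytes instead of throwing.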

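Update #2: Thanks to zerkms' helpful redirection, I ended up with the following, which does not work correctly: it is fine for multibyte characters but gets the byte count wrong for single-byte ones.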

function substrBytes(str, start, length)
{
    var ch, startIx = 0, endIx = 0, re = '';
    for (var i = 0; i < str.length; i++)
    {
        startIx = endIx++;

        ch = str.charCodeAt(i);
        do {
            ch = ch >> 8;   // a better way may exist to measure ch len
            endIx++;
        }
        while (ch);

        if (endIx > start + length)
        {
            return re;
        }
        else if (startIx >= start)
        {
            re += str[i];
        }
    }
}

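Update #3: I don't think shifting the char code really works. I'm reading two bytes when the correct answer is three... somehow I always forget this. The code point is the same in UTF-8 and UTF-16, but the number of bytes it occupies depends on the encoding! So this is not the right way to do it.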
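
To make the point in Update #3 concrete, here is a small sketch that compares the shift-based count with the number of bytes UTF-8 needs for the same code point. The helper names utf8Length and shiftLength are hypothetical; the ranges are the standard UTF-8 ones.

```javascript
// Hypothetical helper: number of bytes UTF-8 uses for a given code point.
function utf8Length(codePoint) {
  if (codePoint < 0x80) return 1;     // U+0000..U+007F
  if (codePoint < 0x800) return 2;    // U+0080..U+07FF
  if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
  return 4;                           // U+10000..U+10FFFF
}

// Hypothetical helper: counts how many 8-bit shifts it takes to clear the
// value, which is what a ">> 8" loop effectively measures.
function shiftLength(code) {
  var n = 0;
  do {
    code = code >> 8;
    n++;
  } while (code);
  return n;
}

var code = '你'.charCodeAt(0);     // 0x4F60
console.log(shiftLength(code));   // 2
console.log(utf8Length(code));    // 3
```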

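Best answer

I had some fun fiddling with this. Hope it helps.

Because JavaScript does not allow direct byte access on strings, the only way to find the start position is a forward scan.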


Update #3 I don't think shifting the char code really works. I'm reading two bytes when the correct answer is three... somehow I always forget this. The codepoint is the same for UTF8 and UTF16, but the number of bytes taken up on encoding depends on the encoding!!! So this is not the right way to do this.

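This is not correct: there is actually no such thing as a UTF-8 string in JavaScript. According to the ECMAScript (ECMA-262) specification, every string, regardless of the input encoding, is a sequence of 16-bit unsigned integers, normally interpreted as UTF-16 code units.

With that in mind, the 8-bit shift is correct (but unnecessary).

What is wrong is the assumption that your character is stored as a 3-byte sequence...
In fact, every element of a JS (ECMA-262) string is 16 bits (2 bytes) long.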
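
A quick way to check the 16-bit claim is to look at what length and charCodeAt report. This is only an illustrative sketch; the variable names are made up for the example.

```javascript
var han = '你';     // U+4F60, inside the Basic Multilingual Plane
var emoji = '😀';   // U+1F600, outside the Basic Multilingual Plane

console.log(han.length);                      // 1
console.log(han.charCodeAt(0).toString(16));  // 4f60

// A code point above U+FFFF occupies two 16-bit units (a surrogate pair).
console.log(emoji.length);                    // 2
console.log(emoji.charCodeAt(0).toString(16), emoji.charCodeAt(1).toString(16)); // d83d de00
```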

This can be worked around by converting the multibyte characters to UTF-8 manually, as shown in the code below.
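
To see what such a conversion produces for a single character, here is a small sketch using the encodeURIComponent/unescape trick. It yields a string in which each character stands for one UTF-8 byte. Note that unescape is a legacy function; it is still available in browsers and Node.js, but that availability is an assumption about your environment.

```javascript
// Returns a "byte string": one character (code 0..255) per UTF-8 byte.
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

var bytes = encode_utf8('你');
console.log(bytes.length); // 3

// Print each byte as hex.
var hex = Array.from(bytes, function (c) {
  return c.charCodeAt(0).toString(16);
}).join(' ');
console.log(hex); // e4 bd a0
```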


Update: This solution does not handle code points >= U+10000, including emoji. See APerson's answer for a more complete solution.


See the details explained in my example code:

function encode_utf8( s )
{
  return unescape( encodeURIComponent( s ) );
}

function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

   /* this function scans a multibyte string and returns a substring. 
    * arguments are start position and length, both defined in bytes.
    * 
    * this is tricky, because javascript only allows character level 
    * and not byte level access on strings. Also, all strings are stored
    * in utf-16 internally - so we need to convert characters to utf-8
    * to detect their length in utf-8 encoding.
    *
    * the startInBytes and lengthInBytes parameters are based on byte 
    * positions in a utf-8 encoded string.
    * in utf-8, for example:
    *       "a"  is 1 byte,
    *       "ü"  is 2 bytes,
    *   and "你" is 3 bytes.
    *
    * NOTE:
    * according to ECMAScript 262 all strings are stored as a sequence
    * of 16-bit values. so we need an encode_utf8() function to safely
    * detect the length our character would have in a utf8 representation.
    * 
    * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf
    * see "4.3.16 String Value":
    * > Although each value usually represents a single 16-bit unit of 
    * > UTF-16 text, the language does not place any restrictions or 
    * > requirements on the values except that they be 16-bit unsigned 
    * > integers.
    */

    var resultStr = '';
    var startInChars = 0;
    var bytePos, ch, end, n;

    // scan the string forward to find the index of the first character
    // (convert the start position in bytes to a start position in characters)

    for (bytePos = 0; bytePos < startInBytes && startInChars < str.length; startInChars++) {

        // get the numeric code of the character (>= 128 for a multibyte character)
        // and increase "bytePos" by the utf-8 byte count of that character

        ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
    }

    // now that we have the position of the starting character,
    // we can build the resulting substring

    // as we don't know the end position in chars yet, we start with a mix of
    // chars and bytes. we decrease "end" by the byte count of each selected
    // character to end up in the right position
    end = startInChars + lengthInBytes - 1;

    // "n < str.length" stops the scan at the end of the string
    for (n = startInChars; n < str.length && startInChars <= end; n++) {
        // get the numeric code of the character (>= 128 for a multibyte character)
        // and decrease "end" by the utf-8 byte count of that character
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;

        resultStr += str[n];
    }

    return resultStr;
}

var orig = 'abc你好吗?';

alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab"
alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c"
alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你"
alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗"
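
The limitation noted in the update (code points at or above U+10000) can be avoided by scanning code point by code point instead of one 16-bit unit at a time. The following is a minimal sketch under that assumption, not the more complete answer referred to above; the name substrUtf8BytesByCodePoint is hypothetical, and it assumes an ES2015 environment (for...of over strings, codePointAt). Unlike the function above, it drops any character that lies only partly inside the requested byte range.

```javascript
// Sketch only: forward scan by code point, counting UTF-8 bytes arithmetically.
function substrUtf8BytesByCodePoint(str, startInBytes, lengthInBytes) {
  var endInBytes = startInBytes + lengthInBytes;
  var bytePos = 0;
  var result = '';

  // for...of walks the string by code point, so a surrogate pair arrives as one item.
  for (var ch of str) {
    if (bytePos >= endInBytes) break;

    var cp = ch.codePointAt(0);
    var len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;

    // Keep only characters whose bytes lie entirely inside [start, end).
    if (bytePos >= startInBytes && bytePos + len <= endInBytes) {
      result += ch;
    }
    bytePos += len;
  }
  return result;
}

console.log(substrUtf8BytesByCodePoint('abc你好吗?', 6, 6)); // 好吗
console.log(substrUtf8BytesByCodePoint('a😀b', 1, 4));       // 😀
```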

Regarding javascript - extracting a substring by UTF-8 byte positions, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11200451/
