c++ - QString for Persian

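I have been handed a Qt project that needs to support Persian. The data is sent from a server; the first line below gives me a QByteArray, and the second line converts it to a QString: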

    QByteArray readData = socket->readAll();
    QString DataAsString = QTextCodec::codecForUtfText(readData)->toUnicode(readData);

When the data sent is English, everything works, but when it is Persian, instead of

سلام

I get

Ø³Ù\u0084Ø§Ù\u0085

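I describe the process so that people won't suggest the .tr approach for building multilingual applications. This is purely about text and decoding, not about those translation methods. My OS is Windows 8.1 (in case you need to know).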

When the server sends سلام, I get this hex value:

0008d8b3d984d8a7d985

By the way, for reasons I don't know, the server sends two extra bytes at the beginning. So I cut them off with this:

DataAsString.remove(0,2);

After the data is converted to a QString, the hex value has some extra values at the beginning.

Best answer

I was too curious to wait for a reply and played around a bit myself:

I copied the text سلام (English: "Hello") and pasted it into Notepad++ (which used UTF-8 encoding in my case). Then I switched to View as Hex and looked at the dump.

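The ASCII dump on the right side looked somewhat similar to what the OP unexpectedly got. This made me believe that the bytes in readData are encoded in UTF-8. Hence, I took the exposed hex numbers and made a little sample code: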

testQPersian.cc:

#include <QtWidgets>

int main(int argc, char **argv)
{
  QByteArray readData = "\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85";
  QString textLatin1 = QString::fromLatin1(readData);
  QString textUtf8 = QString::fromUtf8(readData);
  QApplication app(argc, argv);
  QWidget qWin;
  QGridLayout qGrid;
  qGrid.addWidget(new QLabel("Latin-1:"), 0, 0);
  qGrid.addWidget(new QLabel(textLatin1), 0, 1);
  qGrid.addWidget(new QLabel("UTF-8:"), 1, 0);
  qGrid.addWidget(new QLabel(textUtf8), 1, 1);
  qWin.setLayout(&qGrid);
  qWin.show();
  return app.exec();
}

testQPersian.pro:

SOURCES = testQPersian.cc

QT += widgets

Compiled and tested in cygwin on Windows 10:

$ qmake-qt5 testQPersian.pro

$ make

$ ./testQPersian

Again, the output as Latin-1 looks somewhat similar to what the OP got and to what Notepad++ exposed.

The output as UTF-8 shows the expected text (unsurprisingly, because I provided proper UTF-8 encoding as input).

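It may be a bit confusing that the ASCII/Latin-1 output varies. There are multiple single-byte character encodings that share ASCII in the lower half (0 ... 127) but assign different meanings to the bytes in the upper half (128 ... 255). (Have a look at ISO/IEC 8859 to see what I mean. These encodings were introduced for localization before Unicode became popular as the final solution to the localization problem.)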

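Persian characters all have Unicode code points above 127. (Unicode shares ASCII for the first 128 code points as well.) In UTF-8, such code points are encoded as sequences of multiple bytes in which every byte has its MSB (most significant bit, bit 7) set. Hence, if these bytes are (accidentally) interpreted with any ISO 8859 encoding, the upper half becomes relevant. Thus, depending on the ISO 8859 encoding currently in use, this may produce different glyphs.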

Some continuation:

The OP sent the following snapshot:

So, it seems that instead of

d8 b3 d9 84 d8 a7 d9 85

he got

00 08 d8 b3 d9 84 d8 a7 d9 85

A possible explanation:

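The server first sends a 16-bit length, 00 08, which read as a big-endian 16-bit integer is 8, followed by 8 bytes encoded in UTF-8 (which look exactly like the ones I got when playing around above). (AFAIK, it is not unusual to use big-endian for binary network protocols to prevent endianness issues when sender and receiver natively have different byte orders.) Further reading, e.g. here: htons(3) - Linux man page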

On the i386 the host byte order is Least Significant Byte first, whereas the network byte order, as used on the Internet, is Most Significant Byte first.

The OP stated that this protocol is used, DataOutput – writeUTF:

Writes two bytes of length information to the output stream, followed by the modified UTF-8 representation of every character in the string s. If s is null, a NullPointerException is thrown. Each character in the string s is converted to a group of one, two, or three bytes, depending on the value of the character.

Hence, the decoding could look like this:

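// Sample data: 2 length bytes followed by 8 bytes of UTF-8.
QByteArray readData("\x00\x08\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85", 10);
//QByteArray readData = socket->readAll();
// Combine the first two bytes into a big-endian 16-bit length.
unsigned length
  = ((uint8_t)readData[0] <<  8) + (uint8_t)readData[1];
// Skip the 2 length bytes and decode the rest as UTF-8.
QString text = QString::fromUtf8(readData.data() + 2, length);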

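The first two bytes are extracted from readData and combined into length (decoding the big-endian 16-bit integer).
The rest of readData is converted to a QString, passing the previously extracted length. Thereby, the first 2 length bytes of readData are skipped.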

Regarding c++ - QString for Persian, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52018081/
