php - Testing a non-UTF-8 string

I have read some other threads about this problem, but I don't understand what I am doing wrong.

I have a function

public function reEncode($item)
{
    if (! mb_detect_encoding($item, 'utf-8', true)) {
        $item = utf8_encode($item);
    }

    return $item;
}

I am writing a test for it. I want to test a string that is not UTF-8 and check that the if statement is hit. I am having trouble creating the test string.

$contents = file_get_contents('CyrillicKOI8REncoded.txt');
var_dump(mb_detect_encoding($contents));

$sanitized = $this->reEncode($contents);
var_dump(mb_detect_encoding($sanitized));

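Initially I used file_get_contents() on a file that I had saved from Sublime in various encodings: Cyrillic (KOI8-R), HEX and DOS (CP 437), because it is stated that file_get_contents() ignores the file's encoding. That seems to be true, since the returned characters are a jumble.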
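The claim that file_get_contents() ignores the encoding can be checked with a small round trip. This is only a sketch: it assumes the mbstring extension is loaded and that the source file is saved as UTF-8, and the temporary file stands in for a fixture such as CyrillicKOI8REncoded.txt.

```php
<?php
// Assumes mbstring is loaded and this source file is saved as UTF-8.
$koi8 = mb_convert_encoding('Котёнок', 'KOI8-R', 'UTF-8');

// Hypothetical temporary fixture file; any writable path would do.
$path = tempnam(sys_get_temp_dir(), 'koi8');
file_put_contents($path, $koi8);

// file_get_contents() returns the raw bytes; it does no transcoding.
$contents = file_get_contents($path);
unlink($path);

var_dump($contents === $koi8); // the bytes come back unchanged
var_dump(strlen($contents));   // KOI8-R is single-byte: one byte per letter
```

Building the bytes in the test itself, rather than relying on an editor's "save with encoding" option, also makes it obvious which byte sequence the test is really exercising.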

That said, whenever I call mb_detect_encoding() on these variables, I always get ASCII or UTF-8. The if statement is never triggered, because ASCII is a subset of UTF-8.

So I tried using mb_convert_encoding() and iconv() to convert a basic string to UTF-16, UTF-32, base64, hex and so on, but every time mb_detect_encoding() returns ASCII or UTF-8.
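One likely reason for these results: base64 and hex output consists of ASCII characters by construction, and the UTF-16 bytes of short Cyrillic or Latin text happen to all be below 0x80 as well, so they are also valid ASCII. In addition, when no candidate list is passed, mb_detect_encoding() only chooses among the encodings in its detect order, which by default is usually just ASCII and UTF-8. The sketch below illustrates the byte-level point; isSevenBitAscii() is a hypothetical helper, and mbstring plus a UTF-8 source file are assumed.

```php
<?php
// Assumes mbstring is loaded and this source file is saved as UTF-8.

// Hypothetical helper: true when every byte is below 0x80 (7-bit ASCII).
function isSevenBitAscii(string $bytes): bool
{
    return preg_match('/^[\x00-\x7F]*$/D', $bytes) === 1;
}

$text = 'Котёнок';

$candidates = [
    'base64'   => base64_encode($text),
    'hex'      => bin2hex($text),
    'UTF-16BE' => mb_convert_encoding($text, 'UTF-16BE', 'UTF-8'),
    'KOI8-R'   => mb_convert_encoding($text, 'KOI8-R', 'UTF-8'),
];

// Report, for each byte string, whether it is pure ASCII and valid UTF-8.
foreach ($candidates as $label => $bytes) {
    printf(
        "%-8s ascii=%s valid-utf8=%s\n",
        $label,
        var_export(isSevenBitAscii($bytes), true),
        var_export(mb_check_encoding($bytes, 'UTF-8'), true)
    );
}
```

Of these four, only the KOI8-R bytes are neither ASCII nor valid UTF-8, which makes them a suitable input for hitting the if branch in reEncode().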

In my test, I want to assert the encoding both before and after calling this function.

$sanitized = $this->reEncode($contents);

$this->assertEquals('UTF-32', mb_detect_encoding($contents));
$this->assertEquals('UTF-8', mb_detect_encoding($sanitized));

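I don't understand what basic mistake I am making that keeps mb_detect_encoding() returning ASCII or UTF-8.

Best answer

OK, it turns out that you have to check in strict mode (pass the expected encoding and true as the third argument), otherwise mb_detect_encoding() is close to useless.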

$item = mb_convert_encoding('Котёнок', 'KOI8-R');

$sanitized = $this->reEncode($item);

$this->assertEquals('KOI8-R', mb_detect_encoding($item, 'KOI8-R', true));
$this->assertEquals('UTF-8', mb_detect_encoding($sanitized, 'UTF-8', true));
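Note that these assertions only show that the result is valid UTF-8. utf8_encode() treats its input as ISO-8859-1 (and is deprecated as of PHP 8.2), so KOI8-R input does not come back as the original Cyrillic text. If the source encoding is known, a variant along the following lines could also be tested for a correct round trip. This is a sketch, not the original poster's method: reEncodeFrom() and its $sourceEncoding parameter are hypothetical, and mbstring plus a UTF-8 source file are assumed.

```php
<?php
// Assumes mbstring is loaded and this source file is saved as UTF-8.

// Hypothetical variant of reEncode(): the caller states the source encoding
// instead of relying on utf8_encode()'s ISO-8859-1 assumption.
function reEncodeFrom(string $item, string $sourceEncoding): string
{
    // Leave input alone when it is already valid UTF-8.
    if (mb_check_encoding($item, 'UTF-8')) {
        return $item;
    }

    return mb_convert_encoding($item, 'UTF-8', $sourceEncoding);
}

$original  = 'Котёнок';
$koi8      = mb_convert_encoding($original, 'KOI8-R', 'UTF-8');
$roundTrip = reEncodeFrom($koi8, 'KOI8-R');

var_dump(mb_check_encoding($koi8, 'UTF-8')); // the KOI8-R bytes are not valid UTF-8
var_dump($roundTrip === $original);          // the text survives the round trip
```

Comparing against the original string is a stricter check than asserting on the detected encoding alone.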

This question, "php - Testing a non-UTF-8 string", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/41985486/
