php - preg_split is case-insensitive for special characters

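I'm writing a script that splits a string (a restaurant menu) on capital letters. Unfortunately, in Czech some words begin with special characters that carry diacritics. Splitting the dishes on "common" capital letters works fine, but my regex is case-insensitive for some special characters: it splits the string at e.g. š when it should split only at Š. Strangely, some special characters work fine; so far the only problematic letter is š/Š. Can anyone help me?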

$dishes = preg_split('/(?=[ABCDEFGHIJKLMNOPQRSTUVWXYZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽĹÔÄËÏÖÜŸ])/', $dishes); 
print_r($dishes);

The code above returns

Array
(
    [0] =>
    [1] => Vepřová plec na paprice s těstovinami
    [2] => Domácí sekaná s bramborovou ka
    [3] => ší
    [4] => Těstoviny s rajským jablkem, olivami a žervé
    [5] => Domácí sekaná s e svíčkovou omáčkou
    [6] => Uzená kýta s čočkou na kyselo a vejcem 
    [7] => Vepřové  nudličky se zeleninou a rýží
    [8] => Pečená vepřová plec na medu a pivu s bramborami
    [9] => Plzeňský gulá
    [10] => š
    [11] => Hovězí zadní se svíčkovou omáčkou, citron, brusinky, 
    [12] => šlehačka
)

(Never mind the first empty line.) Thanks!

Best answer

When you use regular expressions on Unicode input in the PHP preg functions, remember to use the /u regex modifier:

$dishes = preg_split('/(?=[ABCDEFGHIJKLMNOPQRSTUVWXYZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽĹÔÄËÏÖÜŸ])/u', $dishes);

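Note that the PHP file itself needs to be saved in UTF-8 encoding!

The modifier makes the regex engine treat both the pattern and the input as UTF-8 strings rather than as raw bytes, so patterns containing non-ASCII characters are handled correctly.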
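
To illustrate the difference between byte mode and UTF-8 mode, here is a minimal sketch. The variable names are made up for the example, and the `\u{...}` escapes assume PHP 7.0 or later; they are used only so the result does not depend on how this file is saved. It shows one way a non-/u character class can produce false matches, which is not necessarily the exact cause of the output in the question.

```php
<?php
// Escapes keep the sketch independent of this file's encoding.
$upperS = "\u{0160}"; // Š, bytes C5 A0 in UTF-8
$lowerR = "\u{0159}"; // ř, bytes C5 99 in UTF-8

$pattern = '/[' . $upperS . ']/';

// Byte mode: the class holds the two bytes C5 and A0, so the lead byte of ř matches.
$byteMode = preg_match($pattern, $lowerR);

// UTF-8 mode: the class holds the single code point U+0160, which ř is not.
$unicodeMode = preg_match($pattern . 'u', $lowerR);

var_dump($byteMode, $unicodeMode); // int(1), int(0)
```

Without /u, every multi-byte letter in a character class contributes its individual bytes, so unrelated letters that share one of those bytes can match.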

If you need to match any Unicode uppercase letter (as LucasTrzesniewski mentioned in the comments above), you can use the \p{Lu} Unicode category class:

$dishes = preg_split('/(?=\p{Lu})/u', $dishes);
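
As a rough, self-contained sketch of that call: the `$menu` string below is invented sample input (not the asker's data), and the file is assumed to be saved as UTF-8. The `PREG_SPLIT_NO_EMPTY` flag is an optional addition that drops the empty first element seen in the question's output.

```php
<?php
// Hypothetical sample input; assumes this file is saved as UTF-8.
$menu = 'Plzeňský gulášŠpagety se šlehačkouČočka na kyselo';

// Split before every Unicode uppercase letter; -1 means "no limit".
$dishes = preg_split('/(?=\p{Lu})/u', $menu, -1, PREG_SPLIT_NO_EMPTY);

print_r($dishes);
// Array
// (
//     [0] => Plzeňský guláš
//     [1] => Špagety se šlehačkou
//     [2] => Čočka na kyselo
// )
```

The lowercase š in "guláš" and "šlehačkou" no longer triggers a split, while the uppercase Š and Č do.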

Note that sometimes you do not have to use the /u modifier. See Daniel Klein's comment:

It is not a requirement, however, as you may have a need to break apart utf-8 sequences into single bytes. Most of the time, though, if you're working with utf-8 strings you should use the 'u' modifier.

If the subject doesn't contain any utf-8 sequences (i.e. characters in the range 0x00-0x7F only) but the pattern does, as far as I can work out, setting the 'u' modifier would have no effect on the result.
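
The case mentioned in that comment, breaking a UTF-8 sequence into single bytes, can be sketched as follows. The variable names are illustrative, and the `\u{...}` escape assumes PHP 7.0 or later.

```php
<?php
$upperS = "\u{0160}"; // Š, two bytes in UTF-8

// Byte mode: the empty pattern matches between every byte.
$bytes = preg_split('//', $upperS, -1, PREG_SPLIT_NO_EMPTY);

// UTF-8 mode: the empty pattern matches only between code points.
$chars = preg_split('//u', $upperS, -1, PREG_SPLIT_NO_EMPTY);

var_dump(count($bytes), count($chars)); // int(2), int(1)
echo bin2hex($bytes[0]), ' ', bin2hex($bytes[1]), "\n"; // c5 a0
```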

And some more caveats from hfuecks:

Regarding the validity of a UTF-8 string when using the /u pattern modifier, some things to be aware of;

If the pattern itself contains an invalid UTF-8 character, you get an error (as mentioned in the docs above - "UTF-8 validity of the pattern is checked since PHP 4.3.5"

When the subject string contains invalid UTF-8 sequences / codepoints, it basically result in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8

PCRE regards five and six octet UTF-8 character sequences as valid (both in patterns and the subject string) but these are not supported in Unicode ( see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO" - can be found at http://www.tldp.org/ and other places )

For an example algorithm in PHP which tests the validity of a UTF-8 string (and discards five / six octet sequences) head to: http://hsivonen.iki.fi/php-utf8/
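
One way to avoid the "quiet death" described above is to check the return value and `preg_last_error()` after the call. This is a minimal sketch, assuming a deliberately broken sample subject; it does not address the five and six octet sequences mentioned in the comment.

```php
<?php
// "\xC5" is a lone UTF-8 lead byte, so this sample subject is not valid UTF-8.
$broken = "Plze\xC5 gul";

$result = preg_split('/(?=\p{Lu})/u', $broken);

// With /u, an invalid subject makes preg_split() return false
// and sets the last error to PREG_BAD_UTF8_ERROR.
if ($result === false && preg_last_error() === PREG_BAD_UTF8_ERROR) {
    echo "Subject is not valid UTF-8\n";
}
```

If the input may arrive in another encoding, it would have to be converted to UTF-8 before matching with /u.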

So, in your case,

$dishes = preg_split('/(?=\p{Lu})/u', $dishes);

may be enough, depending on what you are trying to achieve.

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/32843094/
