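I am working on a script that splits a string (a restaurant menu) at capital letters. Unfortunately, in Czech some words start with special characters that carry diacritics. Splitting the dishes at "common" capital letters works well, but my regex is case-insensitive for some special characters: it splits the string at, e.g., š when it should split only at Š. Strangely, some special characters work fine; the only problematic letter so far is š/Š. Can anybody help me?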
$dishes = preg_split('/(?=[ABCDEFGHIJKLMNOPQRSTUVWXYZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽĹÔÄËÏÖÜŸ])/', $dishes);
print_r($dishes);
The code above returns
Array
(
[0] =>
[1] => Vepřová plec na paprice s těstovinami
[2] => Domácí sekaná s bramborovou ka
[3] => ší
[4] => Těstoviny s rajským jablkem, olivami a žervé
[5] => Domácí sekaná s e svíčkovou omáčkou
[6] => Uzená kýta s čočkou na kyselo a vejcem
[7] => Vepřové nudličky se zeleninou a rýží
[8] => Pečená vepřová plec na medu a pivu s bramborami
[9] => Plzeňský gulá
[10] => š
[11] => Hovězí zadní se svíčkovou omáčkou, citron, brusinky,
[12] => šlehačka
)
(Never mind the first empty row.) Thanks!
Best answer
When you use regular expressions in PHP preg functions to process Unicode input, remember to use the /u regex modifier:
$dishes = preg_split('/(?=[ABCDEFGHIJKLMNOPQRSTUVWXYZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽĹÔÄËÏÖÜŸ])/u', $dishes);
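Note that the PHP file itself must be saved in UTF-8 encoding!
The /u modifier makes the regex engine treat both the pattern and the input as Unicode (UTF-8) strings, so non-ASCII characters in the pattern are handled correctly.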
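As a minimal sketch of the fix, assuming the script is saved as UTF-8: the sample string $menu below is made up for illustration, and the PREG_SPLIT_NO_EMPTY flag is an optional extra (not part of the original answer) that drops the empty first element mentioned in the question.

```php
<?php
// Hypothetical sample: three dishes concatenated with no separator.
$menu = 'Plzeňský gulášŠpenát s vejcemČočka na kyselo';

// Same character class as in the question, with the /u modifier added.
$pattern = '/(?=[ABCDEFGHIJKLMNOPQRSTUVWXYZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽĹÔÄËÏÖÜŸ])/u';

// PREG_SPLIT_NO_EMPTY drops empty pieces, e.g. the one before the first dish.
$dishes = preg_split($pattern, $menu, -1, PREG_SPLIT_NO_EMPTY);

print_r($dishes);
// Expected pieces: "Plzeňský guláš", "Špenát s vejcem", "Čočka na kyselo"
```

With /u, the lowercase š inside "guláš" no longer triggers a split, while the uppercase Š and Č still do.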
If you need to match any Unicode uppercase letter (as LucasTrzesniewski mentioned in the comments above), you can use the \p{Lu} Unicode category class:
$dishes = preg_split('/(?=\p{Lu})/u', $dishes);
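The following sketch shows how \p{Lu} behaves with the letter pair from the question. The $menu string is a hypothetical sample, and PREG_SPLIT_NO_EMPTY is again an optional flag that removes the leading empty element.

```php
<?php
// \p{Lu} matches any Unicode uppercase letter; /u makes PCRE read the
// UTF-8 bytes as whole characters.
var_dump(preg_match('/^\p{Lu}/u', 'Š')); // int(1): uppercase
var_dump(preg_match('/^\p{Lu}/u', 'š')); // int(0): lowercase

// Hypothetical sample: two dishes concatenated with no separator.
$menu = 'Domácí sekaná s bramborovou kašíŽebírka na medu';

$dishes = preg_split('/(?=\p{Lu})/u', $menu, -1, PREG_SPLIT_NO_EMPTY);

print_r($dishes);
// Expected pieces: "Domácí sekaná s bramborovou kaší", "Žebírka na medu"
```

Keep in mind that this splits before every uppercase letter, so a capitalized word in the middle of a dish name would also start a new piece.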
Note that sometimes you do not have to use the /u modifier. See Daniel Klein's comment:
It is not a requirement, however, as you may have a need to break apart utf-8 sequences into single bytes. Most of the time, though, if you're working with utf-8 strings you should use the 'u' modifier.
If the subject doesn't contain any utf-8 sequences (i.e. characters in the range 0x00-0x7F only) but the pattern does, as far as I can work out, setting the 'u' modifier would have no effect on the result.
There are also some further caveats from hfuecks:
Regarding the validity of a UTF-8 string when using the /u pattern modifier, some things to be aware of;
- If the pattern itself contains an invalid UTF-8 character, you get an error (as mentioned in the docs above - "UTF-8 validity of the pattern is checked since PHP 4.3.5"
- When the subject string contains invalid UTF-8 sequences / codepoints, it basically result in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8
- PCRE regards five and six octet UTF-8 character sequences as valid (both in patterns and the subject string) but these are not supported in Unicode ( see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO" - can be found at http://www.tldp.org/ and other places )
- For an example algorithm in PHP which tests the validity of a UTF-8 string (and discards five / six octet sequences) head to: http://hsivonen.iki.fi/php-utf8/
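The "quiet death" described above can be detected after the call. This is a minimal sketch, assuming a reasonably recent PHP version in which preg_split() returns false for an invalid UTF-8 subject under /u and preg_last_error() reports PREG_BAD_UTF8_ERROR; the $subject value is a deliberately broken, hypothetical input.

```php
<?php
// Hypothetical input: "\xC5" is a lone UTF-8 lead byte, so this is not valid UTF-8.
$subject = "Plze\xC5 gulas";

$result = preg_split('/(?=\p{Lu})/u', $subject);

// On failure preg_split() returns false instead of an array;
// preg_last_error() tells us why the last preg_* call failed.
if ($result === false && preg_last_error() === PREG_BAD_UTF8_ERROR) {
    echo "Subject is not valid UTF-8\n";
}
```

Checking the return value this way avoids silently treating a failed split as "no dishes found".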
So, trying
$dishes = preg_split('/(?=\p{Lu})/u', $dishes);
may be enough for your case, depending on what you are trying to achieve.
Source: "php - preg_split is case-insensitive for special characters" on Stack Overflow: https://stackoverflow.com/questions/32843094/