php - preg_split is case-insensitive for special characters

Tags: php regex string

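I am writing a script that splits a string (a restaurant menu) at capital letters. Unfortunately, some Czech words begin with special characters that carry diacritics. Splitting the dishes at the "common" capital letters works fine, but my regex seems to be case-insensitive for some special characters: it splits the string at, e.g., š when it should split only at Š. Strangely, some special characters work fine; so far the only problematic letter is š/Š. Can anyone help?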

$dishes = preg_split('/(?=[ABCDEFGHIJKLMNOPQRSTUVWXYZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽĹÔÄËÏÖÜŸ])/', $dishes); 
print_r($dishes);

The code above returns:

Array
(
    [0] =>
    [1] => Vepřová plec na paprice s těstovinami
    [2] => Domácí sekaná s bramborovou ka
    [3] => ší
    [4] => Těstoviny s rajským jablkem, olivami a žervé
    [5] => Domácí sekaná s e svíčkovou omáčkou
    [6] => Uzená kýta s čočkou na kyselo a vejcem 
    [7] => Vepřové  nudličky se zeleninou a rýží
    [8] => Pečená vepřová plec na medu a pivu s bramborami
    [9] => Plzeňský gulá
    [10] => š
    [11] => Hovězí zadní se svíčkovou omáčkou, citron, brusinky, 
    [12] => šlehačka
)

(Never mind the empty first element.) Thanks!

Best answer

When you use a regex in the PHP preg functions to process Unicode input, remember to add the /u pattern modifier:

$dishes = preg_split('/(?=[ABCDEFGHIJKLMNOPQRSTUVWXYZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽĹÔÄËÏÖÜŸ])/u', $dishes);

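Note that the PHP file itself must be saved as UTF-8!

The modifier makes the regex engine treat both the pattern and the subject as UTF-8 strings, so non-ASCII characters in the pattern are handled as whole characters rather than as separate bytes.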
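As a minimal sketch of the fix, the snippet below applies the same character class with /u to a made-up menu string. The sample dishes and the use of PREG_SPLIT_NO_EMPTY (to drop the empty leading element) are assumptions for illustration, not part of the original question.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literals below are UTF-8 text.
// Hypothetical sample input: three dishes run together without a separator.
$menu = 'Plzeňský gulášŠunka s křenemČočka na kyselo';

// Same explicit character class as in the question, plus the /u modifier.
// The lookahead matches the empty position before each listed capital letter.
$pattern = '/(?=[ABCDEFGHIJKLMNOPQRSTUVWXYZÁČĎÉĚÍŇÓŘŠŤÚŮÝŽĹÔÄËÏÖÜŸ])/u';

// PREG_SPLIT_NO_EMPTY drops the empty piece produced by the match at offset 0.
$parts = preg_split($pattern, $menu, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
// Array
// (
//     [0] => Plzeňský guláš
//     [1] => Šunka s křenem
//     [2] => Čočka na kyselo
// )
```

The lowercase š at the end of "guláš" no longer triggers a split, because with /u the class is compared code point by code point.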

If you need to match any Unicode uppercase letter (as LucasTrzesniewski mentioned in the comments), you can use the \p{Lu} Unicode category class:

$dishes = preg_split('/(?=\p{Lu})/u', $dishes);
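The \p{Lu} variant could be wrapped in a small helper, sketched here. The function name splitDishes, the trimming step, and the sample menu are hypothetical and only illustrate one way to tidy the result.

```php
<?php
// Assumption: this file is saved as UTF-8.

/**
 * Hypothetical helper: split a menu string before every Unicode
 * uppercase letter and trim surrounding whitespace from each piece.
 */
function splitDishes(string $menu): array
{
    // \p{Lu} matches any uppercase letter; /u enables UTF-8 mode.
    $parts = preg_split('/(?=\p{Lu})/u', $menu, -1, PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        // preg_split() returns false on failure, e.g. invalid UTF-8 input.
        return [];
    }
    return array_map('trim', $parts);
}

// Hypothetical sample input.
$dishes = splitDishes('Uzená kýta s čočkou Vepřové nudličky se zeleninou a rýží Špenát');
print_r($dishes);
// Array
// (
//     [0] => Uzená kýta s čočkou
//     [1] => Vepřové nudličky se zeleninou a rýží
//     [2] => Špenát
// )
```

Keep in mind that this splits before every uppercase letter, so a capitalized word in the middle of a dish name (a proper noun, for instance) would also start a new piece.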

Note that you do not always have to use the /u modifier. See Daniel Klein's comment:

It is not a requirement, however, as you may have a need to break apart utf-8 sequences into single bytes. Most of the time, though, if you're working with utf-8 strings you should use the 'u' modifier.

If the subject doesn't contain any utf-8 sequences (i.e. characters in the range 0x00-0x7F only) but the pattern does, as far as I can work out, setting the 'u' modifier would have no effect on the result.
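The difference the comment describes can be seen by splitting a single non-ASCII character with an empty pattern, with and without /u. This is only an illustrative sketch; the "\u{...}" escape requires PHP 7 or later and is used so that the example does not depend on the file's encoding.

```php
<?php
// "\u{0161}" is š, which UTF-8 encodes as the two bytes C5 A1.
$s = "\u{0161}";

// Without /u the subject is treated as a sequence of bytes.
$bytes = preg_split('//', $s, -1, PREG_SPLIT_NO_EMPTY);

// With /u the subject is treated as a sequence of code points.
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), "\n";                             // 2
echo bin2hex($bytes[0]), ' ', bin2hex($bytes[1]), "\n"; // c5 a1
echo count($chars), "\n";                             // 1
```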

There are further caveats from hfuecks:

Regarding the validity of a UTF-8 string when using the /u pattern modifier, some things to be aware of;

  1. If the pattern itself contains an invalid UTF-8 character, you get an error (as mentioned in the docs above - "UTF-8 validity of the pattern is checked since PHP 4.3.5")
  2. When the subject string contains invalid UTF-8 sequences / codepoints, it basically result in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8
  3. PCRE regards five and six octet UTF-8 character sequences as valid (both in patterns and the subject string) but these are not supported in Unicode ( see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO" - can be found at http://www.tldp.org/ and other places )
  4. For an example algorithm in PHP which tests the validity of a UTF-8 string (and discards five / six octet sequences) head to: http://hsivonen.iki.fi/php-utf8/
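The "quiet death" in point 2 can be detected by checking the return value and preg_last_error(). The sketch below is one possible approach, not the algorithm linked above: the helper name isValidUtf8 and the sample bytes are hypothetical, and the validity check relies on the common idiom of matching an empty pattern with /u.

```php
<?php
/**
 * Hypothetical helper: report whether a string is valid UTF-8 by running
 * an empty /u pattern over it. preg_match() returns false on invalid input.
 */
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Hypothetical input: "Plzeňský" encoded as Windows-1250, which is not valid UTF-8.
$bad = "Plze\xF2sk\xFD";

// With /u and an invalid subject, preg_split() fails and returns false.
$result = preg_split('/(?=\p{Lu})/u', $bad);
$error  = preg_last_error(); // read immediately, before any other preg_* call

var_dump($result === false);               // bool(true)
var_dump($error === PREG_BAD_UTF8_ERROR);  // bool(true)
var_dump(isValidUtf8($bad));               // bool(false)
var_dump(isValidUtf8("\u{0160}unka"));     // bool(true), "Šunka"
```

If the input comes from a source with a known single-byte encoding, converting it to UTF-8 before splitting is one way to avoid the failure.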

So, trying

$dishes = preg_split('/(?=\p{Lu})/u', $dishes);

may be enough for your case, depending on what you want to achieve.

Regarding php - preg_split being case-insensitive for special characters, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32843094/
