I'm having trouble using a bash regex to capture the number from a string of the form (t|b|bug_|task_|)1234. The following doesn't work:
[[ $current_branch =~ ^(t|b|bug_|task_|)([0-9]+) ]]
But once I change it to this:
[[ $current_branch =~ ^(t|b|bug_|task_)([0-9]+) ]]
it works, but of course it is wrong, because it doesn't cover the case where there is no prefix. I realize that in this case I could write
[[ $current_branch =~ ^(t|b|bug_|task_)?([0-9]+) ]]
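and achieve the same result, but I'd like to know why the first example, the one with the empty alternative, doesn't work. That regex seems to work fine in Ruby, for example.
(This is on GNU bash, version 3.2.48(1)-release (x86_64-apple-darwin11), on OS X Lion.)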
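Best answer
I'm certain that the difference between the working and non-working versions of the regex comes down to different ways of reading regex(7). I'll quote the whole relevant section, because I think it goes to the heart of your question: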
Regular expressions ("RE"s), as defined in POSIX.2, come in two forms: modern REs (roughly those of egrep; POSIX.2 calls these "extended" REs) and obsolete REs (roughly those of ed(1); POSIX.2 "basic" REs). Obsolete REs mostly exist for backward compatibility in some old programs; they will be discussed at the end. POSIX.2 leaves some aspects of RE syntax and semantics open; "(!)" marks decisions on these aspects that may not be fully portable to other POSIX.2 implementations.
A (modern) RE is one(!) or more nonempty(!) branches, separated by '|'. It matches anything that matches one of the branches.
A branch is one(!) or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.
A piece is an atom possibly followed by a single(!) '*', '+', '?', or bound. An atom followed by '*' matches a sequence of 0 or more matches of the atom. An atom followed by '+' matches a sequence of 1 or more matches of the atom. An atom followed by '?' matches a sequence of 0 or 1 matches of the atom.
A bound is '{' followed by an unsigned decimal integer, possibly followed by ',' possibly followed by another unsigned decimal integer, always followed by '}'. The integers must lie between 0 and RE_DUP_MAX (255(!)) inclusive, and if there are two of them, the first may not exceed the second. An atom followed by a bound containing one integer i and no comma matches a sequence of exactly i matches of the atom. An atom followed by a bound containing one integer i and a comma matches a sequence of i or more matches of the atom. An atom followed by a bound containing two integers i and j matches a sequence of i through j (inclusive) matches of the atom.
An atom is a regular expression enclosed in "()" (matching a match for the regular expression), an empty set of "()" (matching the null string)(!), a bracket expression (see below), '.' (matching any single character), '^' (matching the null string at the beginning of a line), '$' (matching the null string at the end of a line), a '\' followed by one of the characters "^.[$()|*+?{\" (matching that character taken as an ordinary character), a '\' followed by any other character(!) (matching that character taken as an ordinary character, as if the '\' had not been present(!)), or a single character with no other significance (matching that character). A '{' followed by a character other than a digit is an ordinary character, not the beginning of a bound(!). It is illegal to end an RE with '\'.
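OK, there's a lot to unpack here. First, note that the "(!)" symbol marks an aspect that is open or not portable.
The critical issue is in the following paragraph: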
A (modern) RE is one(!) or more nonempty(!) branches, separated by '|'.
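In your case, you have an empty branch. As the "(!)" shows, an empty branch is an open or non-portable matter. I think that is why it works on some systems and not on others. (I tested it on Cygwin 4.1.10(4)-release, where it did not work, and then on Linux 3.2.25(1)-release, where it did. The two systems have equivalent, but not identical, man pages for regex(7).)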
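One way to see which behavior a given system has is to look at the exit status of `[[ ... =~ ... ]]`: bash documents 0 for a match, 1 for no match, and 2 when the regular expression is syntactically invalid. The following is a minimal sketch and not part of the original answer. `regex_status` is a hypothetical helper name, and the result for the empty-branch pattern depends on the local regex library.

```bash
#!/usr/bin/env bash

# regex_status SUBJECT PATTERN
# Hypothetical helper: prints how the local regex library treats PATTERN.
# [[ =~ ]] exit status: 0 = match, 1 = no match, 2 = invalid regex.
regex_status() {
  local subject=$1 pattern=$2 status=0
  # Keep the pattern in an unquoted variable so it is used as a regex.
  [[ $subject =~ $pattern ]] || status=$?
  case $status in
    0) echo "match" ;;
    1) echo "no-match" ;;
    *) echo "invalid" ;;
  esac
}

# Empty branch: the outcome is platform dependent.
regex_status "1234" '^(t|b|bug_|task_|)([0-9]+)'
# Optional group: stays within the grammar quoted above.
regex_status "1234" '^(t|b|bug_|task_)?([0-9]+)'
```

If the first call prints "invalid", the library is rejecting the empty branch outright rather than merely failing to match.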
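Given that a branch must be nonempty, a branch can be a single piece, which can be a single atom.
An atom can be "an empty set of "()" (matching the null string)(!)". <sarcasm>Well, that's really helpful.</sarcasm>
So POSIX does specify a regular expression for the empty string, namely (), but it also attaches a "(!)" to say that this is an open issue, or not portable.
Since what you are looking for is a branch that matches the empty string, try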
[[ $current_branch =~ ^(t|b|bug_|task_|())([0-9]+) ]]
which uses the () regular expression to match the empty string. (This worked for me in my Cygwin 4.1.10(4)-release shell, where your original regex did not.)
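One side effect worth noting: the added () is itself a capture group, so it becomes group 2 and the digits move from BASH_REMATCH[2] to BASH_REMATCH[3]. Here is a minimal sketch of reading the number with that pattern. `number_with_empty_atom` is a hypothetical name, and whether the pattern is accepted at all still depends on the local regex library.

```bash
#!/usr/bin/env bash

# number_with_empty_atom BRANCH
# Hypothetical helper using the () alternative suggested above.
# Returns 1 on no match, 2 if the regex library rejects the pattern.
number_with_empty_atom() {
  local re='^(t|b|bug_|task_|())([0-9]+)'
  local status=0
  [[ $1 =~ $re ]] || status=$?
  (( status == 0 )) || return "$status"
  # Group 1 = prefix, group 2 = the empty () atom, group 3 = the digits.
  printf '%s\n' "${BASH_REMATCH[3]}"
}

if number=$(number_with_empty_atom "task_1234"); then
  echo "number: $number"
else
  echo "pattern did not match or was rejected"
fi
```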
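However, while this suggestion will (hopefully) work for you in your current setup, there is no guarantee that it is portable. Sorry to disappoint.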
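For comparison, the `?` form from the question avoids both the empty branch and the empty () atom, so it relies on neither of the "(!)" points discussed above. Below is a minimal sketch of wrapping it in a function. `branch_number` and the sample branch names are made up for illustration.

```bash
#!/usr/bin/env bash

# branch_number BRANCH
# Hypothetical helper: prints the number after an optional t, b, bug_ or
# task_ prefix. Returns non-zero if BRANCH does not start that way.
branch_number() {
  local re='^(t|b|bug_|task_)?([0-9]+)'
  [[ $1 =~ $re ]] || return 1
  # Group 1 is the prefix (empty when absent), group 2 is the digits.
  printf '%s\n' "${BASH_REMATCH[2]}"
}

for branch in task_1234 bug_7 t42 1234 feature; do
  if number=$(branch_number "$branch"); then
    echo "$branch -> $number"
  else
    echo "$branch -> no number"
  fi
done
```

Storing the pattern in a variable and expanding it unquoted keeps the parsing of `|`, `(` and `)` consistent across bash versions.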
Regarding regex - How to match "something or nothing" in a bash regex?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10577364/