I want to parse a file that contains lines like the following:
simple word abbr -8. (012) word, simple phrase, one another phrase - (simply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!; "It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose") This is the end of the line
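Here is an explanation of the parts (because of the markup here, not every space is described):
- `simple word`: one or several words (a phrase) separated by whitespace
- `abbr -`: a fixed part of the string (never changes)
- `8`: an optional number
- `.`: always included
- `word, simple phrase, one another phrase`: one or several words or phrases separated by commas
- `- (`: a fixed part, always included
- `simply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!;`: (optional) one or several phrases separated by `;`
- `"It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose"`: (optional) one or several phrases in quotation marks `"`, separated by `;`
- `) This is the end of the line`: always included

In the worst case the clause contains no phrases, but that is uncommon: there should be a phrase without enclosing quotation marks (phrase1 type) or with them (phrase2 type). The phrases are natural-language sentences (with every possible punctuation mark)...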
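But:
- each phrase only needs to be marked as phrase1 or phrase2 type:
- if it is found between `(` and `;`, or `;` and `;`, or `;` and `)`, or even between `(` and `)`, and is enclosed in quotation marks, it is phrase2 type
- otherwise it is phrase1 type

Since writing a regular expression (PCRE) for such input would be overkill, I looked into parsing approaches (EBNF or similar). I ended up with the PEG.js parser generator. I created a basic variant of the grammar (it does not even handle the part with the different phrases in the clause):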
start = term _ "abbr" _ "-" .+
term = word (_? word !(_ "abbr" _ "-"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
= [ \t\n\r]*
Or (the only difference is `" abbr -"` instead of `_ "abbr" _ "-"`):
start = term " abbr -" .+
term = word (_? word !(" abbr -"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
= [ \t\n\r]*
But even this simple grammar cannot parse the beginning of the string. The errors are:
Parse Error Expected [A-Za-z] but " " found.
Parse Error Expected "abbr" but "-" found.
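So it looks like the problem is an ambiguity: `"abbr"` gets consumed together with `term` as a `word` token, even though I defined the rule `!(" abbr -")`, which I took to mean that the next `word` token is consumed only if the upcoming substring is not `" abbr -"`.
I have not found any good examples explaining the following PEG.js expressions, which look to me like a possible solution to the problem above [from: http://pegjs.majda.cz/documentation]: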
& expression
! expression
$ expression
& { predicate }
! { predicate }
TL;DR:
PEG.js related: are there any examples of how to apply these expressions?
& expression
! expression
$ expression
& { predicate }
! { predicate }
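General question:
What is a general approach to disambiguating a grammar like this? (... `-` characters.)
Update 1:
I found a rule that resolves the ambiguity when matching `" abbr -"`:
term = term:(word (!" abbr -" _? word))+ {return term.join("")}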
But the result looks strange:
[
"simple, ,word",
" abbr -",
[
"8",
...
],
...
]
If the trailing action (`{return term.join("")}`) is removed:
term = term:(word (!" abbr -" _? word))+
the result is:
[
[
"simple",
[
[
undefined,
[
" "
],
"word"
]
]
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
I expected something like this:
[
[
"simple word"
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
Or at least:
[
[
"simple",
[
" ",
"word"
]
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
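The expression is grouped, so why is it split into so many nesting levels, and why is even `undefined` included in the output? Are there any general rules for folding the result according to the expressions in a rule?
Update 2:
I created a grammar that parses the input the way I need, although I have not yet worked out a clear process for creating such a grammar: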
start
= (term:term1 (" abbr -" number "." _ "("number:number") "{return number}) terms:terms2 ((" - (" phrases:phrases ")" .+){return phrases}))
//start //alternative way = looks better
// = (term:term1 " abbr -" number "." _ "("number:number") " terms:terms2 " - (" phrases:phrases ")" .+){return {term: term, number: number, phrases:phrases}}
term1
= term1:(
start_word:word
(rest_words:(
rest_word:(
(non_abbr:!" abbr -"{return non_abbr;})
(space:_?{return space[0];}) word){return rest_word.join("");})+{return rest_words.join("")}
)) {return term1.join("");}
terms2
= terms2:(start_word:word (rest_words:(!" - (" ","?" "? word)+){rest_words = rest_words.map(function(array) {
return array.filter(function(n){return n != null;}).join("");
}); return start_word + rest_words.join("")})
phrases
// = ((phrase_t:(phrase / '"' phrase '"') ";"?" "?){return phrase_t})+
= (( (phrase:(phrase2 / phrase1) ";"?" "?) {return phrase;})+)
phrase2
= (('"'pharse2:(phrase)'"'){return {phrase2: pharse2}})
phrase1
= ((pharse1:phrase){return {phrase1: pharse1}})
phrase
= (general_phrase:(!(';' / ')' / '";' / '")') .)+ ){return general_phrase.map(function(array){return array[1]}).join("")}
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
number = digits:digit+{return digits.join("")}
digit = [0-9]
_ "whitespace"
= [ \t\n\r]*
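It can be tested on the PEG.js author's site: [http://pegjs.majda.cz/online] or in the PEG.js Web-IDE: [http://peg.arcanis.fr/]
If someone has answers to the earlier questions (i.e. a general approach to disambiguating a grammar, and examples of the expressions available in PEG.js), as well as suggestions for improving the grammar itself (I think it is far from ideal), I would be very grateful!
Best answer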
so why is it separated in so many nesting levels and even undefined is included in the output?
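If you look at the documentation for PEG.js, you will see that almost every operator collects the results of its operands into an array. The `undefined` is returned by the `!` operator.
The `$` operator bypasses all this nesting and gives you just the actual string that matched, e.g. `[a-z]+` gives an array of letters, but `$[a-z]+` gives a string of letters.
I think most of the parsing here follows the pattern "give me everything until I see this string". You should express this in PEG by first using `!` to make sure you have not hit the terminating string, and then just consuming the next character. For example, to get everything up to " abbr -":
(!" abbr -" .)+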
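To make the "everything until this string" pattern concrete, here is a minimal sketch in plain JavaScript. It is not how PEG.js is implemented; the function `matchUntil` and its return fields are hypothetical names used only to mimic what `(!terminator .)+` does step by step, and to show where the `undefined` entries and the nesting come from.

```javascript
// Hand-rolled illustration only; PEG.js generates its own matching code.
// Mimics the PEG expression  (!terminator .)+  starting at `pos`.
function matchUntil(input, pos, terminator) {
  const raw = [];
  let i = pos;
  // `!terminator` peeks ahead without consuming anything and yields undefined;
  // `.` then consumes exactly one character.
  while (i < input.length && !input.startsWith(terminator, i)) {
    raw.push([undefined, input[i]]);
    i += 1;
  }
  // `+` requires at least one successful iteration.
  if (raw.length === 0) return null;
  return {
    raw,                       // nested shape you get without `$`
    text: input.slice(pos, i), // the plain string `$( ... )` would give
    next: i,                   // position where parsing would continue
  };
}

const m = matchUntil("simple word abbr -8.", 0, " abbr -");
console.log(m.text);   // "simple word"
console.log(m.raw[0]); // first pair: undefined from `!`, then the character "s"
```

The `raw` field is the reason an unwrapped rule needs something like `map(function (pair) { return pair[1]; }).join("")` in its action, while the `$` form needs no action at all.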
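If the terminating string is a single character, you can use `[^]` as a short form of this, e.g. `[^x]+` is a shorter way of saying `(!"x" .)+`.
Parsing comma/semicolon-separated phrases rather than comma/semicolon-terminated phrases is slightly annoying, but treating the separators as optional terminators seems to work (with some `trim`ing).
start = $(!" abbr -" .)+ " abbr -" $num "." [ ]? "(012)"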
phrase_comma+ "- (" noq_phrase_semi+ q_phrase_semi+ ")"
$.*
phrase_comma = p:$[^-,]+ [, ]* { return p.trim() }
noq_phrase_semi = !'"' p:$[^;]+ [; ]* { return p.trim() }
q_phrase_semi = '"' p:$[^"]+ '"' [; ]* { return p }
num = [0-9]+
gives:
[
"simple word",
" abbr -",
"8",
".",
" ",
"(012)",
[
"word",
"simple phrase",
"one another phrase"
],
"- (",
[
"simply dummy text of the printing",
"Lorem Ipsum : \"Lorem\" - has been the industry's standard dummy text, ever since the 1500s!"
],
[
"It is a long established!",
"Sometimes by accident, sometimes on purpose (injected humour and the like)",
"sometimes on purpose"
],
")",
" This is the end of the line"
]
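The array above is positional, so any code that consumes it has to know which index holds what. As a rough sketch, it can be reshaped in plain JavaScript after parsing; the helper `toRecord` and the field names are hypothetical, and the phrase1/phrase2 tags simply follow the naming used in the question. The same effect could also be had inside the grammar with labels and an action on `start`.

```javascript
// Hypothetical post-processing of the array shown above.
// Index positions follow the order of the elements in the `start` rule.
function toRecord(result) {
  return {
    term: result[0],
    number: result[2],
    terms: result[6],
    phrases: [
      ...result[8].map((p) => ({ phrase1: p })), // phrases without quotes
      ...result[9].map((p) => ({ phrase2: p })), // phrases in quotes
    ],
  };
}

// Shortened sample in the same shape as the parser output above.
const sample = [
  "simple word", " abbr -", "8", ".", " ", "(012)",
  ["word", "simple phrase", "one another phrase"],
  "- (",
  ["simply dummy text of the printing"],
  ["It is a long established!", "sometimes on purpose"],
  ")", " This is the end of the line",
];

console.log(JSON.stringify(toRecord(sample), null, 2));
```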
Regarding "regex - Issues with ambiguous grammar and PEG.js (no examples found)", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/24659684/