regex - Elisp mechanism for converting PCRE regexps to Emacs regexps

Tags: regex emacs elisp pcre

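I admit a significant bias toward liking PCRE regexps much better than Emacs regexps, if for no other reason than that when I type a '(' I almost always want a grouping operator. And, of course, \w and similar are much more convenient than the other equivalents.

Of course, it would be crazy to expect to change the internals of Emacs. But I would think it should be possible to convert from a PCRE expression to an Emacs expression, doing all the needed conversions, so that I can write: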

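(defun my-super-regexp-function ...
   ;; `re-search-forward' treats its argument as a regexp;
   ;; `search-forward' would look for the text literally.
   (re-search-forward (pcre-convert "__\\w: \\d+")))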

(or similar).
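
For comparison, a hand-written Emacs version of that pattern might look like the sketch below. It is only an illustration: the function name `my-emacs-syntax-demo` and the sample text are made up, and `\d` is spelled `[0-9]` because Emacs regexps have no `\d`.

```elisp
(defun my-emacs-syntax-demo (text)
  "Return (POSITION DIGITS) for the first __X: NNN item in TEXT.
POSITION and DIGITS are nil when nothing matches."
  (list
   ;; Hand-written Emacs equivalent of the PCRE pattern __\w: \d+
   (string-match "__\\w: [0-9]+" text)
   ;; In Emacs a bare "(" is a literal parenthesis; a capturing group
   ;; is written \( ... \), which becomes "\\(" ... "\\)" in a string.
   (and (string-match "__\\w: \\([0-9]+\\)" text)
        (match-string 1 text))))

(my-emacs-syntax-demo "__a: 42")   ; => (0 "42")
```

The doubled backslashes come from Lisp string syntax, which is a large part of why the PCRE spelling feels lighter.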

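Does anyone know of an elisp library that can do this?

Edit: Selecting a response from the answers below...

Wow, I love coming back from 4 days of vacation to find a slew of interesting answers to sort through! I love the work that went into both types of solution.

In the end, it looks like both the exec-a-script and the straight elisp versions of the solution would work, but from a pure speed and "correctness" standpoint, the elisp version is certainly the one that people would prefer (myself included).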

Accepted answer

https://github.com/joddie/pcre2el is the up-to-date version of this answer.

pcre2el or rxt (RegeXp Translator or RegeXp Tools) is a utility for working with regular expressions in Emacs, based on a recursive-descent parser for regexp syntax. In addition to converting (a subset of) PCRE syntax into its Emacs equivalent, it can do the following:

  • convert Emacs syntax to PCRE
  • convert either syntax to rx, an S-expression based regexp syntax
  • untangle complex regexps by showing the parse tree in rx form and highlighting the corresponding chunks of code
  • show the complete list of strings (productions) matching a regexp, provided the list is finite
  • provide live font-locking of regexp syntax (so far only for Elisp buffers – other modes on the TODO list)
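
As a rough illustration of the rx form mentioned in the list, the sketch below uses Emacs's built-in `rx` macro (not pcre2el itself) to write the pattern from the question as an S-expression. The function name `my-rx-demo` and the sample strings are made up for the example.

```elisp
(require 'rx)

(defun my-rx-demo (text)
  "Return the match position of a hand-written rx pattern in TEXT, or nil.
The pattern corresponds to the PCRE __\\w: \\d+ from the question."
  (string-match-p (rx "__" wordchar ": " (one-or-more digit)) text))

(my-rx-demo "__a: 42")    ; matches at position 0
(my-rx-demo "no match")   ; => nil
```

The rx form needs no backslashes at all, which is why it is a convenient target for untangling a complicated regexp.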


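The text of the original answer follows...

Here's a quick and ugly Emacs Lisp solution (EDIT: now located more permanently here). It is based mostly on the description in the pcrepattern man page, and works token by token, converting only the following constructions:
  • parenthesis grouping ( .. )
  • alternation |
  • numerical repeats {M,N}
  • string quoting \Q .. \E
  • simple character escapes: \a, \c, \e, \f, \n, \r, \t, \x, and \ + octal digits
  • character classes: \d, \D, \h, \H, \s, \S, \v, \V
  • \w and \W left as they are (using Emacs' own idea of word and non-word characters)

It doesn't do anything with more complicated PCRE assertions, but it does try to convert escapes inside character classes. In the case of character classes including something like \D, this is done by converting into a non-capturing group with alternation.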
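
To make the character-class trick concrete, here is a small hand-translated sketch. It does not call the converter; the function name `my-class-demo` is made up, and the Emacs regexp is written out by hand to show the shape of the result.

```elisp
(defun my-class-demo (char)
  "Match the one-character string CHAR against a hand translation of PCRE [a\\D].
Return 0 on a match and nil otherwise."
  ;; PCRE [a\D] means "a, or anything that is not a digit".  Emacs has
  ;; no \D inside brackets, so the class is split into alternatives
  ;; wrapped in a non-capturing ("shy") group, \(?: ... \).
  (string-match-p "\\`\\(?:[a]\\|[^0-9]\\)\\'" char))

(my-class-demo "a")   ; => 0
(my-class-demo "x")   ; => 0
(my-class-demo "7")   ; => nil
```

Using a shy group rather than a plain `\( ... \)` keeps the numbering of the pattern's real capture groups unchanged.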

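It passes the tests I wrote for it, but there are certainly bugs, and the method of scanning token by token is probably slow. In other words, no warranty. But perhaps it will do the simpler part of the job for some purposes. Interested parties are invited to improve it ;-)
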
    (eval-when-compile (require 'cl))
    
    (defvar pcre-horizontal-whitespace-chars
      (mapconcat 'char-to-string
                 '(#x0009 #x0020 #x00A0 #x1680 #x180E #x2000 #x2001 #x2002 #x2003
                          #x2004 #x2005 #x2006 #x2007 #x2008 #x2009 #x200A #x202F
                          #x205F #x3000)
                 ""))
    
    (defvar pcre-vertical-whitespace-chars
      (mapconcat 'char-to-string
                 '(#x000A #x000B #x000C #x000D #x0085 #x2028 #x2029) ""))
    
    (defvar pcre-whitespace-chars
      (mapconcat 'char-to-string '(9 10 12 13 32) ""))
    
    (defvar pcre-horizontal-whitespace
      (concat "[" pcre-horizontal-whitespace-chars "]"))
    
    (defvar pcre-non-horizontal-whitespace
      (concat "[^" pcre-horizontal-whitespace-chars "]"))
    
    (defvar pcre-vertical-whitespace
      (concat "[" pcre-vertical-whitespace-chars "]"))
    
    (defvar pcre-non-vertical-whitespace
      (concat "[^" pcre-vertical-whitespace-chars "]"))
    
    (defvar pcre-whitespace (concat "[" pcre-whitespace-chars "]"))
    
    (defvar pcre-non-whitespace (concat "[^" pcre-whitespace-chars "]"))
    
    (eval-when-compile
      (defmacro pcre-token-case (&rest cases)
        "Consume a token at point and evaluate corresponding forms.
    
    CASES is a list of `cond'-like clauses, (REGEXP FORMS
    ...). Considering CASES in order, if the text at point matches
    REGEXP then moves point over the matched string and returns the
    value of FORMS. Returns `nil' if none of the CASES matches."
        (declare (debug (&rest (sexp &rest form))))
        `(cond
          ,@(mapcar
             (lambda (case)
               (let ((token (car case))
                     (action (cdr case)))
                 `((looking-at ,token)
                   (goto-char (match-end 0))
                   ,@action)))
             cases)
          (t nil))))
    
    (defun pcre-to-elisp (pcre)
      "Convert PCRE, a regexp in PCRE notation, into Elisp string form."
      (with-temp-buffer
        (insert pcre)
        (goto-char (point-min))
        (let ((capture-count 0) (accum '())
              (case-fold-search nil))
          (while (not (eobp))
            (let ((translated
                   (or
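                    ;; Handle tokens that are treated the same in
                    ;; character classes.  They stand for literal
                    ;; characters, so quote any that are special in
                    ;; Emacs regexps (e.g. \x2a, which is "*").
                    (let ((literal (pcre-re-or-class-token-to-elisp)))
                      (and literal (regexp-quote literal)))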
    
                    ;; Other tokens
                    (pcre-token-case
                     ("|" "\\|")
                     ("(" (setq capture-count (1+ capture-count)) "\\(")
                     (")" "\\)")
                     ("{" "\\{")
                     ("}" "\\}")
    
                     ;; Character class
                     ("\\[" (pcre-char-class-to-elisp))
    
                     ;; Backslash + digits => backreference or octal char?
                     ("\\\\\\([0-9]+\\)"
                      (let* ((digits (match-string 1))
                             (dec (string-to-number digits)))
                        ;; from "man pcrepattern": If the number is
                        ;; less than 10, or if there have been at
                        ;; least that many previous capturing left
                        ;; parentheses in the expression, the entire
                        ;; sequence is taken as a back reference.   
                        (cond ((< dec 10) (concat "\\" digits))
                              ((>= capture-count dec)
                               (error "backreference \\%s can't be used in Emacs regexps"
                                      digits))
                              (t
                               ;; from "man pcrepattern": if the
                               ;; decimal number is greater than 9 and
                               ;; there have not been that many
                               ;; capturing subpatterns, PCRE re-reads
                               ;; up to three octal digits following
                               ;; the backslash, and uses them to
                               ;; generate a data character. Any
                               ;; subsequent digits stand for
                               ;; themselves.
                               (goto-char (match-beginning 1))
                               (re-search-forward "[0-7]\\{0,3\\}")
                               ;; Quote the result: an octal escape
                               ;; always denotes a literal character.
                               (regexp-quote (char-to-string (string-to-number (match-string 0) 8)))))))
    
                     ;; Regexp quoting.
                     ("\\\\Q"
                      (let ((beginning (point)))
                        (search-forward "\\E")
                        (regexp-quote (buffer-substring beginning (match-beginning 0)))))
    
                     ;; Various character classes
                     ("\\\\d" "[0-9]")
                     ("\\\\D" "[^0-9]")
                     ("\\\\h" pcre-horizontal-whitespace)
                     ("\\\\H" pcre-non-horizontal-whitespace)
                     ("\\\\s" pcre-whitespace)
                     ("\\\\S" pcre-non-whitespace)
                     ("\\\\v" pcre-vertical-whitespace)
                     ("\\\\V" pcre-non-vertical-whitespace)
    
                     ;; Use Emacs' native notion of word characters
                     ("\\\\[Ww]" (match-string 0))
    
                     ;; Any other escaped character
                     ("\\\\\\(.\\)" (regexp-quote (match-string 1)))
    
                     ;; Any normal character.  "." alone does not match a
                     ;; newline, which would leave the loop stuck on it.
                     (".\\|\n" (match-string 0))))))
              (push translated accum)))
          (apply 'concat (reverse accum)))))
    
    (defun pcre-re-or-class-token-to-elisp ()
      "Consume the PCRE token at point and return its Elisp equivalent.
    
    Handles only tokens which have the same meaning in character
    classes as outside them."
      (pcre-token-case
       ("\\\\a" (char-to-string #x07))  ; bell
       ("\\\\c\\(.\\)"                  ; control character
        ;; PCRE upcases the character and inverts bit 6 (hex 40), so
        ;; \cA is 1 and \c? is 127; subtracting 64 fails for the latter.
        (char-to-string
         (logxor (string-to-char (upcase (match-string 1))) 64)))
       ("\\\\e" (char-to-string #x1b))  ; escape
       ("\\\\f" (char-to-string #x0c))  ; formfeed
       ("\\\\n" (char-to-string #x0a))  ; linefeed
       ("\\\\r" (char-to-string #x0d))  ; carriage return
       ("\\\\t" (char-to-string #x09))  ; tab
       ("\\\\x\\([0-9A-Fa-f]\\{2\\}\\)" ; \xhh, two hex digits
        (char-to-string (string-to-number (match-string 1) 16)))
       ("\\\\x{\\([0-9A-Fa-f]*\\)}"     ; \x{hhh..}
        (char-to-string (string-to-number (match-string 1) 16)))))
    
    (defun pcre-char-class-to-elisp ()
      "Consume the remaining PCRE character class at point and return its Elisp equivalent.
    
    Point should be after the opening \"[\" when this is called, and
    will be just after the closing \"]\" when it returns."
      (let ((accum '("["))
            (pcre-char-class-alternatives '())
            (negated nil))
        (when (looking-at "\\^")
          (setq negated t)
          (push "^" accum)
          (forward-char))
        (when (looking-at "\\]") (push "]" accum) (forward-char))
    
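        (while (not (looking-at "\\]"))
          ;; Without this check an unterminated class would loop forever.
          (when (eobp) (error "Unterminated character class in PCRE pattern"))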
          (let ((translated
                 (or
                  (pcre-re-or-class-token-to-elisp)
                  (pcre-token-case              
                   ;; Backslash + digits => always an octal char
                   ("\\\\\\([0-7]\\{1,3\\}\\)"    
                    (char-to-string (string-to-number (match-string 1) 8)))
    
                   ;; Various character classes. To implement negative char classes,
                   ;; we cons them onto the list `pcre-char-class-alternatives' and
                   ;; transform the char class into a shy group with alternation
                   ("\\\\d" "0-9")
                   ("\\\\D" (push (if negated "[0-9]" "[^0-9]")
                                  pcre-char-class-alternatives) "")
                   ("\\\\h" pcre-horizontal-whitespace-chars)
                   ("\\\\H" (push (if negated
                                      pcre-horizontal-whitespace
                                    pcre-non-horizontal-whitespace)
                                  pcre-char-class-alternatives) "")
                   ("\\\\s" pcre-whitespace-chars)
                   ("\\\\S" (push (if negated
                                      pcre-whitespace
                                    pcre-non-whitespace)
                                  pcre-char-class-alternatives) "")
                   ("\\\\v" pcre-vertical-whitespace-chars)
                   ("\\\\V" (push (if negated
                                      pcre-vertical-whitespace
                                    pcre-non-vertical-whitespace)
                                  pcre-char-class-alternatives) "")
                   ("\\\\w" (push (if negated "\\W" "\\w") 
                                  pcre-char-class-alternatives) "")
                   ("\\\\W" (push (if negated "\\w" "\\W") 
                                  pcre-char-class-alternatives) "")
    
                   ;; Leave POSIX syntax unchanged
                   ("\\[:[a-z]*:\\]" (match-string 0))
    
                   ;; Ignore other escapes
                   ("\\\\\\(.\\)" (match-string 0))
    
                   ;; Copy everything else (including a literal newline,
                   ;; which "." alone would not match)
                   (".\\|\n" (match-string 0))))))
            (push translated accum)))
        (push "]" accum)
        (forward-char)
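        (let ((class
               (apply 'concat (reverse accum))))
          ;; A class consisting only of escapes such as \D leaves empty
          ;; brackets behind.  Drop them, so that no empty alternative
          ;; (which would match the empty string) ends up in the group.
          (when (or (equal class "[]")
                    (equal class "[^]"))
            (setq class nil))
          (if (not pcre-char-class-alternatives)
              (or class "")
            (concat "\\(?:"
                    (mapconcat 'identity
                               (if class
                                   (cons class pcre-char-class-alternatives)
                                 pcre-char-class-alternatives)
                               "\\|")
                    "\\)")))))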
    
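
A usage sketch, assuming the definitions above have been evaluated in the current session. The wrapper `my-pcre-search-demo` and the sample text are made up for illustration; the `fboundp` guard only keeps the snippet harmless when `pcre-to-elisp` is not loaded.

```elisp
(defun my-pcre-search-demo ()
  "Search made-up sample text with a translated PCRE pattern.
Return the matched text, or nil when `pcre-to-elisp' is not loaded."
  (when (fboundp 'pcre-to-elisp)
    (with-temp-buffer
      (insert "id=123 or xx")
      (goto-char (point-min))
      ;; The PCRE (\d+)|x{2,3} should come out as \([0-9]+\)\|x\{2,3\}.
      ;; Note `re-search-forward': plain `search-forward' would look
      ;; for the pattern text literally.
      (when (re-search-forward (pcre-to-elisp "(\\d+)|x{2,3}") nil t)
        (match-string 0)))))
```

With the converter loaded, the first match in the sample text should be the digit run "123".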

For more on "regex - Elisp mechanism for converting PCRE regexps to Emacs regexps", see the corresponding question on Stack Overflow: https://stackoverflow.com/questions/9118183/
