python - Remove Roman numerals

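I have a text that contains Roman numerals such as I, II, and so on, sometimes followed by a dot (I.) and sometimes not (I). I want to remove them using regular expressions in Python. I can define the following function, but it looks very basic and takes a lot of time. Is there another way to remove them?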

def clean(text):
  text = text.replace("Ι.", '&')
  text = text.replace("II.", '&')
  text = text.replace("III.", '&')
  text = text.replace("IV.", '&')
  text = text.replace("V.", '&')
  text = text.replace("VI.", '&')
  text = text.replace("VII.", '&')
  text = text.replace("VIII.", '&')
  text = text.replace("IX.", '&')
  text = text.replace("X.", '&')
  text = text.replace("XI.", '&')
  text = text.replace("XII.", '&')
  text = text.replace("XIII.", '&')
  text = text.replace("XIV.", '&')

  return text

Best answer

Use

import re

def clean(text):
    # Replace each Roman numeral (with an optional trailing dot) by '&'.
    pattern = r"\b(?=[MDCLXVIΙ])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})([IΙ]X|[IΙ]V|V?[IΙ]{0,3})\b\.?"
    return re.sub(pattern, '&', text)

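See the regex proof. If necessary, add more non-standard letters, such as Ι (the Greek capital iota, U+0399, which the pattern already accepts alongside the Latin I).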
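One thing worth checking before relying on this pattern: every part after the look-ahead is optional, so at the start of an ordinary word that merely begins with one of these letters the regex can match the empty string, and `re.sub` then inserts a stray '&'. The sketch below reproduces that and shows one possible stricter variant. The names `ORIGINAL`, `STRICT`, and `clean_strict` are hypothetical, the added look-behind is an assumption rather than part of the answer, and the Greek iota is written as `\u0399` to avoid look-alike characters in the source.

```python
import re

# The answer's pattern, with the Greek capital iota spelled as \u0399.
ORIGINAL = re.compile(
    r"\b(?=[MDCLXVI\u0399])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})"
    r"([I\u0399]X|[I\u0399]V|V?[I\u0399]{0,3})\b\.?"
)

# Hypothetical stricter variant: the look-behind requires that the character
# just before the closing \b is a numeral letter, so the match cannot be empty.
# Groups are non-capturing because their contents are never used.
STRICT = re.compile(
    r"\b(?=[MDCLXVI\u0399])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})"
    r"(?:[I\u0399]X|[I\u0399]V|V?[I\u0399]{0,3})(?<=[MDCLXVI\u0399])\b\.?"
)

def clean_strict(text):
    """Replace Roman numerals (with an optional trailing dot) by '&'."""
    return STRICT.sub("&", text)

print(ORIGINAL.sub("&", "Dog II."))  # &Dog &
print(clean_strict("Dog II."))       # Dog &
```

Neither version can tell the pronoun "I" or a word such as "MIX" from a numeral, so text containing those would need extra context checks.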

Explanation

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    [MDCLXVIΙ]               any character of: 'M', 'D', 'C', 'L',
                             'X', 'V', 'I', 'Ι' (U+0399, Greek
                             capital iota)
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  M{0,4}                   'M' (between 0 and 4 times (matching the
                           most amount possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    CM                       'CM'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    CD                       'CD'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    D?                       'D' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    C{0,3}                   'C' (between 0 and 3 times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    XC                       'XC'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    XL                       'XL'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    L?                       'L' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    X{0,3}                   'X' (between 0 and 3 times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    [IΙ]                     any character of: 'I', 'Ι' (U+0399)
--------------------------------------------------------------------------------
    X                        'X'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    [IΙ]                     any character of: 'I', 'Ι' (U+0399)
--------------------------------------------------------------------------------
    V                        'V'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    V?                       'V' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    [IΙ]{0,3}                any character of: 'I', 'Ι' (U+0399)
                             (between 0 and 3 times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  \.?                      '.' (optional (matching the most amount
                           possible))
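The same breakdown can be kept next to the code by compiling the pattern with `re.VERBOSE`, which ignores whitespace and allows `#` comments inside the pattern. This is only a sketch of the answer's regex in that layout: the names `ROMAN_VERBOSE` and `clean_verbose` are hypothetical, and the Greek iota is written as `\u0399`.

```python
import re

ROMAN_VERBOSE = re.compile(
    r"""
    \b                        # word boundary before the numeral
    (?=[MDCLXVI\u0399])       # next character must be a numeral letter
    M{0,4}                    # thousands: up to four M
    (CM|CD|D?C{0,3})          # hundreds: 900, 400, or optional D plus up to CCC
    (XC|XL|L?X{0,3})          # tens: 90, 40, or optional L plus up to XXX
    ([I\u0399]X|[I\u0399]V|V?[I\u0399]{0,3})  # units: 9, 4, or optional V plus up to III
    \b                        # word boundary after the numeral
    \.?                       # optional trailing dot
    """,
    re.VERBOSE,
)

def clean_verbose(text):
    """Same substitution as the answer's clean(), using the commented pattern."""
    return ROMAN_VERBOSE.sub("&", text)

print(clean_verbose("see XIV. and IX"))  # see & and &
```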

Regarding "python - Remove Roman numerals", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/68048675/
