python - Removing Roman numerals

Tags: python regex

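I have a text that contains Roman numerals such as I, II, and so on, sometimes followed by a dot (I.) and sometimes not (I). I want to remove them with a regular expression in Python. I can define the following function, but it looks very basic and takes a lot of time. Is there another way to remove them?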

def clean(text):
  text = text.replace("Ι.", '&')
  text = text.replace("II.", '&')
  text = text.replace("III.", '&')
  text = text.replace("IV.", '&')
  text = text.replace("V.", '&')
  text = text.replace("VI.", '&')
  text = text.replace("VII.", '&')
  text = text.replace("VIII.", '&')
  text = text.replace("IX.", '&')
  text = text.replace("X.", '&')
  text = text.replace("XI.", '&')
  text = text.replace("XII.", '&')
  text = text.replace("XIII", '&')
  text = text.replace("XIV.", '&')

  return text

Best answer

Use

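import re

def clean(text):
    # The second letter in [IΙ] and the last in [MDCLXVIΙ] is the Greek capital iota (U+0399).
    pattern = r"\b(?=[MDCLXVIΙ])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})([IΙ]X|[IΙ]V|V?[IΙ]{0,3})\b\.?"
    return re.sub(pattern, '&', text)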

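See the regex proof. If necessary, add more non-standard letters, like the Greek capital iota Ι (U+0399) that already sits next to the Latin I in the character classes.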
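
One caveat worth checking before relying on the pattern above: every numeral group in it is optional, so it can also match the empty string at the start of a word that merely begins with a Roman letter, and `re.sub` then inserts the replacement there (for example, `Data` becomes `&Data`). The sketch below is one possible guard, not part of the original answer. It adds a lookbehind requiring that the character just before the closing `\b` is a Roman letter, which rules out zero-length matches. The names `ROMAN_NONEMPTY` and `clean_nonempty` are hypothetical, the groups are made non-capturing, and the Greek iota is written as `\u0399` so it is visible in the source.

```python
import re

# Hypothetical variant of the answer's pattern (an assumption, not the original).
# \u0399 is the Greek capital iota that the answer places next to the Latin I.
ROMAN_NONEMPTY = re.compile(
    r"\b(?=[MDCLXVI\u0399])"                       # a word starting with a Roman letter
    r"M{0,4}"                                      # thousands
    r"(?:CM|CD|D?C{0,3})"                          # hundreds
    r"(?:XC|XL|L?X{0,3})"                          # tens
    r"(?:[I\u0399]X|[I\u0399]V|V?[I\u0399]{0,3})"  # units
    r"(?<=[MDCLXVI\u0399])"                        # guard: at least one letter was consumed
    r"\b\.?"                                       # end of word, optional trailing dot
)


def clean_nonempty(text):
    """Replace each Roman numeral (with optional trailing dot) by '&'."""
    return ROMAN_NONEMPTY.sub("&", text)


print(clean_nonempty("Chapter IV. Data and XII items, see I"))
# Chapter & Data and & items, see &
```

Genuine one-letter words such as the pronoun `I`, and words that happen to be valid numerals such as `MIX`, are still replaced; telling those apart needs context that a regex alone does not have.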

Explanation

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    [MDCLXVIΙ]               any character of: 'M', 'D', 'C', 'L',
                             'X', 'V', 'I', 'Ι' (Greek capital iota)
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  M{0,4}                   'M' (between 0 and 4 times (matching the
                           most amount possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    CM                       'CM'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    CD                       'CD'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    D?                       'D' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    C{0,3}                   'C' (between 0 and 3 times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    XC                       'XC'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    XL                       'XL'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    L?                       'L' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    X{0,3}                   'X' (between 0 and 3 times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    [IΙ]                     any character of: 'I', 'Ι' (Greek
                             capital iota)
--------------------------------------------------------------------------------
    X                        'X'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    [IΙ]                     any character of: 'I', 'Ι' (Greek
                             capital iota)
--------------------------------------------------------------------------------
    V                        'V'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    V?                       'V' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    [IΙ]{0,3}                any character of: 'I', 'Ι' (Greek
                             capital iota) (between 0 and 3 times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  \.?                      '.' (optional (matching the most amount
                           possible))
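
The same breakdown can live next to the pattern itself. As a sketch, assuming the hypothetical names `ROMAN_VERBOSE` and `clean_verbose`, the answer's pattern is rewritten here with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. The Greek capital iota is spelled `\u0399` to make it distinguishable from the Latin `I`.

```python
import re

# The answer's pattern, laid out with re.VERBOSE so each part carries its comment.
ROMAN_VERBOSE = re.compile(
    r"""
    \b                           # boundary between a word char and a non-word char
    (?=[MDCLXVI\u0399])          # look ahead: next char is a Roman letter (or Greek iota)
    M{0,4}                       # thousands: 0 to 4 times 'M'
    (CM|CD|D?C{0,3})             # group 1, hundreds
    (XC|XL|L?X{0,3})             # group 2, tens
    ([I\u0399]X|[I\u0399]V|V?[I\u0399]{0,3})   # group 3, units
    \b                           # boundary at the end of the numeral
    \.?                          # optional trailing dot
    """,
    re.VERBOSE,
)


def clean_verbose(text):
    """Same substitution as the answer's clean(), using the commented pattern."""
    return ROMAN_VERBOSE.sub("&", text)


print(clean_verbose("see XIV. and VII"))
# see & and &
```

Behavior is meant to be identical to the one-line pattern; only the layout changes, so any later edit to a group can be checked against its comment.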

Regarding python - Removing Roman numerals, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68048675/
