python - 在python中，如何使用正则表达式提取字符串？

我想编写一个简单的 Markdown 解析器函数，它将接受单行 Markdown 并转换为适当的 HTML。为了简单起见，我只想支持 atx 语法中 markdown 的一项功能: header 。

header 由 (1-6) 个哈希值指定，后跟空格，然后是文本。哈希值的数量决定了 HTML 输出的 header 级别。示例

# Header will become <h1>Header</h1>

## Header will become <h2>Header</h2>

###### Header will become <h6>Header</h6>

规则如下

header 内容只能出现在初始主题标签加上空格字符之后。

无效的 header 应该作为收到的 Markdown 返回，无需翻译。

在结果输出中应忽略标题内容和主题标签前后的空格。

这是我编写的代码。

import re
def markdown_parser(markdown):
    results =''
    pattern = re.compile("#+\s")
    matches = pattern.search(markdown.strip())
    if (matches != None):
        tag = matches[0]
        hashTagLen = len(tag) - 1
        htmlTag = "h" + str(hashTagLen)
        content = markdown.strip()[(hashTagLen + 1):]
        results = "<" + htmlTag + ">" + content + "</" + htmlTag + ">"
    else:
        results = markdown
    return results

当我运行此代码时，发生了如下异常。

Unhandled Exception: '_sre.SRE_Match' object is not subscriptable

我不确定为什么会发生此错误。

当我在 shell 上运行该脚本时，它运行良好。但是当我在unittest环境中运行它(导入unittest)时，出现了错误。

请帮助我。

最佳答案

该代码看起来非常冗长，并且很多逻辑可以在正则表达式中执行。

如果你看一下用perl编写的原始markdown库，你会发现只需要一个模式，然后，从第一个捕获组，你就可以知道它是什么样式的标题。

The original implementation is here

sub _DoHeaders {
my $text = shift;

# Setext-style headers:
#     Header 1
#     ========
#  
#     Header 2
#     --------
#
$text =~ s{ ^(.+)[ \t]*\n=+[ \t]*\n+ }{
    "<h1>"  .  _RunSpanGamut($1)  .  "</h1>\n\n";
}egmx;

$text =~ s{ ^(.+)[ \t]*\n-+[ \t]*\n+ }{
    "<h2>"  .  _RunSpanGamut($1)  .  "</h2>\n\n";
}egmx;


# atx-style headers:
#   # Header 1
#   ## Header 2
#   ## Header 2 with closing hashes ##
#   ...
#   ###### Header 6
#
$text =~ s{
        ^(\#{1,6})  # $1 = string of #'s
        [ \t]*
        (.+?)       # $2 = Header text
        [ \t]*
        \#*         # optional closing #'s (not counted)
        \n+
    }{
        my $h_level = length($1);
        "<h$h_level>"  .  _RunSpanGamut($2)  .  "</h$h_level>\n\n";
    }egmx;

return $text;

}

除非由于某种原因你不能，否则最好使用 markdown 库，因为它是原始库的实现，有很多缺点。

你可以看看 Markdown-Python 库是如何实现的 here

class HashHeaderProcessor(BlockProcessor):
""" Process Hash Headers. """

# Detect a header at start of any line in block
RE = re.compile(r'(^|\n)(?P<level>#{1,6})(?P<header>.*?)#*(\n|$)')

def test(self, parent, block):
    return bool(self.RE.search(block))

def run(self, parent, blocks):
    block = blocks.pop(0)
    m = self.RE.search(block)
    if m:
        before = block[:m.start()]  # All lines before header
        after = block[m.end():]     # All lines after header
        if before:
            # As the header was not the first line of the block and the
            # lines before the header must be parsed first,
            # recursively parse this lines as a block.
            self.parser.parseBlocks(parent, [before])
        # Create header using named groups from RE
        h = util.etree.SubElement(parent, 'h%d' % len(m.group('level')))
        h.text = m.group('header').strip()
        if after:
            # Insert remaining lines as first block for future parsing.
            blocks.insert(0, after)
    else:  # pragma: no cover
        # This should never happen, but just in case...
        logger.warn("We've got a problem header: %r" % block)

关于python - 在python中，如何使用正则表达式提取字符串？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/44930029/

python - 在python中，如何使用正则表达式提取字符串？

上一篇：python - statsmodels.formula.api 导入错误 ('cannot import name ' TimeSeries''。)

下一篇：python - 继承setUp方法Python Unittest