python - I want to capture the text that appears between two regex matches

Tags: python regex

For example, this is my string (it is text extracted from HTML):

html_text = """
TABLE OF CONTENTS

PART I  
| ITEM 1. BUSINESS  
| ITEM 1A. RISK FACTORS  
| ITEM 1B. UNRESOLVED CONFLICTS  
| ITEM 2. PROPERTIES  
| ITEM 3. LEGAL PROCEEDINGS  

    We believe that relations with our employees are good; however, the competition
    for such personnel is intense, and the loss of key personnel could have a
    material adverse impact on our results of operations and financial condition.

    ITEM  1A. |  RISK FACTORS  

    Set forth below and elsewhere in this report and in other documents we file
    with the SEC are descriptions of the risks and uncertainties that could cause
    our actual results to differ materially from the results contemplated by the
    forward-looking statements contained in this report.

    ITEM 1B. UNRESOLVED CONFLICTS

    Our future revenue, gross margins, operating results and net income are
    difficult to predict and may materially"""

I wrote a regex to capture "ITEM 1A. RISK FACTORS" (the section heading, not the entry in the table of contents):

re.search(r"(ITEM.*1A)*.+(RISK FACTORS).*\n+(?!\w)(?!.*ITEM.*1B)", html_text)

and I need another regex to capture "ITEM 1B. UNRESOLVED CONFLICTS" (again, not the entry in the table of contents):

re.search(still trying to figure this out)

I want to capture all of the text that appears between these two matches. The final string should look like this:

final_text = """    ITEM  1A. |  RISK FACTORS  

    Set forth below and elsewhere in this report and in other documents we file
    with the SEC are descriptions of the risks and uncertainties that could cause
    our actual results to differ materially from the results contemplated by the
    forward-looking statements contained in this report."""

Best answer

This might work for you:

re.compile(r"^(    ITEM  1A\. \|  RISK FACTORS.+\n(?:\n.+)+)", re.MULTILINE)

You can see it on Regex101, but note that it behaves differently there because there is no re.compile(REGEXP, REGEXPOPTION) setup.
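The pattern above is tied to the exact spacing of the heading. As a rough alternative, here is a minimal sketch that finds both headings with separate patterns and slices the text between them. It assumes the real section headings start with optional whitespace followed by "ITEM", while the table-of-contents rows start with "|". The names START, END, and text_between are made up for this illustration, and the sample text is a shortened stand-in for the question's html_text.

```python
import re

# Shortened stand-in for the question's html_text (same layout, fewer lines).
html_text = (
    "TABLE OF CONTENTS\n"
    "\n"
    "PART I  \n"
    "| ITEM 1A. RISK FACTORS  \n"
    "| ITEM 1B. UNRESOLVED CONFLICTS  \n"
    "\n"
    "    material adverse impact on our results of operations.\n"
    "\n"
    "    ITEM  1A. |  RISK FACTORS  \n"
    "\n"
    "    Set forth below are descriptions of the risks and uncertainties\n"
    "    that could cause our actual results to differ materially.\n"
    "\n"
    "    ITEM 1B. UNRESOLVED CONFLICTS\n"
    "\n"
    "    Our future revenue is difficult to predict.\n"
)

# Assumption: body headings begin with optional spaces/tabs and then "ITEM".
# Table-of-contents rows begin with "|", so "^[ \t]*ITEM" cannot match them.
START = re.compile(r"^[ \t]*ITEM\s+1A\.[\s|]*RISK FACTORS", re.MULTILINE)
END = re.compile(r"^[ \t]*ITEM\s+1B\.[\s|]*UNRESOLVED CONFLICTS", re.MULTILINE)


def text_between(text, start_pattern, end_pattern):
    """Return the text from the start heading up to, but not including,
    the end heading; return None if either heading is missing."""
    start = start_pattern.search(text)
    if start is None:
        return None
    # Only look for the end heading after the start heading.
    end = end_pattern.search(text, start.end())
    if end is None:
        return None
    # Drop the blank lines that separate the section from the next heading.
    return text[start.start():end.start()].rstrip()


final_text = text_between(html_text, START, END)
print(final_text)
```

Using two separate patterns keeps each one short, and the slice includes the starting heading itself, which matches the final_text shown in the question. If the real document formats its headings differently, the two patterns would need adjusting.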

Regarding "python - I want to capture the text that appears between two regex matches", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56672873/
