For example, this is my string (it is text extracted from HTML):
html_text = """
TABLE OF CONTENTS
PART I
| ITEM 1. BUSINESS
| ITEM 1A. RISK FACTORS
| ITEM 1B. UNRESOLVED CONFLICTS
| ITEM 2. PROPERTIES
| ITEM 3. LEGAL PROCEEDINGS
We believe that relations with our employees are good; however, the competition
for such personnel is intense, and the loss of key personnel could have a
material adverse impact on our results of operations and financial condition.
ITEM 1A. | RISK FACTORS
Set forth below and elsewhere in this report and in other documents we file
with the SEC are descriptions of the risks and uncertainties that could cause
our actual results to differ materially from the results contemplated by the
forward-looking statements contained in this report.
ITEM 1B. UNRESOLVED CONFLICTS
Our future revenue, gross margins, operating results and net income are
difficult to predict and may materially"""
I wrote a regular expression to capture "ITEM 1A. RISK FACTORS" (the section heading, not the entry in the table of contents):
re.search(r"(ITEM.*1A)*.+(RISK FACTORS).*\n+(?!\w)(?!.*ITEM.*1B)", html_text)
and I need another regular expression to capture "ITEM 1B. UNRESOLVED CONFLICTS" (again, not the entry in the table of contents):
re.search(still trying to figure this out)
I want to capture all the text that occurs between these two matches. The final text string should look like this:
final_text = """ ITEM 1A. | RISK FACTORS
Set forth below and elsewhere in this report and in other documents we file
with the SEC are descriptions of the risks and uncertainties that could cause
our actual results to differ materially from the results contemplated by the
forward-looking statements contained in this report."""
Best answer
This may work for you:
re.compile(r"^( ITEM 1A. \| RISK FACTORS.+\n(?:\n.+)+)", re.MULTILINE)
You can try it on Regex101, but note that it behaves a little differently there,
because the flags are not set through re.compile(REGEXP, REGEXPOPTION).
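As a complementary sketch (not the answerer's code), the pattern below captures the section without relying on blank lines or a leading space in the input. It assumes, based on the sample as shown in the question, that table-of-contents entries start with "| " while the real headings start at the beginning of a line. The names SECTION_RE and final_text, and the shortened sample string, are illustrative only.

```python
import re

# Shortened copy of the sample from the question (assumption: TOC lines start
# with "| ", body headings start at the beginning of a line).
html_text = """
TABLE OF CONTENTS
PART I
| ITEM 1. BUSINESS
| ITEM 1A. RISK FACTORS
| ITEM 1B. UNRESOLVED CONFLICTS
We believe that relations with our employees are good.
ITEM 1A. | RISK FACTORS
Set forth below and elsewhere in this report and in other documents we file
with the SEC are descriptions of the risks and uncertainties that could cause
our actual results to differ materially from the results contemplated by the
forward-looking statements contained in this report.
ITEM 1B. UNRESOLVED CONFLICTS
Our future revenue, gross margins, operating results and net income are
difficult to predict and may materially"""

# re.MULTILINE lets ^ match at the start of every line, so "^ITEM" skips the
# TOC entries that begin with "| ". re.DOTALL lets ".*?" run across newlines.
# The lookahead stops the lazy match just before the "ITEM 1B." heading line.
SECTION_RE = re.compile(
    r"^(ITEM\s+1A\.[^\n]*RISK FACTORS.*?)(?=^ITEM\s+1B\.)",
    re.MULTILINE | re.DOTALL,
)

match = SECTION_RE.search(html_text)
# Strip the trailing newline that precedes the "ITEM 1B." heading.
final_text = match.group(1).rstrip() if match else None
print(final_text)
```

If real filings format the headings differently (extra whitespace, lowercase text, or no "|" prefix in the table of contents), the anchors would need adjusting; for instance, re.IGNORECASE could be added to the flags.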
This discussion of "python - I want to capture the text occurring between two regex matches" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/56672873/