regex - Perl regex to extract a multi-line LaTeX chapter name

I'm having trouble working out how to do a regex substitution that cleans up some text in a LaTeX file. The LaTeX file looks like this:

\chapter{\texorpdfstring{{II} {The Chapter 
Title}}{II The Chapter Title}}

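Annoyingly, this is a multi-line chapter declaration, and the newline can appear almost anywhere. So I can't use the usual <> idiom of reading the file line by line and applying a straightforward regex.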

Instead, I am trying this:

#!/usr/bin/perl -i.old     # In-place edit, backup as '.old'
use strict;
use warnings;

use Path::Tiny;

my $filename = shift or die "Usage: $0 FILENAME";
my $content = path($filename)->slurp_utf8;

$content =~ s|\\chapter\{.*\{[IVXLCDM]*\s*(.*)\}\}|\\chapter{$1}|gms;
path($filename)->spew_utf8($content);

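However, the regex is too greedy: it starts matching at the first \chapter declaration and ends at the end of the last \chapter declaration. All I want is to

remove the \texorpdfstring
remove the Roman numeral
remove the repeated occurrence of the chapter title

so that running my substitution on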

\chapter{\texorpdfstring{{I} {The First 
Chapter}}{I The First Chapter}}

It was the best of times.

\chapter{\texorpdfstring{{II} {The Second 
Chapter}}{II The Second Chapter}}

It was the worst of times.

results in

\chapter{The First Chapter}

It was the best of times.

\chapter{The Second Chapter}

It was the worst of times.

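What can I do?

Edit: I changed the demo text.

If I understand @zdim correctly, he wrote the substitution without escaping the curly braces {} so that it is easier to verify. Fair enough. I tried @zdim's solution, but it outputs: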

\chapter{The First
Chapter}

It was the worst of times.

Best answer

If you can only have the {...} pairs as shown:

s/\\chapter{\\texorpdfstring{{ .*? }\s*{ (.*?) }}\s*{.*?}}/\\chapter{$1}/gsx;

or

s/(\\chapter){\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/${1}{$2}/gs;

Here ${1} (for $1) is required by the syntax, because $1{... would be interpreted as an element of the hash %1.
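The ${1} form can be tried out on a one-line declaration. This is a minimal sketch: the sample string and the variable names $text and $fixed are made up for illustration. The braces in the pattern are escaped as \{ and \} purely as a precaution, since some Perl versions warn about or reject an unescaped literal { in a pattern.

```perl
use strict;
use warnings;

# Illustrative input (not from the question's file).
my $text = '\chapter{\texorpdfstring{{I} {Intro}}{I Intro}}';

# ${1} is $1 followed by a literal "{"; writing $1{$2} instead would be
# read as a lookup in the hash %1.
(my $fixed = $text)
    =~ s/(\\chapter)\{\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/${1}{$2}/gs;

print "$fixed\n";
```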

or, better yet,

s/\\chapter\K{\s*\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/{$1}/gs

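Here \K, a form of lookbehind, drops everything matched before it from the match, so \chapter is left in place and does not need to be repeated in the replacement. I still match the { and retype it, which perhaps makes the replacement part clearer.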
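As a quick check, the \K version can be run against the two-chapter demo text from the question. This is a sketch with the braces escaped (a precaution, not part of the answer as written), and $tex and $count are illustrative names. The non-greedy .*? keeps each match inside its own declaration, so both chapters are rewritten separately; the line break inside each title is kept.

```perl
use strict;
use warnings;

# The question's demo text, built line by line.
my $tex = join "\n",
    '\chapter{\texorpdfstring{{I} {The First',
    'Chapter}}{I The First Chapter}}',
    '',
    'It was the best of times.',
    '',
    '\chapter{\texorpdfstring{{II} {The Second',
    'Chapter}}{II The Second Chapter}}',
    '',
    'It was the worst of times.',
    '';

# s///g returns the number of substitutions made.
my $count = ( $tex =~
    s/\\chapter\K\{\s*\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/{$1}/gs );

print "$count substitutions\n$tex";
```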

Add \s* wherever whitespace may appear.
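The substitutions above keep whatever line break sits inside the title, while the output wanted in the question has each title on one line. One possible way to handle both points is sketched below: \s* is allowed between all the braces, and the /e modifier passes the captured title through a small helper. The helper name one_line and the sample text are assumptions made for this sketch.

```perl
use strict;
use warnings;

my $tex = join "\n",
    '\chapter{\texorpdfstring{{II} {The Second',
    'Chapter}}{II The Second Chapter}}',
    '';

# Hypothetical helper: trim the title and squeeze whitespace runs
# (including newlines) down to single spaces.
sub one_line {
    my ($title) = @_;
    $title =~ s/^\s+|\s+$//g;
    $title =~ s/\s+/ /g;
    return $title;
}

$tex =~ s/
    \\chapter \K \s* \{ \s*       # keep the chapter command, consume its brace
    \\texorpdfstring \s* \{ \s*
    \{ .*? \} \s*                 # Roman numeral group, discarded
    \{ (.*?) \} \s*               # title group, captured
    \} \s*
    \{ .*? \} \s*                 # plain-text copy, discarded
    \}
/'{' . one_line($1) . '}'/gsex;

print $tex;
```

With /e the replacement is evaluated as Perl code, which is why the braces are concatenated as strings rather than typed directly.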

Also note Path::Tiny::edit_utf8:

path($filename)->edit_utf8( sub { s/.../.../gs } );  # regex as above

This applies the anonymous sub to the slurped file, with the whole content in $_, unlike edit_lines, which applies it line by line.
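A self-contained round trip with edit_utf8 might look like the sketch below. It assumes Path::Tiny is installed and uses a throwaway temporary file (via Path::Tiny->tempfile) in place of the real .tex file; the braces are escaped as a precaution, and $file and $result are illustrative names.

```perl
use strict;
use warnings;
use Path::Tiny;

# Assumption: a temp file stands in for the real LaTeX file.
my $file = Path::Tiny->tempfile;
$file->spew_utf8(
    "\\chapter{\\texorpdfstring{{I} {The First\nChapter}}{I The First Chapter}}\n"
);

# The callback sees the entire file in $_, so /s lets .*? cross the newline.
$file->edit_utf8( sub {
    s/\\chapter\K\{\s*\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/{$1}/gs;
} );

my $result = $file->slurp_utf8;
print $result;
```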

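If brace expressions can nest more freely (say, with {\em ... } and the like), a more systematic approach is needed. See, for example, Text::Balanced, and search for "nested delimiters".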
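As an alternative to Text::Balanced (a different technique from the one suggested above), Perl's recursive subpatterns, available since 5.10, can match balanced braces directly. The sketch below assumes the same three-group layout as in the question and a title that may itself contain nested {...} groups. The group name braced and the sample text are made up for illustration, and backslash-escaped braces in the LaTeX source are not handled.

```perl
use strict;
use warnings;

# Illustrative input with a nested group inside the title.
my $tex = '\chapter{\texorpdfstring{{II} {The {\em Second} Chapter}}{II The Second Chapter}}';

$tex =~ s/
    \\chapter \K \{ \s*
    \\texorpdfstring \s* \{ \s*
    (?&braced) \s*                            # numeral group, discarded
    \{ ( (?: [^{}]++ | (?&braced) )* ) \} \s* # title, nesting allowed, captured
    \} \s*
    (?&braced) \s*                            # plain-text copy, discarded
    \}
    (?(DEFINE)
        (?<braced> \{ (?: [^{}]++ | (?&braced) )* \} )
    )
/{$1}/gsx;

print "$tex\n";
```

The (?(DEFINE)...) block only declares the braced subpattern; it matches nothing itself, and (?&braced) calls it recursively wherever a balanced group is expected.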

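Some regex resources

Perl documentation

perlretut, a tutorial
perlrequick, a quick-start introduction
perlre, the full syntax reference
perlreref, a quick reference (its See Also section is useful in itself)

Stack Overflow

Regex info, a portal page with resources
Reference: What does this regex mean?, a huge FAQ list with links to SO posts
Learning Regular expressions, an overview with a long list of resources at the end

Regular-Expressions.info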

Regarding regex - Perl regex to extract a multi-line LaTeX chapter name, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48509953/
