c# - How do I split a string of bullet points (with titles and body text) into a multidimensional array?

Tags: c# .net regex string split

I have some text extracted from a PDF document. The document contains a bulleted list with content like this:

3 BILL REFERRED TO MAIL COMMITTEE
Mr Fitzgibbon (Chief Government Whip), by leave, moved—That the Tax Laws Amendment (2011 Measures No. 7) Bill 2011 be referred to the Main Committee for further consideration. Question—put and passed.
4 CORPORATIONS AMENDMENT (FUTURE OF FINANCIAL ADVICE) BILL 2011
Mr Shorten (Minister for Financial Services and Superannuation), pursuant to notice, presented a Bill for an Act to amend the law in relation to financial advice, and for related purposes. Document Mr Shorten presented an explanatory memorandum to the bill. Bill read a first time. Mr Shorten moved—That the bill be now read a second time. Debate adjourned (Mr Randall), and the resumption of the debate made an order of the day for the next sitting.
5 TAX LAWS AMENDMENT (2011 MEASURES NO. 8) BILL 2011
Mr Shorten (Minister for Financial Services and Superannuation) presented a Bill for an Act to amend the law relating to taxation, and for related purposes. Document

I need to split it up so that each bullet point ends up like this:

[0,0] = Title
[0,1] = Body
[1,0] = Title
[1,1] = Body

I have updated the example to include some real-world content.

Any help would be appreciated.
I am using C# on the .NET Framework.

Best answer

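You can use LINQ. The query below treats every line that starts with a digit as a title, so it assumes that no body line starts with a digit: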

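var result = input
    // Accept both Windows ("\r\n") and Unix ("\n") line endings.
    .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
    // Drop the blank lines between items.
    .Where(x => !string.IsNullOrWhiteSpace(x))
    // A line starting with a digit is a title and opens a new group.
    .GroupAdjacent((g, x) => !char.IsDigit(x[0]))
    .Select(g => new
    {
        Title = g.First().Trim(),
        Body = string.Join(" ", g.Skip(1).Select(x => x.Trim()))
    })
    .ToArray();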

Example:

string input = @"3 BILL REFERRED TO MAIL COMMITTEE
Mr Fitzgibbon (Chief Government Whip), by leave, moved—That the
Tax Laws Amendment (2011 Measures No. 7) Bill 2011 be referred
to the Main Committee for further consideration. Question—put
and passed.

4 CORPORATIONS AMENDMENT (FUTURE OF FINANCIAL ADVICE) BILL 2011
Mr Shorten (Minister for Financial Services and Superannuation),
pursuant to notice, presented a Bill for an Act to amend the law
in relation to financial advice, and for related purposes. Mr
Shorten presented an explanatory memorandum to the bill. Bill
read a first time. Mr Shorten moved—That the bill be now read
a second time. Debate adjourned (Mr Randall), and the resumption
of the debate made an order of the day for the next sitting.

5 TAX LAWS AMENDMENT (2011 MEASURES NO. 8) BILL 2011
Mr Shorten (Minister for Financial Services and Superannuation)
presented a Bill for an Act to amend the law relating to
taxation, and for related purposes.";

Output:

result[0] == { Title = "3 BILL REFERRED ...", Body = "Mr Fitzgibbon ..." }
result[1] == { Title = "4 CORPORATIONS ...",  Body = "Mr Shorten ..." }
result[2] == { Title = "5 TAX LAWS ...",      Body = "Mr Shorten ..." }
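The question asks for a `[row, column]` layout rather than an array of objects. A minimal sketch of that last step, assuming the title/body pairs are passed in as `KeyValuePair<string, string>` values; the `BulletGrid` class and `ToGrid` method are hypothetical names used for illustration, not part of the original answer:

```csharp
using System.Collections.Generic;

public static class BulletGrid
{
    // Hypothetical helper: copies (title, body) pairs into a
    // rectangular string[n, 2] array, one row per bullet point.
    public static string[,] ToGrid(IList<KeyValuePair<string, string>> items)
    {
        var grid = new string[items.Count, 2];
        for (int i = 0; i < items.Count; i++)
        {
            grid[i, 0] = items[i].Key;   // column 0: title
            grid[i, 1] = items[i].Value; // column 1: body
        }
        return grid;
    }
}
```

With the LINQ result, the pairs could be built by projecting each item with `new KeyValuePair<string, string>(r.Title, r.Body)` and calling `ToList()` before passing them to `ToGrid`.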

Extension method:

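// Must be declared inside a static class.
// Splits source into runs: x joins the current group g while
// adjacent(g, x) is true, otherwise x starts a new group.
public static IEnumerable<IEnumerable<T>> GroupAdjacent<T>(
    this IEnumerable<T> source, Func<IEnumerable<T>, T, bool> adjacent)
{
    var g = new List<T>();
    foreach (var x in source)
    {
        if (g.Count != 0 && !adjacent(g, x))
        {
            yield return g;
            g = new List<T>();
        }
        g.Add(x);
    }
    // Skip the trailing group when source is empty.
    if (g.Count != 0)
        yield return g;
}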
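To see how the grouping predicate behaves on its own, here is a small self-contained sketch. The `AdjacentGrouping` class and the `SplitAtNumberedLines` helper are hypothetical names added for illustration, and this version of `GroupAdjacent` returns nothing for an empty source; it also assumes every line is non-empty, since `x[0]` would throw otherwise:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AdjacentGrouping
{
    // x stays in the current group g while adjacent(g, x) is true;
    // otherwise g is emitted and x starts a new group.
    public static IEnumerable<IEnumerable<T>> GroupAdjacent<T>(
        this IEnumerable<T> source, Func<IEnumerable<T>, T, bool> adjacent)
    {
        var g = new List<T>();
        foreach (var x in source)
        {
            if (g.Count != 0 && !adjacent(g, x))
            {
                yield return g;
                g = new List<T>();
            }
            g.Add(x);
        }
        if (g.Count != 0)
            yield return g;
    }

    // Hypothetical demo helper: a line starting with a digit opens a new group.
    // Assumes no line is empty.
    public static List<List<string>> SplitAtNumberedLines(IEnumerable<string> lines)
    {
        return lines
            .GroupAdjacent((g, x) => !char.IsDigit(x[0]))
            .Select(g => g.ToList())
            .ToList();
    }
}
```

Given the lines "1 A", "x", "y", "2 B", "z", the helper produces two groups: the first holds "1 A", "x", "y" and the second holds "2 B", "z". The first element of each group is the title line, which is why the query takes `g.First()` as the title and joins `g.Skip(1)` as the body.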

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/8462780/