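I have some fragments of text that I want to split into lines. The problem is that they are already formatted, so I can't split them the way I normally would: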
_text = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
.ToArray();
Here is the sample text:
adj 1: around the middle of a scale of evaluation of physical
measures; "an orange of average size"; "intermediate
capacity"; "a plane with intermediate range"; "medium
bombers" [syn: {average}, {intermediate}]
2: (of meat) cooked until there is just a little pink meat
inside
n 1: a means or instrumentality for storing or communicating
information
2: the surrounding environment; "fish require an aqueous
medium"
3: an intervening substance through which signals can travel as
a means for communication
4: (bacteriology) a nutrient substance (solid or liquid) that
is used to cultivate micro-organisms [syn: {culture medium}]
5: an intervening substance through which something is
achieved; "the dissolving medium is called a solvent"
6: a liquid with which pigment is mixed by a painter
7: (biology) a substance in which specimens are preserved or
displayed
8: a state that is intermediate between extremes; a middle
position; "a happy medium"
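The format is always the same:
- possibly a word of 1-3 letters
- a number from 1 to 10
- a colon
- a space
- text that may run over several lines.
So in this case a new line has to start at something that looks like a 1-3 character word, followed by a 1-2 digit number, and then a ":"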
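The format described above can be written down as a regular expression. The following is only a minimal sketch: the class `EntryHeader` and the method `StartsEntry` are hypothetical names introduced for illustration, and the pattern assumes the optional word consists of ASCII letters and that a line may carry leading whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntryHeader
{
    // Optional leading whitespace, an optional 1-3 letter word plus a space,
    // a 1-2 digit number, a colon and a space.
    private static readonly Regex Pattern =
        new Regex(@"^\s*(?:[A-Za-z]{1,3} )?\d{1,2}: ");

    // Hypothetical helper: true when the given line starts a new entry.
    public static bool StartsEntry(string line) => Pattern.IsMatch(line);

    public static void Main()
    {
        Console.WriteLine(StartsEntry("adj 1: around the middle of a scale")); // True
        Console.WriteLine(StartsEntry("2: (of meat) cooked until"));           // True
        Console.WriteLine(StartsEntry("measures; \"an orange\""));             // False
    }
}
```

Any line for which `StartsEntry` returns false would then be treated as a continuation of the entry before it.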
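Can someone give me some advice on how to do this with a split or some other method?
Update: I have Steven's answer, but I'm not quite sure how to work it into my function. Here is my original code (commented out), followed by the answer Steven suggested, with the part I'm unsure about still missing: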
public parser(string text)
{
//_text = text.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
// .ToArray();
string pattern = @"(\w{1,3} )?1?\d: (?<line>[^\r\n]+)(\r?\n\s+(?<line>[^\r\n]+))*";
foreach (Match m in Regex.Matches(text, pattern))
{
if (m.Success)
{
string entry = string.Join(Environment.NewLine,
m.Groups["line"].Captures.Cast<Capture>().Select(x => x.Value));
// ...
}
}
}
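For testing purposes, here is the text in a different format:
"medium\n adj 1: around the middle of a scale of evaluation of physical\n measures; \"an orange of average size\"; \"intermediate\n capacity\"; \"a plane with intermediate range\"; \"medium\n bombers\" [syn: {average}, {intermediate}]\n 2: (of meat) cooked until there is just a little pink meat inside\n n 1: a means or instrumentality for storing or communicating\n information\n 2: the surrounding environment; \"fish require an aqueous\n medium\"\n 3: an intervening substance through which signals can travel\n as a means for communication\n 4: (bacteriology) a nutrient substance (solid or liquid) that\n is used to cultivate micro-organisms [syn: {culture medium}]\n 5: an intervening substance through which something is\n achieved; \"the dissolving medium is called a solvent\"\n 6: a liquid with which pigment is mixed by a painter\n 7: (biology) a substance in which specimens are preserved or displayed\n 8: a state that is intermediate between extremes; a middle position; \"a happy medium\"\n 9: someone who serves as an intermediary between the living and the dead;\n \"he consulted several mediums\" [syn: {spiritualist}]\n 10: transmissions that are disseminated widely to the public\n [syn: {mass medium}]\n 11: an occupation for which you are especially well suited; \"in law he found his true metier\" [syn: {metier}]\n [also: {media} (pl)]\n"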
Best answer
Regex works very well for this. For example:
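// Requires: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
public parser(string text)
{
    // First <line> capture: an entry header (optional 1-3 character word, number, ": ") and the rest of that line.
    // Each further <line> capture: an indented continuation line that does not itself start a new entry.
    // \s* in the lookahead lets the header be indented by any amount of whitespace.
    string pattern = @"(?<line>(\w{1,3} )?1?\d: [^\r\n]+)(\r?\n(?!\s*(\w{1,3} )?1?\d: [^\r\n]+)\s+(?<line>[^\r\n]+))*";
    var entries = new List<string>();
    foreach (Match m in Regex.Matches(text, pattern))
    {
        // Join the header line and its continuation lines into a single entry.
        entries.Add(string.Join(" ",
            m.Groups["line"].Captures.Cast<Capture>().Select(x => x.Value)));
    }
    _text = entries.ToArray();
}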
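Since the question also asks about a split-based approach, here is one possible sketch using `Regex.Split` with a lookahead, so that the text is cut only at line breaks that are followed by an entry header. This is an alternative to the pattern above, not a restatement of it. `EntrySplitter`, `SplitEntries` and the short `sample` string are hypothetical and exist only for illustration, and the header is written as `\d{1,2}` rather than `1?\d`.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EntrySplitter
{
    // Entry header: optional 1-3 character word, a 1-2 digit number, a colon and a space.
    // The group is non-capturing so that Regex.Split does not emit it as an extra element.
    private const string Header = @"(?:\w{1,3} )?\d{1,2}: ";

    // Hypothetical helper: returns one string per numbered entry,
    // with wrapped continuation lines joined by a single space.
    public static string[] SplitEntries(string text)
    {
        // Cut at every line break that is followed by an (optionally indented) header.
        string[] chunks = Regex.Split(text, @"\r?\n(?=[ \t]*" + Header + ")");

        return chunks
            // Drop anything that is not an entry, such as the headword line.
            .Where(c => Regex.IsMatch(c, @"^[ \t]*" + Header))
            // Unwrap continuation lines inside each entry.
            .Select(c => Regex.Replace(c.Trim(), @"\s*\r?\n\s*", " "))
            .ToArray();
    }

    public static void Main()
    {
        // Shortened, made-up sample in the same layout as the test text.
        string sample = "medium\n adj 1: around the middle\n of a scale\n 2: (of meat) cooked\n n 1: a means\n";
        foreach (string entry in SplitEntries(sample))
            Console.WriteLine(entry);
    }
}
```

Inside the constructor this would be used as `_text = EntrySplitter.SplitEntries(text);`. Compared with the capture-based pattern, the lookahead keeps the header text inside each piece, at the cost of a second pass to unwrap the continuation lines.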
Regarding "c# - How to split text into lines based on a regular expression?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38126964/