File format
POS ID PosScore NegScore SynsetTerms Gloss
a 00001740 0.125 0 able#1 "able to swim"; "she was able to program her computer";
a 00002098 0 0.75 unable#1 "unable to get to town without a car";
a 00002312 0 0 dorsal#2 abaxial#1 "the abaxial surface of a leaf is the underside or side facing away from the stem"
a 00002843 0 0 basiscopic#1 facing or on the side toward the base
a 00002956 0 0.23 abducting#1 abducent#1 especially of muscles; drawing away from the midline of the body or from an adjacent part
a 00003131 0 0 adductive#1 adducting#1 adducent#1 especially of muscles;
From this file I want to extract the ID, PosScore, NegScore and SynsetTerms fields. Extracting ID, PosScore and NegScore is easy; I use the following code for those fields.
Regex expression = new Regex(@"(\t(\d+)|(\w+)\t)");
var results = expression.Matches(input);
foreach (Match match in results)
{
    Console.WriteLine(match);
}
Console.ReadLine();
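It gives the correct result, but the SynsetTerms field causes a problem: some lines contain two or more words, so how do I separate the words and pair each one with its PosScore and NegScore?
For example, the fifth line has two words, abducting#1 and abducent#1, but both share the same scores.
So what regular expression would return each word together with the scores of its line, for example: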
Word PosScore NegScore
abducting#1 0 0.23
abducent#1 0 0.23
Best answer
A non-regex version based on string splitting is probably easier:
var data =
    lines.Split(new[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
        .Skip(1)
        .Select(line => line.Split('\t'))
        .SelectMany(parts => parts[4].Split().Select(word => new
        {
            ID = parts[1],
            Word = word,
            PosScore = decimal.Parse(parts[2]),
            NegScore = decimal.Parse(parts[3])
        }));
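To see the approach end to end, here is a minimal self-contained sketch that runs the same split-based query on two rows from the question. It makes several assumptions that are not in the original post: the columns are separated by tab characters (which the answer's `Split('\t')` implies), the program is compiled as a C# 9+ top-level program on .NET 5 or later, and the scores use `.` as the decimal separator, so `CultureInfo.InvariantCulture` is passed to `decimal.Parse`. The `Where` guard for short rows is also an addition for illustration.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Sample rows copied from the question; the tab placement is an assumption.
string lines = string.Join(Environment.NewLine, new[]
{
    "POS\tID\tPosScore\tNegScore\tSynsetTerms\tGloss",
    "a\t00002098\t0\t0.75\tunable#1\t\"unable to get to town without a car\";",
    "a\t00002956\t0\t0.23\tabducting#1 abducent#1\tespecially of muscles; drawing away from the midline of the body or from an adjacent part",
});

var data = lines
    .Split(new[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
    .Skip(1)                                  // skip the header row
    .Select(line => line.Split('\t'))         // one array of columns per row
    .Where(parts => parts.Length >= 5)        // added guard: ignore malformed rows
    .SelectMany(parts => parts[4]
        .Split(' ', StringSplitOptions.RemoveEmptyEntries) // one entry per synset term
        .Select(word => new
        {
            ID = parts[1],
            Word = word,
            // InvariantCulture so "0.23" parses the same way on every OS locale
            PosScore = decimal.Parse(parts[2], CultureInfo.InvariantCulture),
            NegScore = decimal.Parse(parts[3], CultureInfo.InvariantCulture)
        }))
    .ToList();

// Print one line per word, repeating the scores of the row it came from.
Console.WriteLine("Word\tPosScore\tNegScore");
foreach (var row in data)
{
    Console.WriteLine(string.Join("\t",
        row.Word,
        row.PosScore.ToString(CultureInfo.InvariantCulture),
        row.NegScore.ToString(CultureInfo.InvariantCulture)));
}
```

Each word in SynsetTerms becomes its own record, so abducting#1 and abducent#1 both carry the scores 0 and 0.23 from their shared row. If the real file separates columns with spaces rather than tabs, the `Split('\t')` step would need to change accordingly.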
Source: "c# - Regex in C# to match sentences of different formats", a similar question on Stack Overflow: https://stackoverflow.com/questions/15142156/