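I have a question about a string manipulation method I've written. Its purpose is to find link tags in a long string and reformat their hrefs.
To give some context, I'm parsing a large number of HTML files on a CD and collating the results into XML files for a website in a separate project (I'm writing this as part of a console application). The HTML files contain instructional text with links that are relative to the files on the CD, and I need to change the hrefs so they are relative to the website where the information will live.
The code below seems to work fine if there is only one link tag, but pass in two and the output gets very confused. The odd thing is that Visual Studio's regex editor claims the linkTag regex below matches only the link tags, yet when it comes to replacing the links with the correct href, fragments of links get inserted at different positions in the instructions string.
The reason for the extra alphaDir regex is that I will eventually extend this method to correct links with different starting hrefs. We're talking about parsing thousands of HTML files, but this format is by far the most common.
I'm a bit stumped by this, as I'm very much a regex beginner and wrote all of the regexes below myself, so any thoughts on any of them would be great too.
Typical input string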
Hold 1st <strong><a href="../f/fist_hand.html">FIST</a></strong> hand, back outward
& fingers forward, and put 2nd <strong><a href="../f/fist_hand.html">FIST</a></strong> hand, back forward
& fingers inward, with lower knuckle of its 4th finger on
lower knuckle of 1st thumb; then slide 2nd hand forwards one
hand's length.
Method
static string instructions(string instructions)
{
    Regex Spaces = new Regex(@"\s+|\n|\r");
    Regex linkTag = new Regex(@"<a(.*?)>(.*?)<\/a>");
    Regex linkTagHtml = new Regex(@"<a(.*?)>|<\/a>");
    Regex hrefAttr = new Regex("href=\"(.)*?\"");
    Regex alphaDir = new Regex(@"/([a-z])?/");
    string signName = string.Empty;
    char alphaChar;
    string replacementLinkTag = string.Empty;
    string replacementHref = string.Empty;
    instructions = Spaces.Replace(instructions, " ");
    MatchCollection matches = linkTag.Matches(instructions);
    foreach (Match link in matches)
    {
        Match alphaDirMatch = alphaDir.Match(link.Value.ToString());
        if (alphaDirMatch.Success)
        {
            Match hrefAttrMatch = hrefAttr.Match(link.Value.ToString());
            if (hrefAttrMatch.Success)
            {
                signName = linkTagHtml.Replace(link.Value.ToString(), string.Empty).ToLower().Trim();
                signName = signName.Replace(" ", "_");
                alphaChar = signName[0];
                replacementHref = "href=\"/pages/displayc.aspx?c=dictionary&alpha=" + alphaChar.ToString() +"&sign=" + signName + "\"";
                replacementLinkTag = hrefAttr.Replace(link.Value.ToString(), replacementHref);
                instructions = instructions.Remove(link.Index, link.Length);
                instructions = instructions.Insert(link.Index, replacementLinkTag);
            }
        }
    }
    return instructions;
}
Current output string
Hold 1st <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back outward & finge<a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a>f="../f/fist_hand.html">FIST</a></strong> hand, back forward & fingers inward, with lower knuckle of its 4th finger on lower knuckle of 1st thumb; then slide 2nd hand forwards one hand's length.
Desired output string
Hold 1st <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back outward & fingers forward, and put 2nd <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back forward & fingers inward, with lower knuckle of its 4th finger on lower knuckle of 1st thumb; then slide 2nd hand forwards one hand's length.
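Solution - thanks to Oded's suggestion!
I used HtmlAgilityPack to load the instructions string as HTML, selected the link tags into an HtmlNodeCollection, then looped over each one, read its href value, and edited it.
For those interested, the code ended up looking like this: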
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static string instructions(string instructions)
{
    char alphaChar;
    Regex Spaces = new Regex(@"\s+|\n|\r");
    Regex alphaDir = new Regex(@"/([a-z])?/");
    string signName = string.Empty;
    string replacementHref = string.Empty;

    // Collapse runs of whitespace and line breaks into single spaces.
    instructions = Spaces.Replace(instructions, " ");

    // Parse the fragment so links are edited as nodes, not as string offsets.
    HtmlDocument instr = new HtmlDocument();
    instr.LoadHtml(instructions);
    HtmlNodeCollection links = instr.DocumentNode.SelectNodes("//a");

    // SelectNodes returns null when nothing matches.
    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", string.Empty);
            if (!string.IsNullOrWhiteSpace(href))
            {
                Match alphaDirMatch = alphaDir.Match(href);
                if (alphaDirMatch.Success)
                {
                    // Strip the leading directory part and the ".html" extension
                    // (the dot is escaped so only a literal ".html" is removed).
                    signName = Regex.Replace(href, @"(.)*?/([a-z])?/|(\.html)?", string.Empty);
                    signName = signName.Replace(" ", "_");
                    alphaChar = signName[0];
                    replacementHref = "/pages/displayc.aspx?c=dictionary&alpha=" + alphaChar.ToString() + "&sign=" + signName;
                    link.SetAttributeValue("href", replacementHref);
                }
            }
        }
    }

    instructions = instr.DocumentNode.InnerHtml;
    return instructions;
}
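To make the href transformation in that solution easier to follow on its own, here is a minimal sketch of just that step, without HtmlAgilityPack. The class name HrefRewriter and the method name ToSiteHref are hypothetical, and the sketch assumes hrefs shaped like "../f/fist_hand.html"; unlike the alphaDir pattern above, it requires an actual letter in the directory part.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original solution.
public static class HrefRewriter
{
    // Everything up to and including the single-letter directory, e.g. "../f/".
    private static readonly Regex LeadingAlphaDir = new Regex(@"^.*?/[a-z]/");

    // A trailing ".html" extension (the dot is escaped).
    private static readonly Regex HtmlExtension = new Regex(@"\.html$");

    // Returns the site-relative URL for a CD-relative href, or null when the
    // href has no single-letter directory such as "/f/" or no file name.
    public static string ToSiteHref(string cdHref)
    {
        if (string.IsNullOrWhiteSpace(cdHref) || !LeadingAlphaDir.IsMatch(cdHref))
            return null;

        string signName = LeadingAlphaDir.Replace(cdHref, string.Empty);
        signName = HtmlExtension.Replace(signName, string.Empty).Replace(" ", "_");
        if (signName.Length == 0)
            return null;

        return "/pages/displayc.aspx?c=dictionary&alpha=" + signName[0] + "&sign=" + signName;
    }
}
```

Note that deriving the sign name from the href gives "fist_hand" for the sample link, whereas the desired output in the question (built from the link text) shows "fist"; which one is right depends on what the site expects.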
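Best answer
I suggest trying the HTML Agility Pack to parse and query your HTML documents.
Using RegEx can be quite brittle, and if the documents are not very uniform it may be an unworkable approach - see this SO answer.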
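As for why the regex loop in the question garbles its output: every Match returned by linkTag.Matches carries an Index into the string as it was when the matches were computed. The first Remove/Insert pair changes the string's length (the new href is longer than the old one), so the second match's Index points at the wrong position. If the markup really is uniform enough to stay with regex, one way around this is to let Regex.Replace build the result through a MatchEvaluator, so no offsets are ever reused. The following is only a sketch under that assumption; the class and method names are hypothetical, and it takes the sign name from the link text, as the question's code does.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the MatchEvaluator approach.
public static class LinkRewriter
{
    // Group 1: attributes before href, group 2: href value,
    // group 3: attributes after href, group 4: link text.
    private static readonly Regex LinkTag = new Regex(
        @"<a\b([^>]*?)href=""([^""]*)""([^>]*)>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Only hrefs that contain a single-letter directory are rewritten.
    private static readonly Regex AlphaDir = new Regex(@"/[a-z]/");

    public static string RewriteLinks(string instructions)
    {
        // Regex.Replace calls the evaluator once per match and assembles the
        // output itself, so replacements of different lengths cannot shift
        // the position of later matches.
        return LinkTag.Replace(instructions, match =>
        {
            string href = match.Groups[2].Value;
            if (!AlphaDir.IsMatch(href))
                return match.Value; // leave unrelated links untouched

            string signName = match.Groups[4].Value.Trim().ToLowerInvariant().Replace(" ", "_");
            if (signName.Length == 0)
                return match.Value;

            string newHref = "/pages/displayc.aspx?c=dictionary&alpha=" + signName[0] + "&sign=" + signName;
            return "<a" + match.Groups[1].Value + "href=\"" + newHref + "\""
                + match.Groups[3].Value + ">" + match.Groups[4].Value + "</a>";
        });
    }
}
```

This still inherits the usual limits of regex on HTML (single-quoted attributes, nested tags inside the link text, and so on), which is why a real parser is the safer choice for thousands of files.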
Source: the original question on Stack Overflow, https://stackoverflow.com/questions/8166702/