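I need to convert an HTML string to plain text (preferably using HTML Agility Pack), with proper whitespace and, especially, proper line breaks.
By "proper line breaks" I mean that this code:

    <div>
        <div>
            <div>
                line1
            </div>
        </div>
    </div>
    <div>line2</div>

should be converted to

    line1
    line2

i.e., with only one line break between the two lines.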
Most of the solutions I've seen simply convert every <div>, <br> and <p> tag to \n, which obviously s*cks.
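Any suggestions for HTML-to-plaintext rendering logic in C#? It doesn't have to be complete code; even a general rule such as "replace every closing DIV with a line break, but only if the next sibling is not also a DIV" would really help.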
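Things I've tried: simply reading the .InnerText property (obviously wrong); regular expressions (slow, painful, lots of hacks; the regexes were 12 times slower than HtmlAgilityPack, I measured); this solution and similar ones (they return more line breaks than needed).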
Best answer
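The code below works with the provided example and even handles some odd cases such as <div><br></div>. There is still room for improvement, but the basic idea is there. See the comments.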
    using HtmlAgilityPack;

    // Extension methods must live in a static, non-nested class.
    public static class HtmlTextFormatter
    {
        public static string FormatLineBreaks(string html)
        {
            // First, remove all existing line breaks from the HTML:
            // they mean nothing in HTML, but they break our logic.
            html = html.Replace("\r", "").Replace("\n", " ");

            // Now create an HTML Agility Pack document object.
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Remove comments, head, style and script tags.
            foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//comment() | //script | //style | //head"))
            {
                node.ParentNode.RemoveChild(node);
            }

            // Now unwrap all "meaningless" inline elements like "span" (add "b", "i" if required).
            foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//span | //label"))
            {
                // RemoveChild(node, true) drops the element but keeps its children in place.
                // HtmlNode.CreateNode(node.InnerHtml) cannot represent more than one child node.
                node.ParentNode.RemoveChild(node, true);
            }

            // Block elements: convert to line breaks (you could add more tags here).
            foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//p | //div"))
            {
                // We add a line break ONLY if the node contains some plain text as a "direct" child,
                // meaning the text is not nested inside children, but is only one level deep.
                // Use XPath to find the first "direct" text node in the element.
                var txtNode = node.SelectSingleNode("text()");

                // No "direct" text: NOT adding the line break.
                if (txtNode == null || txtNode.InnerHtml.Trim() == "") continue;

                // "Surround" the node with line breaks.
                node.ParentNode.InsertBefore(doc.CreateTextNode("\r\n"), node);
                node.ParentNode.InsertAfter(doc.CreateTextNode("\r\n"), node);
            }

            // todo: might need to collapse multiple consecutive line breaks into one here, I'm still testing...

            // Now BR tags: simply replace with a line break and forget.
            foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//br"))
                node.ParentNode.ReplaceChild(doc.CreateTextNode("\r\n"), node);

            // todo: you should probably add "&code;" processing, to decode HTML entities such as &nbsp;

            // Finally, return the text, which will have our inserted line breaks in it.
            return doc.DocumentNode.InnerText.Trim();
        }

        // Here's the extension method I use: SelectNodes returns null when nothing matches.
        private static HtmlNodeCollection SafeSelectNodes(this HtmlNode node, string selector)
        {
            return (node.SelectNodes(selector) ?? new HtmlNodeCollection(node));
        }
    }
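
The two todo comments in the code above (collapsing repeated line breaks and decoding entities) could be handled by a small post-processing step applied to the string returned by FormatLineBreaks. The following is only a minimal sketch under stated assumptions, not part of the original answer: the class name PlainTextCleanup and the method Normalize are hypothetical, and it assumes that a single line break between lines is always what you want, so intentional blank lines are lost as well. It uses only System.Net.WebUtility and System.Text.RegularExpressions from the base class library.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original answer): cleans up the text
// produced by an HTML-to-text step such as FormatLineBreaks.
public static class PlainTextCleanup
{
    public static string Normalize(string text)
    {
        // Decode HTML entities such as &amp; and &nbsp;.
        text = WebUtility.HtmlDecode(text);

        // &nbsp; decodes to U+00A0; treat it as an ordinary space.
        text = text.Replace('\u00A0', ' ');

        // Collapse runs of spaces and tabs into a single space.
        text = Regex.Replace(text, @"[ \t]+", " ");

        // Collapse any run of line breaks, together with the single spaces
        // around them, into exactly one "\n".
        text = Regex.Replace(text, @" ?(\r?\n ?)+", "\n");

        return text.Trim();
    }
}
```

For example, PlainTextCleanup.Normalize("line1 \r\n   \r\nline2") returns "line1\nline2". If blank lines between paragraphs should survive, the last replacement would need to cap runs at two line breaks instead of one.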
Source: the Stack Overflow question "c# - Convert (render) HTML to text with correct line breaks": https://stackoverflow.com/questions/29995333/