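xml - Mixed content and string manipulation cleanup

Tags: xml xslt xslt-2.0

I am in the middle of the very painful process of converting Word-based documents to XML, and I have run into the following problem: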

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <p>
        <element>This one is taken care of.</element> Some more text. „<hi rend="italics">Is this a
            quote</hi>?” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is a
            quote</hi>” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is
            definitely a quote</hi>!” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text.„<hi rend="italics">This is a
            first quote</hi>” (Source). „<hi rend="italics">Sometimes there is a second quote as
            well</hi>!?” (Source). </p>

</root>

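The <p> nodes have mixed content. I already took care of <element> in an earlier pass. The problem now is that each quote and its source appear partly inside <hi rend="italics"/> and partly as text nodes.

How can I use XSLT 2.0 to:

  1. Match every <hi rend="italics"> node that is immediately preceded by a text node whose last character is „?
  2. Output the content of <hi rend="italics"> as <quote>...</quote>, dropping the quotation marks („ and ”) but including inside <quote/> any question marks and exclamation marks that immediately follow the <hi rend="italics"> as its sibling?
  3. Convert the text between "(" and ")" in the text node that follows <hi rend="italics"> into <source>...</source>, without the parentheses?
  4. Keep the final full stop?

In other words, my output should look like this: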

<root>
<p>
<element>This one is taken care of.</element> Some more text. <quote>Is this a quote?</quote> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a quote</quote> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is definitely a quote!</quote> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as well!?</quote> <source>Source</source>. 
</p>

</root>

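I have never dealt with mixed content and string manipulation like this, and the whole thing is really throwing me off. I would be very grateful for any tips.

Best answer

This transformation: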

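<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes"/>

 <!-- Identity rule: copy every node and attribute unchanged by default. -->
 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <!-- An italic <hi> whose immediately preceding sibling is a text node
      ending in „ becomes a <quote>. -->
 <xsl:template match=
  "hi[@rend='italics'
     and
      preceding-sibling::node()[1][self::text()[ends-with(., '„')]]
      ]">

  <quote>
    <!-- Append any ? or ! that starts the next text sibling.
         The 's' flag lets '.' also match newlines in that text. -->
    <xsl:value-of select=
     "concat(.,
             if(matches(following-sibling::text()[1], '^[?!]+'))
              then replace(following-sibling::text()[1], '^([?!]+).*$', '$1', 's')
              else()
             )
      "/>
  </quote>
 </xsl:template>

 <!-- Text nodes: the [true()] predicate gives this rule a higher default
      priority than the identity rule above. -->
 <xsl:template match="text()[true()]">
  <xsl:variable name="vThis" select="."/>
  <!-- Drop the quotation marks and every ? and ! from the text node. -->
  <xsl:variable name="vThis2" select="translate($vThis, '„”?!', '')"/>

  <!-- Text before the first "(", or the whole node if there is none. -->
  <xsl:value-of select="substring-before(concat($vThis2, '('), '(')"/>
  <xsl:if test="contains($vThis2, '(')">
   <source>
    <xsl:value-of select=
      "substring-before(substring-after($vThis2, '('), ')')"/>
   </source>
   <xsl:value-of select="substring-after($vThis2, ')')"/>
  </xsl:if>
 </xsl:template>
</xsl:stylesheet>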

when applied to the provided XML document:

<root>
        <p>
            <element>This one is taken care of.</element> Some more text. „<hi rend="italics">Is this a
                quote</hi>?” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is a
                quote</hi>” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is
                definitely a quote</hi>!” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text.„<hi rend="italics">This is a
                first quote</hi>” (Source). „<hi rend="italics">Sometimes there is a second quote as
                well</hi>!?” (Source). </p>

</root>

produces the wanted, correct result:

<root>
        <p>
            <element>This one is taken care of.</element> Some more text. <quote>Is this a
                quote?</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text. <quote>This is a
                quote</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text. <quote>This is
                definitely a quote!</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text.<quote>This is a
                first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as
                well!?</quote> <source>Source</source>. </p>

</root>
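
The text-node template in the stylesheet removes every ? and ! from every text node and converts only the first parenthesised group in each one. That is fine for the sample, but it may be too broad for a larger document. The following is a rough sketch of a narrower variant, not part of the original answer. It assumes the closing ” always comes right after the optional ?/! at the start of a text node, and the opening „ is always the last character of a text node. The variable name vTrimmed is made up for illustration. It uses xsl:analyze-string so that several "(Source)" groups in one text node are all converted.

```xslt
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes"/>

  <!-- Identity rule: copy everything that no other template handles. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- Same quote rule as in the answer: an italic <hi> right after a text
       node ending in „ becomes <quote>, plus any leading ?/! of the next
       text sibling. -->
  <xsl:template match="hi[@rend='italics'
                          and preceding-sibling::node()[1][self::text()[ends-with(., '„')]]]">
    <quote>
      <xsl:value-of select=
        "concat(.,
                if (matches(following-sibling::text()[1], '^[?!]+'))
                then replace(following-sibling::text()[1], '^([?!]+).*$', '$1', 's')
                else ()
               )"/>
    </quote>
  </xsl:template>

  <!-- Text nodes (explicit priority so this rule wins over the identity rule). -->
  <xsl:template match="text()" priority="1">
    <!-- Assumption: only a leading run of ?/! followed by ” and a trailing „
         belong to a quote; other ? and ! characters are left alone. -->
    <xsl:variable name="vTrimmed"
                  select="replace(replace(., '^[?!]*”', ''), '„$', '')"/>
    <!-- Turn every "(...)" group into <source>...</source>. -->
    <xsl:analyze-string select="$vTrimmed" regex="\(([^\)]*)\)">
      <xsl:matching-substring>
        <source><xsl:value-of select="regex-group(1)"/></source>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

On the sample document this should give the same result as shown above. The difference only shows up on input where ordinary prose contains ? or !, or where one text node holds more than one parenthesised source.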

This question on mixed content and string manipulation cleanup in XML comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/12690177/
