java - Parsing XML returns no string if the CDATA contains no HTML tags

Tags: java android html xml

I am using a DOM parser on Android to read an RSS feed such as this one:

...
<item cbc:type="story" cbc:deptid="2.663" cbc:syndicate="true">
<title>
<![CDATA[
Asian carp have reproduced in Great Lakes watershed
]]>
</title>
<link>
http://www.cbc.ca/news/canada/windsor/asian-carp-have-reproduced-in-great-lakes-watershed-1.2286554?cmp=rss
</link>
<guid isPermaLink="false">1.2286554</guid>
<pubDate>Tue, 29 Oct 2013 08:06:48 EDT</pubDate>
<description>
<![CDATA[
<img title='Fisheries and Oceans Canada and the Ontario Ministry of Natural Resources confirmed one grass carp was caught in the Grand River near Lake Erie. ' height='259' alt='hi-20130502-grass_carp-dfo-852' width='460' src='http://i.cbc.ca/1.1663916.1379078358!/httpImage/image.jpg_gen/derivatives/16x9_460/hi-20130502-grass-carp-dfo-852.jpg' /> <p>Scientists said Monday they have documented for the first time that an Asian carp species has successfully reproduced within the Great Lakes watershed, an ominous development in the struggle to slam the door on the hungry invaders that could threaten native fish.</p>
]]>
</description>
</item>
...

xmlParser.class:

public class xmlParser {

public Document getDomElement(String rssFilePath, String fileName){
    Log.d("GET", ""+rssFilePath+fileName);
    Document doc = null;
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setCoalescing(true);
    FileInputStream fis;
    try {

        DocumentBuilder db = dbf.newDocumentBuilder();

        File tmp2 = new File (rssFilePath,"/"+ fileName);
        fis = new FileInputStream(tmp2);

        InputSource is = new InputSource();
            is.setByteStream(fis);
            doc = db.parse(is); 
        } catch (ParserConfigurationException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (SAXException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (IOException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        }
            // return DOM
   // Log.d("DOM", doc.toString());
        return doc;

}

public String getValue(Element item, String str) { 
    NodeList n = item.getElementsByTagName(str);        
    return this.getElementValue(n.item(0));
}

public final String getElementValue( Node elem ) {
         Node child;
         if( elem != null){
             if (elem.hasChildNodes()){
                 for( child = elem.getFirstChild(); child != null; child = child.getNextSibling() ){
                     if( child.getNodeType() == Node.TEXT_NODE  ){
                         return child.getNodeValue();
                     }
                 }
             }
         }
         return "";
  } 
}

From my main Activity:

//Parse the XML content
            xmlParser parser = new xmlParser();
            Log.d(TAG, "1");
            Document rssDoc = parser.getDomElement(rssFilePath, rssFileName);
            Log.d(TAG, "2");
            final NodeList nl = rssDoc.getElementsByTagName(KEY_ITEM);
            Log.d(TAG, "3");

            //Make it all look nice and strip HTML
            for (int i = 0; i < nl.getLength(); i++){

                Element e = (Element) nl.item(i);

                String noHtmlTitle = parser.getValue(e, KEY_TITLE).toString().replaceAll("\\<.*?>", "");
                noHtmlTitle = noHtmlTitle.replaceAll("\n", "");

                noHtmlTitle = noHtmlTitle.trim();

                titles.add(noHtmlTitle);

                String noHtmlDesc = parser.getValue(e, KEY_DESC).toString().replaceAll("\\<.*?>", "");
                noHtmlDesc = noHtmlDesc.trim(); 
                descs.add("\n" + noHtmlDesc);

            }

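However, when this code processes the <title></title> element shown above, it returns an empty string. This seems to be related to the title element not containing any HTML tags.

How can I retrieve a usable string from the title element?

If I have left out any required information, please let me know.

Edit:

As blahdiblah pointed out, the node type being returned is CDATA_SECTION_NODE. I modified the getElementValue method to handle this node type as well: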

public final String getElementValue( Node elem ) {
         Node child;
         if( elem != null){
             if (elem.hasChildNodes()){
                 for( child = elem.getFirstChild(); child != null; child = child.getNextSibling() ){
                     if( child.getNodeType() == Node.TEXT_NODE  ){
                         return child.getNodeValue();
                     }else if (child.getNodeType() == Node.CDATA_SECTION_NODE){
                         return child.getNodeValue();
                     }
                 }
             }
         }
         return "";
  } 

Best answer

Your xmlParser only returns the content of text nodes ( child.getNodeType() == Node.TEXT_NODE ), but the content of <title> is a node of type CDATA_SECTION_NODE.
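To see which node types the parser actually produces, it can help to list the children of the element. The following is a minimal diagnostic sketch, not part of the original code: the class name NodeTypeDump and the method childTypes are hypothetical, and it parses an in-memory string instead of a file. The result depends on the parser implementation and on the coalescing setting, so the behaviour on Android may differ from a desktop JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeTypeDump {

    // Hypothetical helper: returns the node types of the root element's
    // children as a comma-separated list, e.g. "TEXT,CDATA,TEXT".
    public static String childTypes(String xml, boolean coalescing) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setCoalescing(coalescing);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Node root = db.parse(new InputSource(new StringReader(xml))).getDocumentElement();

        StringBuilder sb = new StringBuilder();
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            switch (child.getNodeType()) {
                case Node.TEXT_NODE:          sb.append("TEXT");  break;
                case Node.CDATA_SECTION_NODE: sb.append("CDATA"); break;
                default:                      sb.append("OTHER"); break;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same shape as the <title> element in the feed: a line break,
        // a CDATA section, and another line break.
        String xml = "<title>\n<![CDATA[\nAsian carp\n]]>\n</title>";
        System.out.println(childTypes(xml, false));
        System.out.println(childTypes(xml, true));
    }
}
```

If the list starts with a whitespace-only text node, a method that returns the first matching child will return that whitespace, which becomes an empty string after trim().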

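Note that the title is almost certainly sent as CDATA rather than plain text so that it can contain HTML formatting and other unusual characters. Be sure to test with a variety of inputs to make sure it is parsed correctly.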
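One way to cover such a variety of inputs is to collect every text and CDATA child instead of returning the first match. The sketch below is only an illustration under stated assumptions: the class name CdataTextDemo and the methods textOf and parseRoot are hypothetical, the XML is parsed from in-memory strings, and the sample titles other than the first are made up for testing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class CdataTextDemo {

    // Hypothetical variant of getElementValue: concatenates all TEXT_NODE and
    // CDATA_SECTION_NODE children, then trims the surrounding whitespace.
    public static String textOf(Node elem) {
        if (elem == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Hypothetical helper: parses an XML string and returns its root element.
    public static Element parseRoot(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml))).getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String[] samples = {
            "<title>\n<![CDATA[\nAsian carp have reproduced in Great Lakes watershed\n]]>\n</title>",
            "<title>Plain text title</title>",       // made-up sample: no CDATA
            "<title><![CDATA[<b>Bold</b>]]></title>", // made-up sample: HTML inside CDATA
            "<title/>"                                // made-up sample: empty element
        };
        for (String xml : samples) {
            System.out.println("[" + textOf(parseRoot(xml)) + "]");
        }
    }
}
```

Because the values are concatenated, a whitespace text node in front of the CDATA section no longer hides the title. Any HTML inside the CDATA section is returned verbatim, so it still has to be stripped afterwards, as the Activity code already does.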

Regarding "java - Parsing XML returns no string if the CDATA contains no HTML tags", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19667116/
