php - How to parse the <media:content> tag in RSS with SimpleXML

Tags: php tags rss simplexml media

The structure of my RSS feed, taken from http://rss.cnn.com/rss/edition.rss, is:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://rss.cnn.com/~d/styles/itemcontent.css"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <title><![CDATA[CNN.com - RSS Channel - Intl Homepage - News]]></title>
    <description><![CDATA[CNN.com delivers up-to-the-minute news and information on the latest top stories, weather, entertainment, politics and more.]]></description>
    <link>http://www.cnn.com/intl_index.html</link>
    ...

    <item>
      <title><![CDATA[Russia responds to claims it has damaging material on Trump]]></title>
      <description><![CDATA[The Kremlin denied it has compromising information about US President-elect Donald Trump, describing the allegations as "pulp fiction".]]></description>
      <link>http://www.cnn.com/2017/01/11/politics/russia-rejects-trump-allegations/index.html</link>
      <guid isPermaLink="true">http://www.cnn.com/2017/01/11/politics/russia-rejects-trump-allegations/index.html</guid>
      <pubDate>Wed, 11 Jan 2017 14:44:49 GMT</pubDate>
      <media:group>
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-super-169.jpg" height="619" width="1100" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-large-11.jpg" height="300" width="300" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-vertical-large-gallery.jpg" height="552" width="414" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-video-synd-2.jpg" height="480" width="640" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-live-video.jpg" height="324" width="576" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-t1-main.jpg" height="250" width="250" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-vertical-gallery.jpg" height="360" width="270" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-story-body.jpg" height="169" width="300" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-t1-main.jpg" height="250" width="250" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-assign.jpg" height="186" width="248" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-hp-video.jpg" height="144" width="256" />
      </media:group>
    </item>
    ...

  </channel>
</rss>

If you parse this XML with SimpleXML like this:

  $rss = simplexml_load_file($url, null, LIBXML_NOCDATA);

  $rssjson = json_encode($rss);
  $rssarray = json_decode($rssjson, TRUE);

you will see that the <media:content> items are completely missing from $rssarray. So I found a tutorial that solves this with namespaces. However, in its example the author uses:

foreach ($xml->channel->item as $item) { ... }

But I am using the following (for various reasons I cannot iterate over the SimpleXML object with foreach):

$rssjson = json_encode($rss);
$rssarray = json_decode($rssjson, TRUE);

So I adapted the solution to my case:

  $rss = simplexml_load_file($url, null, LIBXML_NOCDATA);
  $namespaces = $rss->getNamespaces(true); // get namespaces

  $rssjson = json_encode($rss);
  $rssarray = json_decode($rssjson, TRUE);

  if (isset($rssarray['channel']['item'])) {
    foreach ($rssarray['channel']['item'] as $key => $item) {

      $media_content = $rss->channel->item[$key]->children($namespaces['media']);
      foreach($media_content as $tag) {

        $tagjson = json_encode($tag);
        $tagarray = json_decode($tagjson, TRUE);

      }

    }
  }

But it does not work. For every item, what I get in $tagarray is an array with this structure:

Array(
  'content' => array(
     '0' => array(null),
     '1' => array(null),
     ...
     '11' => array(null),
   )
)

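It is an array with as many entries as there are <media:content> tags, but every entry is empty. I need the url attribute of each entry. What am I doing wrong, and why do I get empty arrays?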

Best answer

The tags really are empty:

<media:content ... />
                   ^^

The information is held in the attributes, which you can read with SimpleXMLElement::attributes(), for example:

$rss = simplexml_load_file($url, null, LIBXML_NOCDATA);
$namespaces = $rss->getNamespaces(true);
$media_content = $rss->channel->item[0]->children($namespaces['media']);
foreach($media_content->group->content as $i){
    var_dump((string)$i->attributes()->url);
}
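
As a self-contained illustration of the same idea, the sketch below collects every attribute of each <media:content> element into a plain PHP array. It is only a sketch under stated assumptions: the inline feed is a trimmed stand-in for the CNN feed, the example.com URLs are placeholders, and the helper name mediaContentAttributes() and the MEDIA_NS constant are hypothetical names introduced here. It also assumes every item contains a <media:group>.

```php
<?php
// Trimmed stand-in for the feed in the question; example.com URLs are placeholders.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title>First</title>
      <media:group>
        <media:content medium="image" url="http://example.com/a.jpg" height="619" width="1100" />
        <media:content medium="image" url="http://example.com/b.jpg" height="300" width="300" />
      </media:group>
    </item>
  </channel>
</rss>
XML;

// Namespace URI declared by xmlns:media in the feed (hypothetical constant name).
const MEDIA_NS = 'http://search.yahoo.com/mrss/';

// Hypothetical helper: returns one array of attribute name => value per <media:content>.
function mediaContentAttributes(SimpleXMLElement $item): array
{
    $result = [];
    // children() with the namespace URI exposes the media:* child elements.
    foreach ($item->children(MEDIA_NS)->group->content as $content) {
        $attrs = [];
        // attributes() without arguments returns the un-namespaced attributes.
        foreach ($content->attributes() as $name => $value) {
            $attrs[$name] = (string) $value; // cast to get a plain string
        }
        $result[] = $attrs;
    }
    return $result;
}

$rss = simplexml_load_string($xml);
$media = mediaContentAttributes($rss->channel->item[0]);
print_r($media);
```

Passing the namespace URI directly avoids depending on whichever prefix the feed happens to use, and casting each attribute to a string leaves an ordinary array that no longer depends on SimpleXML.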

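I suspect the problem lies in the JSON trick. SimpleXML generates all of its classes and properties dynamically (they are not regular PHP classes), which means you cannot fully rely on standard PHP functions such as print_r() or json_encode(). Inserting the following into the loop above illustrates this: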

var_dump($i, json_encode($i), (string)$i->attributes()->url);
object(SimpleXMLElement)#2 (0) {
}
string(2) "{}"
string(91) "http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-super-169.jpg"
...
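
If the rest of the code has to keep working with the JSON-derived array, one possible approach is to read the attributes from the SimpleXML object and add them to that array by hand. The sketch below is a minimal illustration, not a definitive implementation: the inline feed and its example.com URLs are placeholders, the 'media_urls' key and the $mediaNs variable are hypothetical names, and it assumes the feed has at least two items (so that json_decode() yields a list under 'item') and that every item contains a <media:group>.

```php
<?php
// Trimmed stand-in for the feed in the question; example.com URLs are placeholders.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title><![CDATA[First]]></title>
      <media:group>
        <media:content medium="image" url="http://example.com/a.jpg" height="619" width="1100" />
        <media:content medium="image" url="http://example.com/b.jpg" height="300" width="300" />
      </media:group>
    </item>
    <item>
      <title><![CDATA[Second]]></title>
      <media:group>
        <media:content medium="image" url="http://example.com/c.jpg" height="250" width="250" />
      </media:group>
    </item>
  </channel>
</rss>
XML;

$mediaNs = 'http://search.yahoo.com/mrss/'; // URI bound to the media: prefix

$rss = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);

// The JSON trick from the question: the namespaced media:* elements are dropped here.
$rssarray = json_decode(json_encode($rss), true);

// Walk the SimpleXML items in document order and attach the URLs to the array.
$index = 0;
foreach ($rss->channel->item as $item) {
    $urls = [];
    foreach ($item->children($mediaNs)->group->content as $content) {
        // Read the attribute from SimpleXML and cast it to a plain string.
        $urls[] = (string) $content->attributes()->url;
    }
    // 'media_urls' is a hypothetical key chosen for this sketch.
    $rssarray['channel']['item'][$index]['media_urls'] = $urls;
    $index++;
}

print_r($rssarray['channel']['item']);
```

A feed with a single <item> would decode to an associative array rather than a list under 'item', so real code would need to normalise that case before indexing by position.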

Source: the original question on Stack Overflow, "How to parse the <media:content> tag in RSS with SimpleXML": https://stackoverflow.com/questions/41595926/
