python - Processing XML files with Hadoop using Python

I am processing XML files with Python on Hadoop. My XML files have the following format:

temporary.xml

<report>
<report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/>
<date-range date="All Time"/>
  <table>
    <columns>
       <column name="campaignID" display="Campaign ID"/>
       <column name="adGroupID" display="Ad group ID"/>
       <column name="keywordID" display="Keyword ID"/>
       <column name="keyword" display="Keyword"/>
    </columns>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
  </table>
</report>

What I want to do now is process the XML file above and save the data to an MSSQL database.

mapper.py code

import sys
import cStringIO
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("<row>") != -1:
            intext = True
            buff = cStringIO.StringIO()
            buff.write(line)
        elif line.find("</>") != -1:
            intext = False
            buff.write(line)
            val = buff.getvalue()
            buff.close()
            buff = None
            print val

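Here, all I want to do is get the data from the row tags (that is, the values of campaignID, adGroupID, keywordID and keyword) and print it, so that it becomes the input to reducer.py (which contains the code that saves the data to the database).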

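I have looked at some examples, but their tags look like <tag> </tag>, whereas in my case I only have <row/>.
My code above does not work and prints nothing. Can anyone correct my code and add the necessary Python code to get the values from the row tags (I am very, very new to Hadoop), so that I can extend the code later?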

Best answer

Have you considered using XPath? It is a mini-language for navigating an XML tree, and it is easy to use from Python.

http://docs.python.org/2/library/xml.etree.elementtree.html may be useful to you.

You may also want to look at the question "Need Help using XPath in ElementTree".
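As a small illustration of what XPath looks like in ElementTree, here is a minimal sketch. The `SAMPLE` document is a shortened, made-up version of the report above, and the variable names are only for illustration. It selects rows by path and by attribute value, which works for self-closing `<row .../>` elements just as it does for `<tag> </tag>` pairs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the report from the question.
SAMPLE = """<report>
  <table>
    <row campaignID="1" keyword="Content"/>
    <row campaignID="2" keyword="Search"/>
  </table>
</report>"""

root = ET.fromstring(SAMPLE)

# Path relative to the root element: every <row> directly under a <table>.
all_rows = root.findall("table/row")

# Attribute predicate: only rows whose keyword attribute equals "Content".
content_rows = root.findall(".//row[@keyword='Content']")

# Attributes of a self-closing element are read with .get() or .attrib.
campaign_ids = [row.get("campaignID") for row in all_rows]
```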

Here is how I would do it (this is working Python code; I tested it in Python 3.2, and it works well with your example XML):

import xml.etree.ElementTree as xml #you had this line in your code. I am not using any tool you  do not have access to in your script

def get_row_attributes(the_xml_as_a_string):
    """
    this function takes xml as a string. 
    It can work with xml that looks like your included example xml.
    This function returns a list of dictionaries. Each dictionary is made up of the attributes of each row. So the result looks like:
     [
          {attribute_name:value_for_first_row,attribute_name:value_for_first_row...},
          {attribute_name:value_for_second_row,attribute_name:value_for_second_row...},
          etc
     ]
    """
    tree = xml.fromstring(the_xml_as_a_string)
    rows = tree.findall('table/row')  # 'table/row' is xpath. it means get all the rows in all the tables
    return [row.attrib for row in rows]

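To use this function, read stdin, build up a single string from it, and call get_row_attributes(the_xml_as_a_string).
The resulting dictionaries contain the information you asked for (the attributes of each row).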
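The step of reading the input and calling the function could look roughly like the sketch below. `rows_from_stream` is a hypothetical helper name, and an `io.StringIO` object stands in for stdin so that the example runs on its own. In a real mapper you would pass `sys.stdin` instead, assuming the whole document reaches a single mapper.

```python
import io
import xml.etree.ElementTree as ET

# Shortened sample of the report from the question, used as stand-in input.
SAMPLE = """<report>
  <table>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
  </table>
</report>"""


def get_row_attributes(the_xml_as_a_string):
    """Return one attribute dictionary per <row> element."""
    tree = ET.fromstring(the_xml_as_a_string)
    return [row.attrib for row in tree.findall("table/row")]


def rows_from_stream(stream):
    """Hypothetical helper: read a complete XML document from a file-like object."""
    return get_row_attributes(stream.read())


# In a real mapper this would be rows_from_stream(sys.stdin).
rows = rows_from_stream(io.StringIO(SAMPLE))
```

Reading everything with `stream.read()` keeps the code short, but it holds the whole document in memory, so for very large reports a streaming parser may be a better fit.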

So now we have:

- read the content from stdin
- obtained all the information about all the rows
- done it all with perfectly ordinary Python

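The last thing to do is write the result out to the other process. If you need help with that part, please say what format the data should be in and where it is going.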
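Since the output format is not specified, the following is only a sketch under one assumption: that reducer.py accepts one tab-separated line per row, with the fields in a fixed order. Tab is the default key/value separator in Hadoop Streaming, so the first field (campaignID here) would be treated as the key. `COLUMNS`, `format_row` and `emit_rows` are hypothetical names.

```python
import io
import sys

# Assumed field order; it mirrors the <column> list in the sample report.
COLUMNS = ["campaignID", "adGroupID", "keywordID", "keyword"]


def format_row(attrs, columns=COLUMNS):
    """Join the chosen attributes with tabs; missing attributes become empty strings."""
    return "\t".join(attrs.get(name, "") for name in columns)


def emit_rows(rows, out=sys.stdout):
    """Write one tab-separated line per row to the given stream."""
    for attrs in rows:
        out.write(format_row(attrs) + "\n")


# Demonstration with an in-memory buffer instead of stdout.
sample_rows = [
    {"campaignID": "79057390", "adGroupID": "3451305670",
     "keywordID": "3000000", "keyword": "Content"},
]
buf = io.StringIO()
emit_rows(sample_rows, out=buf)
```

On the reducer side, each line could then be split on "\t" to recover the four values before inserting them into the database. Values that themselves contain tabs or newlines would need escaping, which this sketch does not handle.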

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/13267925/
