java - How to process XML files in a Hadoop Mapper

Tags: java xml hadoop

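I have large XML files in the format shown below. I can read them line by line and do some string manipulation, since I only need to extract the values of a few fields. But in general, how should files in this format be processed? I found the Mahout XML parser, but I don't think it suits the format below.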

<?xml version="1.0" encoding="utf-8"?>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="13" CreationDate="2010-09-13T19:16:26.763" Score="155" ViewCount="160162" Body="&lt;p&gt;This is a common question by those who have just rooted their phones.  What apps, ROMs, benefits, etc. do I get from rooting?  What should I be doing now?&lt;/p&gt;&#xA;" OwnerUserId="10" LastEditorUserId="16575" LastEditDate="2013-04-05T15:50:48.133" LastActivityDate="2013-09-03T05:57:21.440" Title="I've rooted my phone.  Now what?  What do I gain from rooting?" Tags="&lt;rooting&gt;&lt;root&gt;" AnswerCount="2" CommentCount="0" FavoriteCount="107" CommunityOwnedDate="2011-01-25T08:44:10.820" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="4" CreationDate="2010-09-13T19:17:17.917" Score="10" ViewCount="966" Body="&lt;p&gt;I have a Google Nexus One with Android 2.2. I didn't like the default SMS-application so I installed Handcent-SMS. Now when I get an SMS, I get notified twice. How can I fix this?&lt;/p&gt;&#xA;" OwnerUserId="7" LastEditorUserId="981" LastEditDate="2011-11-01T18:30:32.300" LastActivityDate="2011-11-01T18:30:32.300" Title="I installed another SMS application, now I get notified twice" Tags="&lt;2.2-froyo&gt;&lt;sms&gt;&lt;notifications&gt;&lt;handcent-sms&gt;" AnswerCount="3" FavoriteCount="2" />


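The data you posted comes from the SO data dump (I know because I'm currently working with it on Hadoop). Below is the mapper I wrote to turn it into a tab-delimited file.

Essentially, you read the file line by line and use the JAXP API to parse each line and extract the information you need. This works because every <row ... /> element in the dump sits on a single line, so each input line that reaches the mapper is a complete, well-formed XML fragment on its own.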

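import java.io.IOException;
import java.io.StringReader;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import com.google.common.base.Joiner;

public class StackoverflowDataWranglerMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text outputKey = new Text();
    private final Text outputValue = new Text();

    private final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    private DocumentBuilder builder;
    private static final Joiner TAG_JOINER = Joiner.on(",").skipNulls();
    // Example timestamp: 2008-07-31T21:42:52.667
    private static final DateFormat DATE_PARSER = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
    private static final SimpleDateFormat DATE_BUILDER = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        try {
            builder = factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            // Surface parser configuration problems as a task failure.
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable inputKey, Text inputValue, Mapper<LongWritable, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        try {
            String entry = inputValue.toString();
            // Only <row .../> lines carry data; the XML declaration and any wrapper tags are skipped.
            if (entry.contains("<row ")) {
                Document doc = builder.parse(new InputSource(new StringReader(entry)));
                Element rootElem = doc.getDocumentElement();

                // getAttribute returns an empty string for attributes that are absent.
                String id = rootElem.getAttribute("Id");
                String postedBy = rootElem.getAttribute("OwnerUserId").trim();
                String viewCount = rootElem.getAttribute("ViewCount");
                String postTypeId = rootElem.getAttribute("PostTypeId");
                String score = rootElem.getAttribute("Score");
                String title = rootElem.getAttribute("Title");
                String tags = rootElem.getAttribute("Tags");
                String answerCount = rootElem.getAttribute("AnswerCount");
                String commentCount = rootElem.getAttribute("CommentCount");
                String favoriteCount = rootElem.getAttribute("FavoriteCount");
                String creationDate = rootElem.getAttribute("CreationDate");

                Date parsedDate = null;
                if (creationDate != null && creationDate.trim().length() > 0) {
                    try {
                        parsedDate = DATE_PARSER.parse(creationDate);
                    } catch (ParseException e) {
                        context.getCounter("Bad Record Counters", "Posts missing CreationDate").increment(1);
                    }
                }

                if (postedBy.length() == 0 || postedBy.trim().equals("-1")) {
                    context.getCounter("Bad Record Counters", "Posts with either empty UserId or UserId contains '-1'")
                            .increment(1);
                    try {
                        // Sentinel date; lenient parsing reads month 00 as December 2099.
                        parsedDate = DATE_BUILDER.parse("2100-00-01");
                    } catch (ParseException e) {
                        // ignore
                    }
                }

                tags = tags.trim();
                String tagTokens[] = null;

                if (tags.length() > 1) {
                    // "<a><b><c>" becomes ["a", "b", "c"]
                    tagTokens = tags.substring(1, tags.length() - 1).split("><");
                } else {
                    context.getCounter("Bad Record Counters", "Untagged Posts").increment(1);
                }

                // The source listing is cut off from here on; the output key and the
                // column order below are assumptions based on the fields extracted above.
                outputKey.set(id);

                StringBuilder sb = new StringBuilder(postedBy).append("\t")
                        .append(parsedDate == null ? "" : String.valueOf(parsedDate.getTime())).append("\t")
                        .append(postTypeId).append("\t").append(title).append("\t")
                        .append(viewCount).append("\t").append(score).append("\t");

                if (tagTokens != null) {
                    sb.append(TAG_JOINER.join(tagTokens));
                }
                sb.append("\t").append(answerCount).append("\t").append(commentCount)
                        .append("\t").append(favoriteCount);

                outputValue.set(sb.toString());
                context.write(outputKey, outputValue);
            }
        } catch (SAXException e) {
            context.getCounter("Bad Record Counters", "Unparsable records").increment(1);
        }
    }
}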
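To try the per-line parsing idea without a Hadoop cluster, here is a minimal standalone sketch that uses only the JDK's JAXP classes. The class name RowLineParser and its two helper methods are hypothetical names made up for this illustration; they are not part of the mapper above or of any library. It assumes, as the mapper does, that each line holds one complete <row ... /> element.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
public class RowLineParser {

    /** Parses one self-closing <row .../> line and returns its attributes as a map. */
    public static Map<String, String> parseRow(String line) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element row = builder.parse(new InputSource(new StringReader(line.trim())))
                    .getDocumentElement();
            NamedNodeMap attrs = row.getAttributes();
            Map<String, String> result = new HashMap<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                // Entity references such as &lt; are already decoded by the parser.
                result.put(attr.getNodeName(), attr.getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed <row> element: " + line, e);
        }
    }

    /** Splits a decoded Tags value such as "<rooting><root>" into individual tag names. */
    public static List<String> splitTags(String tags) {
        String trimmed = tags == null ? "" : tags.trim();
        if (trimmed.length() <= 1) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.substring(1, trimmed.length() - 1).split("><"));
    }

    public static void main(String[] args) {
        String line = "  <row Id=\"1\" PostTypeId=\"1\" Score=\"155\" "
                + "Tags=\"&lt;rooting&gt;&lt;root&gt;\" />";
        Map<String, String> row = parseRow(line);
        // Emit a tab-delimited record: Id, Score, comma-joined tags.
        System.out.println(row.get("Id") + "\t" + row.get("Score") + "\t"
                + String.join(",", splitTags(row.get("Tags"))));
    }
}
```

Creating a new DocumentBuilder on every call keeps the sketch simple; in a mapper it is cheaper to build it once in setup and reuse it, as the code above does.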
