hadoop - Cross apply SQL Server query on Hive

Tags: hadoop hive hortonworks-data-platform hive-udf

HDP-2.5.0.0 using Ambari 2.4.0.1

The Hive table ReportSetting is as follows:

id int

serializedreportsetting String

The "serializedreportsetting" column is an XML data type in the source SQL Server database, but it was converted to a string during the Sqoop import. This is what it looks like in SQL Server:

<ReportSettings4 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Trigger>
    <Manual>true</Manual>
  </Trigger>
  <StartTime>
    <Year>8</Year>
    <Month>1</Month>
    <Day>1</Day>
    <Hour>0</Hour>
    <Minute>0</Minute>
  </StartTime>
  <ReportPeriod>
    <Month>0</Month>
    <Day>0</Day>
    <Hour>0</Hour>
    <Minute>5</Minute>
  </ReportPeriod>
  <Theft>
    <DigitalInput>true</DigitalInput>
    <Can>false</Can>
  </Theft>
  <SequenceNo>0</SequenceNo>
</ReportSettings4>

In the Hive table:

<ReportSettings4 xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><Trigger><Manual>true</Manual></Trigger><StartTime><Year>8</Year><Month>12</Month><Day>31</Day><Hour>23</Hour><Minute>34</Minute></StartTime><ReportPeriod><Month>0</Month><Day>0</Day><Hour>4</Hour><Minute>0</Minute></ReportPeriod><Theft><DigitalInput>false</DigitalInput><Can>false</Can></Theft><SequenceNo>3</SequenceNo></ReportSettings4>

The query that works fine on SQL Server:

SELECT
  r.VehicleId
  ,rs.value('(Trigger/Manual)[1]', 'bit') AS RS_Trigger_Manual
  ,CAST(CONCAT(
      CASE WHEN rs.value('(StartTime/Year)[1]', 'int') < 10
           THEN CONCAT('200', rs.value('(StartTime/Year)[1]', 'int'))
           ELSE CONCAT('20', rs.value('(StartTime/Year)[1]', 'int')) END,
      '-', rs.value('(StartTime/Month)[1]', 'int'),
      '-', rs.value('(StartTime/Day)[1]', 'int'),
      ' ', rs.value('(StartTime/Hour)[1]', 'int'),
      ':', rs.value('(StartTime/Minute)[1]', 'int'),
      ':', '00.000') AS datetime) AS RS_StartTime
  ,rs.value('(ReportPeriod/Month)[1]', 'int') AS RS_ReportPeriod_Month
  ,rs.value('(ReportPeriod/Day)[1]', 'int') AS RS_ReportPeriod_Day
  ,rs.value('(ReportPeriod/Hour)[1]', 'int') AS RS_ReportPeriod_Hour
  ,rs.value('(ReportPeriod/Minute)[1]', 'int') AS RS_ReportPeriod_Minute
  ,rs.value('(Theft/DigitalInput)[1]', 'bit') AS RS_Theft_DigitalInput
  ,rs.value('(Theft/Can)[1]', 'bit') AS RS_Theft_Can
  ,rs.value('(SequenceNo)[1]', 'int') AS RS_SequenceNo
FROM ReportSetting r
  CROSS APPLY SerializedReportSetting.nodes('/*') AS ReportSettings(rs)

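Here is what I could think of / have tried:

  1. To replace CROSS APPLY, I guess a lateral view is needed. Here serializedreportsetting is not an array, so explode() won't work. Can someone confirm whether I am thinking in the right direction?
  2. I simply tried to get the data in serializedreportsetting as columns using the built-in xpath UDF. However, I didn't get any records. A few of my attempts: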

    select xpath(SerializedReportSetting,'/*') from ReportSetting limit 1;

    select xpath(SerializedReportSetting,'/ReportSettings4') from ReportSetting limit 1;

    select xpath(SerializedReportSetting,'/Trigger/Manual') from ReportSetting limit 1;

************UPDATE-1************

I used regexp_replace to address the above challenge:

SELECT id,
  xpath_string(SerializedReportSetting,'/ReportSettings/Trigger/Manual')        AS RS_Trigger_Manual,
  xpath_string(SerializedReportSetting,'/ReportSettings/Trigger/DriveChange')   AS RS_Trigger_DriveChange
FROM
  (SELECT id,
    regexp_replace(SerializedReportSetting, 'ReportSettings+\\d','ReportSettings') AS SerializedReportSetting
  FROM reportsetting
  WHERE id IN (1701548,3185,1700231,1700232)
  ) reportsetting_regex;
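
The pattern can be checked in isolation on a made-up literal before running it over the table. This is only a sketch; the XML literal is invented for illustration. Note that \\d matches exactly one digit, so a hypothetical root name such as ReportSettings10 would be left with a trailing 0.

```sql
-- regexp_replace rewrites every match, so both the opening and the
-- closing root tag lose their version digit.
SELECT regexp_replace(
         '<ReportSettings4><SequenceNo>3</SequenceNo></ReportSettings4>',
         'ReportSettings+\\d',
         'ReportSettings');
-- expected: <ReportSettings><SequenceNo>3</SequenceNo></ReportSettings>
```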

Best answer

For xpath, the documentation explicitly says:

The xpath() function always returns a hive array of strings. If the expression results in a non-text value (e.g., another xml node) the function will return an empty array

So you can use:

select xpath(SerializedReportSetting,'/ReportSettings4/Trigger/Manual/text()') from ReportSetting limit 1;

Or, a better option is to use xpath_boolean/xpath_int:
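
To make the difference concrete, here is a minimal sketch that runs against an inline XML literal instead of the ReportSetting table. The literal is made up for illustration, and the empty-array behavior is the one described in the documentation quote above.

```sql
-- The path selects the <Manual> element node itself, which is a
-- non-text value, so an empty array is expected.
SELECT xpath('<ReportSettings4><Trigger><Manual>true</Manual></Trigger></ReportSettings4>',
             '/ReportSettings4/Trigger/Manual');

-- Appending text() selects the text node, so the array holds its value.
SELECT xpath('<ReportSettings4><Trigger><Manual>true</Manual></Trigger></ReportSettings4>',
             '/ReportSettings4/Trigger/Manual/text()');

-- xpath_string returns the element's text as a plain string, not an array.
SELECT xpath_string('<ReportSettings4><Trigger><Manual>true</Manual></Trigger></ReportSettings4>',
                    '/ReportSettings4/Trigger/Manual');
```

The question's third attempt, '/Trigger/Manual', has a second problem: the path is absolute but skips the root element, so it matches nothing even with text() appended.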

xpath_boolean - Returns true if the XPath expression evaluates to true, or if a matching node is found.

select xpath_boolean(SerializedReportSetting,'/ReportSettings4/Trigger/Manual') from ReportSetting limit 1;
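
One caveat follows from the "or if a matching node is found" part of that description: a bare node path evaluates to true whenever the element exists, whatever its text. The sketch below uses a made-up inline literal; comparing against the string inside the XPath expression is one way to read the actual flag value, assuming the flags are stored as the literal text true/false.

```sql
-- The <Manual> node exists, so this is true even though its text is "false".
SELECT xpath_boolean('<Trigger><Manual>false</Manual></Trigger>',
                     '/Trigger/Manual');

-- Comparing the node's text inside the expression gives the stored value:
-- false here, because "false" is not equal to "true".
SELECT xpath_boolean('<Trigger><Manual>false</Manual></Trigger>',
                     '/Trigger/Manual = "true"');
```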

xpath_short, xpath_int, xpath_long These functions return an integer numeric value, or the value zero if no match is found, or a match is found but the value is non-numeric. Mathematical operations are supported. In cases where the value overflows the return type, then the maximum value for the type is returned.

select xpath_int(SerializedReportSetting,'/ReportSettings4/ReportPeriod/Month') from ReportSetting limit 1;
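
Putting these functions together, the SQL Server query might be translated roughly as follows. Because each row holds a single root element, nodes('/*') yields exactly one row per report setting, so a LATERAL VIEW with explode() should not be needed. This is a sketch under several assumptions rather than a tested port: it selects id because the Hive table shown has no VehicleId column, it reads the flags as the literal text true/false, it computes the year as 2000 plus the stored value (equivalent to the original CASE for values 0 to 99), and it uses the '/*' wildcard for the root element so that the root name does not matter.

```sql
SELECT
  id,
  -- Compare the text so that <Manual>false</Manual> is not reported as true.
  xpath_boolean(serializedreportsetting, '/*/Trigger/Manual = "true"')      AS rs_trigger_manual,
  -- Build 'yyyy-MM-dd HH:mm:ss'; lpad zero-pads the single-digit parts.
  CAST(concat(
         CAST(2000 + xpath_int(serializedreportsetting, '/*/StartTime/Year') AS STRING), '-',
         lpad(xpath_string(serializedreportsetting, '/*/StartTime/Month'), 2, '0'), '-',
         lpad(xpath_string(serializedreportsetting, '/*/StartTime/Day'), 2, '0'), ' ',
         lpad(xpath_string(serializedreportsetting, '/*/StartTime/Hour'), 2, '0'), ':',
         lpad(xpath_string(serializedreportsetting, '/*/StartTime/Minute'), 2, '0'), ':00'
       ) AS TIMESTAMP)                                                       AS rs_starttime,
  xpath_int(serializedreportsetting, '/*/ReportPeriod/Month')               AS rs_reportperiod_month,
  xpath_int(serializedreportsetting, '/*/ReportPeriod/Day')                 AS rs_reportperiod_day,
  xpath_int(serializedreportsetting, '/*/ReportPeriod/Hour')                AS rs_reportperiod_hour,
  xpath_int(serializedreportsetting, '/*/ReportPeriod/Minute')              AS rs_reportperiod_minute,
  xpath_boolean(serializedreportsetting, '/*/Theft/DigitalInput = "true"')  AS rs_theft_digitalinput,
  xpath_boolean(serializedreportsetting, '/*/Theft/Can = "true"')           AS rs_theft_can,
  xpath_int(serializedreportsetting, '/*/SequenceNo')                       AS rs_sequenceno
FROM reportsetting;
```

Keep in mind that, per the description quoted above, xpath_int returns zero both for a stored 0 and for a missing or non-numeric node, so the two cases cannot be told apart in the output.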

Regarding "hadoop - Cross apply SQL Server query on Hive", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40398868/
