Hadoop MapReduce functionality

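Suppose I want to run a SELECT query with an "Order By" clause, and my data is distributed across multiple machines. How does MapReduce fetch the data, and where does it perform the "Order By"?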

Best answer

MapReduce can be used to implement a distributed "Order By".

... One of Yahoo’s Hadoop clusters sorted 1 terabyte of data in 209 seconds ... The sort used 1800 maps and 1800 reduces ...

Apache Hadoop Wins Terabyte Sort Benchmark

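This can be done by mapping the sort keys to ranges by value: each reducer receives one contiguous range of keys and sorts it locally, so concatenating the reducer outputs in range order gives a total order.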
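The range-partitioning idea can be sketched in plain Python, without a cluster. This is only an illustration of the technique, not Hadoop code: the function names (`sample_split_points`, `range_partition`, `distributed_order_by`), the sampling strategy, and the in-memory "reducers" are all assumptions made for the example. In Hadoop itself, the `TotalOrderPartitioner` class plays the role of `range_partition` here.

```python
import bisect
import random


def sample_split_points(keys, num_reducers, sample_size=100, seed=0):
    """Pick up to num_reducers - 1 range boundaries from a random sample of keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # Evenly spaced quantiles of the sample become the range boundaries.
    return [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]


def range_partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


def distributed_order_by(keys, num_reducers=4):
    """Simulate a range-partitioned sort across `num_reducers` reducers."""
    split_points = sample_split_points(keys, num_reducers)

    # "Map" and shuffle: route each key to the reducer that owns its range.
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[range_partition(key, split_points)].append(key)

    # "Reduce": each reducer sorts only its own range, independently.
    sorted_buckets = [sorted(bucket) for bucket in buckets]

    # Every key in bucket i is smaller than every key in bucket i + 1,
    # so concatenating the outputs in bucket order yields a total order.
    return [key for bucket in sorted_buckets for key in bucket]


if __name__ == "__main__":
    data = random.Random(42).sample(range(10_000), 1_000)
    result = distributed_order_by(data, num_reducers=4)
    print(result == sorted(data))
```

Sampling matters because the split points decide how evenly the work is spread: badly chosen boundaries still give a correct order, but one reducer may end up with most of the data.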

However, Hive implements "Order By" with a single reducer.

... in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish...

Hive - LanguageManual - Sort By - Syntax of Order By

Source: a similar question about Hadoop MapReduce functionality on Stack Overflow: https://stackoverflow.com/questions/42924592/
