postgresql - What is the best way to integrate SAS with Hadoop without losing Hadoop's parallel processing capability?

Tags: postgresql hadoop sas apache-hive hawq

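I am trying to understand the integration between SAS and Hadoop. As I understand it, a SAS procedure such as proc sql works only on SAS data sets, so I cannot issue a proc sql against a text file on the Hadoop nodes. Is that correct?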

If so, I would need some ETL job to first pull the data out of HDFS and convert it into SAS tables. But if I do that, I lose Hadoop's parallel processing capability, right?

So what is the ideal way to integrate SAS and Hadoop while still using Hadoop's parallel processing capability?

I understand that you can call a MapReduce job from within SAS, but can a MapReduce job be written in SAS? I don't think so.

Best answer

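One of the major thrusts of SAS Global Forum 2015 was actually the new options for connecting to Hadoop and Teradata. FEDSQL and DS2, both new in SAS 9.4, exist in part to let SAS work better with Hadoop. You can execute code directly on the Hadoop nodes, as well as do more efficient processing in SAS itself.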
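As a rough illustration of running code on the Hadoop nodes, here is a minimal DS2 sketch. It assumes the SAS In-Database Code Accelerator for Hadoop is licensed and deployed and that a Hive table is reachable through a SAS/ACCESS Interface to Hadoop libref. The server name, credentials, libref `hdp`, table `sales`, and column `amount` are all hypothetical placeholders, not details from the question.

```sas
/* Hypothetical connection details; replace with your own cluster settings. */
libname hdp hadoop server="hive.example.com" port=10000 schema=default
        user=myuser password=mypassword;

/* DS2ACCEL=YES asks SAS to run the DS2 code inside the cluster
   when the Code Accelerator is available. */
proc ds2 ds2accel=yes;
  thread score_th / overwrite=yes;
    dcl double amount_with_tax;
    method run();
      set hdp.sales;                    /* hypothetical Hive table */
      amount_with_tax = amount * 1.2;   /* row-level work, done where the data lives */
    end;
  endthread;

  /* The data program reads from the thread and writes back to Hadoop. */
  data hdp.sales_scored (overwrite=yes);
    dcl thread score_th t;
    method run();
      set from t;
    end;
  enddata;
run;
quit;
```

Whether the program actually runs in the cluster depends on the installation. If the accelerator is not available, the same code would be expected to run in the SAS session instead, pulling the rows across the network.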

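Assuming you have the most recent release of SAS (9.4 TS1M3), you can look at the SAS Release Notes (current as of September 3, 2015; in the future this will point to later versions). They include information such as the following: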

In the second maintenance release for SAS 9.4, the SAS In-Database Code Accelerator for Hadoop runs the DS2 data program as well as the thread program inside the database. Several new functions have been added. The HTTP package enables you to construct an HTTP client to access web services and a new logger enables logging of HTTP traffic. A connection string parameter is available when instantiating an SQLSTMT package.

SAS FedSQL is a SAS proprietary implementation of the ANSI SQL:1999 core standard. It provides support for new data types and other ANSI 1999 core compliance features and proprietary extensions. FedSQL provides data access technology that brings a scalable, threaded, high-performance way to access, manage, and share relational data in multiple data sources. FedSQL is a vendor-neutral SQL dialect that accesses data from various data sources without submitting queries in the SQL dialect that is specific to the data source. In addition, a single FedSQL query can target data in several data sources and return a single result table. The FEDSQL procedure enables you to submit FedSQL language statements from a Base SAS session. The first maintenance release for SAS 9.4 adds support for Memory Data Store (MDS), SAP HANA, and SASHDAT data sources.

In the second maintenance release for SAS 9.4, SAS FedSQL supports Hive, HDMD, and PostgreSQL data sources. Data types can be converted to another data type. You can add DBMS-specific clauses to the end of the CREATE INDEX statement, and you can write a SASHDAT file in compressed format.

In the third maintenance release of SAS 9.4, FedSQL has added support for HAWQ and Impala distributions of Hadoop, enhanced support for Impala, new data types, and more.

Hadoop Support

The first maintenance release for SAS 9.4 enables you to use the SPD Engine to read, write, and update data in a Hadoop cluster through the HDFS. In addition, you can now use the HADOOP procedure to submit configuration properties to the Hadoop server.

In the second maintenance release for SAS 9.4, performance has been improved for the SPD Engine access to Hadoop. The SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS is available from the support.sas.com third-party site for Hadoop.

In the third maintenance release of SAS 9.4, access to data stored in HDFS is enhanced with a new distributed lock manager and therefore easier access to Hadoop clusters using Hadoop configuration files.
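To make the FedSQL notes above concrete, here is a minimal sketch of querying a Hive table from a Base SAS session with PROC FEDSQL. It assumes SAS 9.4 at the second maintenance release or later with SAS/ACCESS Interface to Hadoop. The server, credentials, libref `hdp`, table `sales`, and columns `region` and `amount` are hypothetical.

```sas
/* Hypothetical connection details; replace with your own cluster settings. */
libname hdp hadoop server="hive.example.com" port=10000 schema=default
        user=myuser password=mypassword;

/* PROC FEDSQL sees librefs assigned in the session,
   so the Hive table can be referenced as hdp.sales. */
proc fedsql;
  create table work.region_totals as
    select region, sum(amount) as total_amount
    from hdp.sales
    group by region;
quit;
```

The intent is that the aggregation is passed to Hive where possible, so that only the summarized rows travel back to SAS. Whether that pushdown happens depends on the SAS release and configuration, so it is worth checking the log before relying on it.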

Beyond that, there is a lot of documentation and many papers on the topic; the documentation for the SAS Connector for Hadoop is one example.

Source: Stack Overflow question https://stackoverflow.com/questions/32383123/
