azure - How does OPENROWSET work internally when loading data?

I was going through the Azure documentation and came across the following passage:

OPENROWSET function in Synapse SQL reads the content of the file(s) from a data source. The data source is an Azure storage account and it can be explicitly referenced in the OPENROWSET function or can be dynamically inferred from URL of the files that you want to read.

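Where is the data loaded and processed: is it in memory? Does it load the data in blocks, the way Spark does?
Also, OPENROWSET seems to be supported in the serverless SQL pool but not in the dedicated SQL pool. What could be the rationale for that, given that both pools are backed by MS SQL Server, which itself natively supports OPENROWSET?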

Best answer

OPENROWSET function in Synapse SQL reads the content of the file(s) from a data source. The data source is an Azure storage account and it can be explicitly referenced in the OPENROWSET function or can be dynamically inferred from URL of the files that you want to read.

Where is the data loaded and processed: is it in memory? Does it load the data in chunks, similar to how Spark does?

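The OPENROWSET function described in that passage is only supported in serverless Synapse SQL. The serverless SQL pool has no fixed compute allocation: a control node is the single entry point for queries, and compute is scaled out automatically to match what each query needs. Your data is queried by many small distributed tasks that run on compute nodes, unlike dedicated Synapse SQL, where the compute nodes are provisioned in advance. The Distributed Query Processing engine in serverless SQL splits each SQL query into small tasks and assigns those tasks to compute nodes, which read the data from the storage account. Serverless Spark pools and serverless SQL follow the same principle: compute is scaled up when a query needs to run and scaled back down when it is no longer needed.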

[Image: Synapse SQL architecture diagram. Source: Synapse SQL architecture - Azure Synapse Analytics | Microsoft Learn]

Two mechanisms are used to read and access files in Azure Storage:
OPENROWSET and external tables.

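OPENROWSET returns data stored in Azure Storage as a rowset. It can connect to a remote data source using one of several Azure AD authentication options, and with the BULK option it can read one or more files directly from Azure Storage as a rowset. It is referenced in the FROM clause of a query as if it were a table.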

External tables are used to read data that resides in Hadoop, Azure Storage, Azure Blob Storage, or Data Lake Storage.
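To make the contrast with OPENROWSET concrete, here is a minimal sketch of an external table over Parquet files in a serverless SQL pool. The object names (DemoStorage, DemoParquetFormat, dbo.DemoSales), the column list, and the storage URL placeholders are all hypothetical. The sketch also assumes it runs in a user database (not master) and that the caller's Azure AD identity already has read access to the storage account.

```sql
-- Hypothetical data source pointing at a container; replace the placeholders.
CREATE EXTERNAL DATA SOURCE DemoStorage
WITH ( LOCATION = 'https://<storage-account>.dfs.core.windows.net/<container>' );

-- File format object that the external table refers to.
CREATE EXTERNAL FILE FORMAT DemoParquetFormat
WITH ( FORMAT_TYPE = PARQUET );

-- The table stores only metadata; the rows stay in the Parquet files.
CREATE EXTERNAL TABLE dbo.DemoSales (
    sale_id INT,
    amount  DECIMAL(10, 2)
)
WITH (
    LOCATION    = 'sales/*.parquet',   -- path relative to the data source
    DATA_SOURCE = DemoStorage,
    FILE_FORMAT = DemoParquetFormat
);

-- Queried like any other table.
SELECT TOP 10 * FROM dbo.DemoSales;
```

An external table fixes the schema and location once, so it suits files that are queried repeatedly. OPENROWSET suits ad hoc exploration where no persistent object is wanted.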

Also, it seems OPENROWSET is supported with the serverless SQL pool and
not with the dedicated SQL pool. What could have been the
rationale for that, given that both pools are backed by MS SQL
Server, which natively supports OPENROWSET?

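To connect to a data source that is not referenced frequently, native SQL Server uses OPENROWSET or OPENDATASOURCE with the connection information specified inline, as a one-off alternative to accessing a linked server. The resulting rowset is then referenced in a Transact-SQL statement as if it were a table.
Currently, Azure Synapse dedicated SQL pools do not support the OPENROWSET function.
See: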
https://learn.microsoft.com/en-us/sql/t-sql/functions/openrowset-transact-sql?view=sql-server-ver16

OPENROWSET() for Synapse dedicated pools? by [Stefan Azarić]

Syntax:

    OPENROWSET
    ({ BULK 'unstructured_data_path' , [DATA_SOURCE = <data source name>, ]
       FORMAT = ['PARQUET' | 'DELTA'] }
    )
    [WITH ( {'column_name' 'column_type' }) ]
    [AS] table_alias(column_alias, ...n)

OPENROWSET is used in the FROM clause with the BULK option. The data source is an Azure storage account, and the supported file formats are CSV, Parquet, Delta, and JSON.


SELECT *
FROM OPENROWSET(
   BULK '<storagefile-url>',
   FORMAT = '<format-of-file>',
   PARSER_VERSION = '2.0',  -- applies to CSV files only
   HEADER_ROW = TRUE        -- CSV only; needs PARSER_VERSION = '2.0'
) AS rowsFromFile
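The quoted documentation says the storage account can either be inferred from the file URL or referenced explicitly. The sketch below shows both variants side by side. The data source name DemoLake, the folder layout, and the URL placeholders are hypothetical, and the second variant assumes it runs in a user database where creating an external data source is allowed.

```sql
-- Variant 1: full URL in BULK; the storage account is inferred from the URL.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS rows_from_url;

-- Variant 2: a named data source holds the root URL (hypothetical name).
CREATE EXTERNAL DATA SOURCE DemoLake
WITH ( LOCATION = 'https://<storage-account>.dfs.core.windows.net/<container>' );

-- BULK now takes a path relative to the data source location.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'sales/*.parquet',
    DATA_SOURCE = 'DemoLake',
    FORMAT = 'PARQUET'
) AS rows_from_source;
```

The named data source keeps the account URL in one place, which helps when many queries read from the same container.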


With the WITH clause:

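SELECT *
FROM OPENROWSET(
   BULK '<storagefile-url>',
   FORMAT = '<format-of-file>',
   PARSER_VERSION = '2.0',  -- applies to CSV files only
   HEADER_ROW = TRUE        -- CSV only; needs PARSER_VERSION = '2.0'
)
WITH
(
   columnname <column-type>  -- one entry per column, comma-separated
) AS output_table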
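As a filled-in illustration of the WITH clause, here is a sketch that reads a CSV file with an explicit schema. The file path, column names, and column types are invented for the example, so adjust them to the actual file.

```sql
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/population/population.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE          -- first row holds the column names
)
WITH (
    -- Hypothetical columns; explicit types avoid relying on schema inference.
    country_code VARCHAR(5),
    country_name VARCHAR(100),
    [year]       SMALLINT,
    population   BIGINT
) AS population_rows;
```

Listing only the needed columns with narrow types tends to reduce the data the query has to read and convert. That matters because serverless SQL is billed by the amount of data processed.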


Because this is based on a serverless architecture, each query is split into small tasks that are run by compute nodes.
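The task breakdown itself is not visible from T-SQL, but the set of files a wildcard query had to read is. The sketch below uses the filename() function on the OPENROWSET alias to list each file and its row count. The path is a placeholder, and nothing here implies a one-to-one mapping between files and tasks.

```sql
-- Rows per source file for a wildcard query over a hypothetical folder.
SELECT
    r.filename() AS file_name,
    COUNT_BIG(*) AS row_count
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS r
GROUP BY r.filename()
ORDER BY file_name;
```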

This question, "azure - How does OPENROWSET work internally when loading data?", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/75180353/
