python - How to save the HTML report generated by spark-df-profiling to Azure Blob Storage?

Tags: python pyspark profiling azure-blob-storage

I am using the spark-df-profiling package to generate a profiling report in Azure Databricks. However, the to_file function of ProfileReport produces an HTML file, and I cannot write that file to Azure Blob Storage.

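Already tried:

A wasb path containing the container and storage account name
Creating an empty HTML file, uploading it to the blob, and writing to that URL
Generating a SAS token for the empty file created above and passing that URL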

profile = spark_df_profiling.ProfileReport(df)
profile.to_file(path)  # path: each of the paths listed under "Already tried"

I want to save the output at the path I provide.

Best answer

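After looking at the source code of julioasotodv/spark-df-profiling at version v1.1.13, I solved this problem with the code below. First, please refer to the official Azure Databricks documentation, Data Sources > Azure Blob Storage and Databricks File System, to see how dbutils writes data to a specified data source (such as Azure Storage).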
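For context, here is a minimal sketch of why the paths tried in the question probably did not work. It assumes, based on my reading of the package source rather than on anything documented, that to_file opens its target with ordinary local Python file I/O. Under that assumption a wasbs:// URL is treated as a local path and cannot be opened. The helper name try_local_write and the container and account names are hypothetical.

```python
import os
import tempfile


def try_local_write(path, text):
    """Return True if a plain local open() can write text to path, else False."""
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        return True
    except OSError:
        # A URL such as wasbs://... is not a local file path, so open() fails.
        return False


html = "<html><body>report</body></html>"

# A real local path can be written with ordinary file I/O.
local_path = os.path.join(tempfile.mkdtemp(), "test.html")
local_ok = try_local_write(local_path, html)

# A blob URL (hypothetical container and account names) cannot.
blob_url = "wasbs://mycontainer@myaccount.blob.core.windows.net/test.html"
blob_ok = try_local_write(blob_url, html)
```

This is why the answer renders the HTML to a string and hands it to dbutils.fs.put, which understands wasbs:// paths, instead of calling to_file.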

Here is my sample code, which works with my Azure Databricks and Azure Storage.

import pandas as pd
import spark_df_profiling
from spark_df_profiling.templates import template

storage_account_name = '<your storage account name>'
storage_account_access_key = '<your storage account key>'
container_name = '<your container name>'

# Let Spark (and dbutils) authenticate against the storage account
spark.conf.set(
  "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
  storage_account_access_key)

# My sample pandas dataframe for testing
d = {'col1': [1, 2], 'col2': [3, 4]}
pd_df = pd.DataFrame(data=d)

df = spark.createDataFrame(pd_df)
profile = spark_df_profiling.ProfileReport(df)

# Render the report to an HTML string and write it to the blob,
# using the same storage account the key was configured for
html = template('wrapper').render(content=profile.html)
dbutils.fs.put("wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net/test.html", html)

I can see that it worked from the result True and the 29806 bytes written to Azure Blob Storage, which I then checked in Azure Storage Explorer.
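Because the storage account name appears both in the Spark configuration key and in the wasbs:// path, it may help to build both strings from the same variables so they cannot drift apart. The following is a minimal sketch. The helper names account_key_conf and wasbs_url and the example values are hypothetical and not part of dbutils or spark-df-profiling.

```python
def account_key_conf(storage_account_name):
    """Build the Spark config key that holds the storage account access key."""
    return "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net"


def wasbs_url(container_name, storage_account_name, blob_path):
    """Build a wasbs:// URL for a blob inside the given container and account."""
    return "wasbs://{}@{}.blob.core.windows.net/{}".format(
        container_name, storage_account_name, blob_path.lstrip("/"))


# Hypothetical example values; replace them with your own names.
conf_key = account_key_conf("myaccount")
target = wasbs_url("mycontainer", "myaccount", "/reports/test.html")
```

In a Databricks notebook, conf_key would be passed to spark.conf.set together with the access key, and target would be the first argument to dbutils.fs.put.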

Hope it helps.

This question, python - How to save the HTML report generated by spark-df-profiling to Azure Blob Storage?, is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/57395324/
