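I am working in a Scala and Spark environment and want to read a Parquet file. Before reading it, I want to check that the path exists. I wrote the following code in a Jupyter notebook, but it does not work: no DataFrame is displayed, because the function testDirExist returns false.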
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
val hadoopfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Returns true only if the path exists and is a directory
def testDirExist(path: String): Boolean = {
  val p = new Path(path)
  hadoopfs.exists(p) && hadoopfs.getFileStatus(p).isDirectory
}
val pt = "abfss://container@account.dfs.core.windows.net/blah/blah/blah"
val exists = testDirExist(pt)
if (exists) {
  val dataframe = spark.read.parquet(pt)
  dataframe.show()
}
However, the following code works and displays the DataFrame:
val k = spark.read.parquet("abfss://container@account.dfs.core.windows.net/blah/blah/blah")
k.show()
Can anyone help me check whether the path exists?
Thanks
Best answer
You just need to set the default file system to your storage account:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter
val conf = new Configuration()
// Make the ADLS Gen2 container the default file system
conf.set("fs.defaultFS", "abfss://<container_name>@<account_name>.dfs.core.windows.net")
// The per-account OAuth settings are keyed by the storage account host name
conf.set("fs.azure.account.auth.type.<account_name>.dfs.core.windows.net", "OAuth")
conf.set("fs.azure.account.oauth.provider.type.<account_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
conf.set("fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net", "<client_id>")
conf.set("fs.azure.account.oauth2.client.secret.<account_name>.dfs.core.windows.net", "<secret>")
conf.set("fs.azure.account.oauth2.client.endpoint.<account_name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant_id>/oauth2/token")
val fs = FileSystem.get(conf)
// Write a small test file to the root of the default file system
val ostream = fs.create(new Path("/abfss_test.out"))
val pwriter = new PrintWriter(ostream)
try {
  pwriter.write("Azure Datalake Gen2 test")
  pwriter.write("\n")
} finally {
  pwriter.close()
}
// check if the file we've just created exists
println(fs.exists(new Path("/abfss_test.out")))
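A possible alternative, offered only as a sketch: FileSystem.get(conf) returns the file system named by fs.defaultFS, which is not necessarily the one an abfss:// URI points to, and that may be why the check in the question fails. Path.getFileSystem(conf) instead resolves the file system from the path's own scheme and authority, so the default does not need to change. The helper name dirExists is hypothetical and not part of the original answer, and the sketch assumes the Hadoop configuration passed in already carries valid credentials for the storage account.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper (not from the original answer).
// Assumes `conf` already holds working credentials for the target account.
def dirExists(path: String, conf: Configuration): Boolean = {
  val p = new Path(path)
  // Resolve the file system from the path's scheme and authority,
  // rather than from fs.defaultFS
  val fs = p.getFileSystem(conf)
  // True only when the path exists and is a directory
  fs.exists(p) && fs.getFileStatus(p).isDirectory
}
```

In the notebook from the question, this could be called as dirExists(pt, spark.sparkContext.hadoopConfiguration), assuming the session's Hadoop configuration is where the ADLS credentials are set.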
Regarding scala - What is the correct way to identify whether a folder exists in an ADLS Gen2 account, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60086978/