I have an explicitly pruned partition directory structure in S3 that causes the following error when I call read.parquet():
Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths
s3a://leftout/for/security/dashboard/updateddate=20170217
s3a://leftout/for/security/dashboard/updateddate=20170218
The (lengthy) error goes on to tell me:
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table.
However, I can't find any documentation on how to do this with SparkR::read.parquet(...). Does anyone know how to do this in R (using SparkR)?
> version
platform x86_64-redhat-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 2.2
year 2015
month 08
day 14
svn rev 69053
language R
version.string R version 3.2.2 (2015-08-14)
nickname Fire Safety
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2016.09
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.6.0 SparkR_2.0.2 DT_0.2 jsonlite_1.2 shinythemes_1.1.1 ggthemes_3.3.0
[7] dplyr_0.5.0 ggplot2_2.2.1 leaflet_1.0.1 shiny_1.0.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 magrittr_1.5 munsell_0.4.3 colorspace_1.3-2 xtable_1.8-2 R6_2.2.0
[7] stringr_1.1.0 plyr_1.8.4 tools_3.2.2 grid_3.2.2 gtable_0.2.0 DBI_0.5-1
[13] sourcetools_0.1.5 htmltools_0.3.5 yaml_2.1.14 lazyeval_0.2.0 digest_0.6.12 assertthat_0.1
[19] tibble_1.2 htmlwidgets_0.8 mime_0.5 stringi_1.1.2 scales_0.4.1 httpuv_1.3.3
Best answer
In Spark 2.1 or later you can pass basePath as a named argument:
read.parquet(path, basePath="s3a://leftout/for/security/dashboard/")
Arguments captured by the ellipsis (...) are converted with varargsToStrEnv and used as data source options.
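Because those captured arguments become data source options, the same basePath can also be supplied through SparkR's generic reader, read.df. A minimal sketch, assuming a live SparkR session (sparkR.session() has already been called); the S3 paths are the ones from this question:

# Read a single pruned partition, but keep updateddate as a column
# by telling Spark where the table root is.
df <- read.df(
  "s3a://leftout/for/security/dashboard/updateddate=20170217",
  source = "parquet",
  basePath = "s3a://leftout/for/security/dashboard/"
)

Any option accepted by the Parquet data source can be passed this way, since read.df forwards named arguments as options just like read.parquet does.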
A complete session example:
Write data (Scala):
Seq(("a", 1), ("b", 2)).toDF("k", "v")
  .write.partitionBy("k").mode("overwrite").parquet("/tmp/data")
Read data (SparkR):
[SparkR welcome banner: version 2.1.0. SparkSession available as 'spark'.]
> paths <- dir("/tmp/data/", pattern="*parquet", full.names=TRUE, recursive=TRUE)
> read.parquet(paths, basePath="/tmp/data")
SparkDataFrame[v:int, k:string]
By contrast, without basePath:
> read.parquet(paths)
SparkDataFrame[v:int]
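If pruning to specific partitions at read time isn't actually required, a simpler alternative (an assumption, not part of the answer above) is to point read.parquet at the table root and let Spark's partition discovery recover the partition column without any basePath:

# Reading the root lets partition discovery find k=... subdirectories,
# so the partition column comes back automatically.
> read.parquet("/tmp/data")

The trade-off is that this reads every partition directory under the root, so it only helps when no partition pruning is needed.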
Regarding "r - Is there a basePath data option in SparkR?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42377891/