hadoop - pyspark : how to check if a file exists in hdfs

Tags: hadoop apache-spark filesystems hdfs pyspark

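I want to check whether several files exist in HDFS before loading them through SparkContext. I am using pyspark. I tried os.system("hadoop fs -test -e %s"%path), but because I have a lot of paths to check, the job crashed. I also tried sc.wholeTextFiles(parent_path) and then filtering by key, but that crashed as well, because parent_path contains a lot of subpaths and files. Could you help me?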

Best answer

As Tristan Reid rightly put it:

...(Spark) It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a builtin facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.

In any case, here is his answer to a related question: Pyspark: get list of files/directories on HDFS path
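For reference, the listing approach from that related question can be sketched roughly as below. This is only a sketch under assumptions: the function name list_hdfs_dir is made up for illustration, and sc._jvm and sc._jsc are internal (non-public) attributes of SparkContext that expose the driver's JVM through Py4J, so they may change between Spark versions. The Hadoop calls used (Path, Path.getFileSystem, FileSystem.listStatus, FileStatus.getPath) belong to the standard Hadoop FileSystem API.

```python
def list_hdfs_dir(sc, directory):
    """Return the full paths of the entries directly under `directory`.

    `sc` is an active SparkContext. The listing is not recursive.
    """
    # Internal attributes: the Py4J view of the driver JVM and the
    # Hadoop configuration Spark is already using.
    jvm = sc._jvm
    hadoop_conf = sc._jsc.hadoopConfiguration()

    # Build a Hadoop Path and resolve the FileSystem that owns it.
    path = jvm.org.apache.hadoop.fs.Path(directory)
    fs = path.getFileSystem(hadoop_conf)

    # listStatus returns one FileStatus per child entry; toString()
    # hands back the fully qualified path as a Python string.
    return [status.getPath().toString() for status in fs.listStatus(path)]


# Usage, assuming an existing SparkContext named sc and a
# hypothetical directory:
#   entries = list_hdfs_dir(sc, "/data/input")
```

Everything here runs in the driver through a single FileSystem handle, so no external hadoop process is started per path, unlike the os.system attempt in the question. The same FileSystem object also has an exists(path) method, if you only need to test a handful of paths directly.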

Once you have the list of files in a directory, it is easy to check whether a particular file exists.
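A minimal sketch of that check in plain Python, assuming you already hold the directory listing as a list of path strings. The name find_missing and the example paths are made up for illustration. One assumption to be aware of: a Hadoop listing usually returns fully qualified URIs (scheme and namenode included), so this sketch compares only the path component and would be wrong if the paths span more than one filesystem.

```python
from urllib.parse import urlparse


def find_missing(wanted_paths, listed_paths):
    """Return the wanted paths that do not appear in the listing.

    Paths are compared on their path component only, so
    "hdfs://nn:8020/data/a.txt" matches "/data/a.txt".
    Assumes every path lives on the same filesystem.
    """
    # Build the set once so each lookup is O(1), which matters when
    # there are many paths to check.
    present = {urlparse(p).path for p in listed_paths}
    return [p for p in wanted_paths if urlparse(p).path not in present]


# Hypothetical listing and candidate paths.
listed = ["hdfs://nn:8020/data/a.txt", "hdfs://nn:8020/data/b.txt"]
wanted = ["/data/a.txt", "/data/c.txt"]
missing = find_missing(wanted, listed)
print(missing)
```

Listing the parent directory once and testing membership in a set avoids one filesystem round trip per file, which is the part that hurt in the question.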

Hope that helps.

Regarding hadoop - pyspark : how to check if a file exists in hdfs, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32334772/
