python - Spark: reading files while excluding a pattern

Tags: python bash apache-spark hadoop

df = sc.textFile("hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/*/part-*.gz")

I use this code to read all the gz files under the path

    hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/

There are 24 subdirectories under this path, 00 through 23. How can I read the files while excluding directory 23?

drwxr-xr-x   - algo algo          0 2018-08-29 23:07 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/00
drwxr-xr-x   - algo algo          0 2018-08-29 23:11 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/01
drwxr-xr-x   - algo algo          0 2018-08-29 23:17 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/02
drwxr-xr-x   - algo algo          0 2018-08-29 23:23 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/03
drwxr-xr-x   - algo algo          0 2018-08-29 23:13 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/04
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/05
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/06
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/07
drwxr-xr-x   - algo algo          0 2018-08-29 23:18 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/08
drwxr-xr-x   - algo algo          0 2018-08-29 23:21 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/09
drwxr-xr-x   - algo algo          0 2018-08-29 23:18 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/10
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/11
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/12
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/13
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/14
drwxr-xr-x   - algo algo          0 2018-08-29 23:17 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/15
drwxr-xr-x   - algo algo          0 2018-08-29 23:20 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/16
drwxr-xr-x   - algo algo          0 2018-08-29 23:18 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/17
drwxr-xr-x   - algo algo          0 2018-08-29 23:21 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/18
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/19
drwxr-xr-x   - algo algo          0 2018-08-29 23:17 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/20
drwxr-xr-x   - algo algo          0 2018-08-29 23:19 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/21
drwxr-xr-x   - algo algo          0 2018-08-29 23:15 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/22
drwxr-xr-x   - algo algo          0 2018-08-29 23:21 hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/23

Best answer

This is something of a workaround, but I hope it helps.

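import os
# Each directory line of `hadoop fs -ls` ends with the full path; skip the "Found N items" header
lines = os.popen('hadoop fs -ls hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828/').readlines()
paths = [x.split()[-1] for x in lines if x.startswith('d')]
# Drop hour 23 and read only the part-*.gz files in the remaining directories
file_list = [p + '/part-*.gz' for p in paths if not p.endswith('/23')]
rdd = sc.textFile(','.join(file_list))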
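Since the hour directories follow a fixed 00-23 naming scheme, the path list can also be built without shelling out to `hadoop fs -ls`. The following is a minimal sketch under that assumption; `hourly_paths` and `excluded_hours` are hypothetical names introduced here for illustration, and `sc` is taken to be the SparkContext from the question.

```python
BASE = "hdfs://n21-01-03/algo/ml_platform/downsample_data/nl/20180828"


def hourly_paths(base, excluded_hours=(23,)):
    """Return a comma-separated string of part-file globs, one per kept hour.

    Hypothetical helper: assumes the subdirectories are named 00..23.
    """
    hours = [h for h in range(24) if h not in excluded_hours]
    return ",".join("{}/{:02d}/part-*.gz".format(base, h) for h in hours)


paths = hourly_paths(BASE)

# sc.textFile accepts a comma-separated list of paths, so the result
# can be passed to it directly (uncomment where `sc` is available):
# rdd = sc.textFile(paths)
```

If Hadoop's glob character classes are acceptable in your setup, a shorter variant is to pass two patterns, `.../20180828/[01][0-9]/part-*.gz` and `.../20180828/2[0-2]/part-*.gz`, joined by a comma. That form should match hours 00-22 only, but it is worth checking against your Hadoop version.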

Regarding "python - Spark: reading files while excluding a pattern", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52090066/
