hadoop - Gobblin Map-reduce作业在EMR上成功运行,但在s3中无输出

标签 hadoop amazon-s3 amazon-emr camus gobblin

我正在运行gobblin,以使用3节点EMR集群将数据从kafka移至s3。我在hadoop 2.6.0上运行,并且还针对2.6.0构建了gobblin。

似乎map-reduce作业成功运行。在我的hdfs上,我看到指标和工作目录。指标有一些文件,但工作目录为空。 S3存储桶应该有最终输出,但没有数据。最后,它说

输出任务状态路径/ gooblinOutput / working / GobblinKafkaQuickStart_mapR3 / output / job_GobblinKafkaQuickStart_mapR3_1460132596498不存在
删除的工作目录/ gooblinOutput / working / GobblinKafkaQuickStart_mapR3

这是最终的日志:

2016-04-08 16:23:26 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1366 -      Job job_1460065322409_0002 running in uber mode : false
2016-04-08 16:23:26 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 0% reduce 0%
2016-04-08 16:23:32 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 10% reduce 0%
2016-04-08 16:23:33 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 40% reduce 0%
2016-04-08 16:23:34 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 60% reduce 0%
2016-04-08 16:23:36 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 80% reduce 0%
2016-04-08 16:23:37 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 100% reduce 0%
2016-04-08 16:23:38 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1384 -      Job job_1460065322409_0002 completed successfully
2016-04-08 16:23:38 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1391 -      Counters: 30
    File System Counters
     FILE: Number of bytes read=0
    FILE: Number of bytes written=1276095
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=28184
    HDFS: Number of bytes written=41960
    HDFS: Number of read operations=60
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=11
Job Counters
    Launched map tasks=10
    Other local map tasks=10
    Total time spent by all maps in occupied slots (ms)=1828125
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=40625
    Total vcore-seconds taken by all map tasks=40625
    Total megabyte-seconds taken by all map tasks=58500000
Map-Reduce Framework
    Map input records=10
    Map output records=0
    Input split bytes=2150
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=296
    CPU time spent (ms)=10900
    Physical memory (bytes) snapshot=2715054080
    Virtual memory (bytes) snapshot=18852671488
    Total committed heap usage (bytes)=4729077760
File Input Format Counters
    Bytes Read=6444
File Output Format Counters
    Bytes Written=0
2016-04-08 16:23:38 UTC INFO  [TaskStateCollectorService STOPPING]    gobblin.runtime.TaskStateCollectorService  101 - Stopping the    TaskStateCollectorService
2016-04-08 16:23:38 UTC WARN  [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService  123 - Output task state path /gooblinOutput/working/GobblinKafkaQuickStart_mapR3/output/job_GobblinKafkaQuickStart_mapR3_1460132596498 does not exist
2016-04-08 16:23:38 UTC INFO  [main] gobblin.runtime.mapreduce.MRJobLauncher  443 - Deleted working directory /gooblinOutput/working/GobblinKafkaQuickStart_mapR3
2016-04-08 16:23:38 UTC INFO  [main] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@6c257d54[Shutting down, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
 2016-04-08 16:23:38 UTC INFO  [main] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@6c257d54[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]  
 2016-04-08 16:23:38 UTC INFO  [main] gobblin.runtime.app.ServiceBasedAppLauncher  158 - Shutting down the application
 2016-04-08 16:23:38 UTC INFO  [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@5584dbb6
 2016-04-08 16:23:38 UTC INFO  [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@5584dbb6
 2016-04-08 16:23:38 UTC WARN  [Thread-7] gobblin.runtime.app.ServiceBasedAppLauncher  153 - ApplicationLauncher has already stopped
 2016-04-08 16:23:38 UTC WARN  [Thread-4] gobblin.metrics.reporter.ContextAwareReporter  116 - Reporter MetricReportReporter has already been stopped.
 2016-04-08 16:23:38 UTC WARN  [Thread-4] gobblin.metrics.reporter.ContextAwareReporter  116 - Reporter MetricReportReporter has already been stopped.

这是我的配置文件:
gobblin-mapreduce.properties

# Thread pool settings for the task executor
taskexecutor.threadpool.size=2
taskretry.threadpool.coresize=1
taskretry.threadpool.maxsize=2

# File system URIs
fs.uri=hdfs://{host}:8020
writer.fs.uri=${fs.uri}
state.store.fs.uri=s3a://{bucket}/gobblin-mapr/

# Writer related configuration properties
writer.destination.type=HDFS
writer.output.format=AVRO
writer.staging.dir=${env:GOBBLIN_WORK_DIR}/task-staging
writer.output.dir=${env:GOBBLIN_WORK_DIR}/task-output

# Data publisher related configuration properties 
data.publisher.type=gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=${env:GOBBLIN_WORK_DIR}/job-output
data.publisher.replace.final.dir=false

# Directory where job/task state files are stored
state.store.dir=${env:GOBBLIN_WORK_DIR}/state-store

# Directory where error files from the quality checkers are stored
qualitychecker.row.err.file=${env:GOBBLIN_WORK_DIR}/err

# Directory where job locks are stored
job.lock.dir=${env:GOBBLIN_WORK_DIR}/locks

# Directory where metrics log files are stored
metrics.log.dir=${env:GOBBLIN_WORK_DIR}/metrics

# Interval of task state reporting in milliseconds
task.status.reportintervalinms=5000

# MapReduce properties
mr.job.root.dir=${env:GOBBLIN_WORK_DIR}/working


# s3 bucket configuration

data.publisher.fs.uri=s3a://{bucket}/gobblin-mapr/
fs.s3a.access.key={key}
fs.s3a.secret.key={key}

文件2:kafka-to-s3.pull
job.name=GobblinKafkaQuickStart_mapR3
job.group=GobblinKafka_mapR3
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false

kafka.brokers={kafka-host}:9092
topic.whitelist={topic_name}

source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka

writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt

data.publisher.type=gobblin.publisher.BaseDataPublisher

mr.job.max.mappers=10
bootstrap.with.offset=latest

metrics.reporting.file.enabled=true
metircs.enabled=true
metrics.reporting.file.suffix=txt

运行命令
export GOBBLIN_WORK_DIR=/gooblinOutput
Command : bin/gobblin-mapreduce.sh --conf /home/hadoop/gobblin-files/gobblin-dist/kafkaConf/kafka-to-s3.pull --logdir /home/hadoop/gobblin-files/gobblin-dist/logs

不知道怎么回事。有人可以帮忙吗?

最佳答案

有2期

我有data.publisher.final.dir = $ {env:GOBBLIN_WORK_DIR} / job-output

它应该像s3a://dev.com/gobblin-mapr6/

然后我以某种方式将特殊字符添加到topic.whitelist。因此,它无法识别主题

关于hadoop - Gobblin Map-reduce作业在EMR上成功运行,但在s3中无输出,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/36506750/

相关文章:

amazon-web-services - 什么是 AWS Request Metrics Full Support?如何关闭它?

amazon-web-services - 将外部域名连接到 AWS S3 网站

java - 如何终止 Amazon EMR 中的特定 JobFlow?

hadoop - 配置 EMR 集群以实现公平调度

hadoop - 如何读取子工作流(单独的 xml 文件)中的配置属性?

java - hadoop 失败的原因是什么?

python - 使用 boto3 时 S3 连接超时

pyspark - AWS EMR 从 S3 导入 pyfile

java - 如何从 Java 代码运行 Hadoop HDFS 命令

hadoop - 使用 copyFromLocal 开关将数据移动到 hdfs