hadoop - How to deploy a nutch job on google dataproc?

Tags: hadoop, nutch, google-cloud-dataproc

I have been trying to deploy a Nutch job (with custom plugins) on my Google Dataproc Hadoop cluster, but I keep running into errors (some of them basic mistakes, I suspect).

I need a clear, step-by-step guide on how to do this. The guide should cover how to set up permissions and access files both in the gs bucket and on the local filesystem (Windows 7).

I tried this configuration, without success:

Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job
Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/nutch
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4

I also tried:

Region: global 
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job
Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/crawl
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4

And:

Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job
Main class or jar: org.apache.nutch.crawl.Crawl
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4
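For reference, this last attempt maps onto the gcloud CLI roughly as follows (an untested sketch; --jars plus --class is the CLI counterpart of the "Jar files" and "Main class or jar" form fields, and job arguments follow the bare -- separator):

# Sketch of the same submission via the gcloud CLI (untested):
gcloud dataproc jobs submit hadoop \
    --region=global \
    --cluster=first-cluster \
    --class=org.apache.nutch.crawl.Crawl \
    --jars=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job \
    -- gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt -depth 4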

Follow-up: I have made some progress, but I am now getting this error:

17/07/28 18:59:11 INFO crawl.Injector: Injector: starting at 2017-07-28 18:59:11
17/07/28 18:59:11 INFO crawl.Injector: Injector: crawlDb: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls
17/07/28 18:59:11 INFO crawl.Injector: Injector: urlDir: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb
17/07/28 18:59:11 INFO Configuration.deprecation: mapred.temp.dir is deprecated. Instead, use mapreduce.cluster.temp.dir
17/07/28 18:59:11 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
17/07/28 18:59:11 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
17/07/28 18:59:11 ERROR crawl.Injector: Injector: java.lang.IllegalArgumentException: Wrong FS: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls, expected: hdfs://first-cluster-m
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:648)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:298)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)

I know it has to do with the filesystem. How do I access the GCP (Cloud Storage) filesystem and the Hadoop filesystem?
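A quick way to see what each scheme resolves to is to SSH into the master node and list a path under both filesystems (a sketch, using the bucket and cluster names above):

# On the cluster master (first-cluster-m):
hadoop fs -ls gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy   # Cloud Storage, via the GCS connector
hadoop fs -ls hdfs://first-cluster-m/user                                      # the cluster's own HDFS

If the gs:// listing works, the cluster can already reach the bucket, and only the paths passed to the job need fixing.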

Follow-up: I have made some progress with this config:

{ "reference": { "projectId": "ageless-valor-174413", "jobId": "108a7d43-671a-4f61-8ba8-b87010a8a823" }, "placement": { "clusterName": "first-cluster", "clusterUuid": "f3795563-bd44-4896-bec7-0eb81a3f685a" }, "status": { "state": "ERROR", "details": "Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/driveroutput'.", "stateStartTime": "2017-07-28T18:59:13.518Z" }, "statusHistory": [ { "state": "PENDING", "stateStartTime": "2017-07-28T18:58:57.660Z" }, { "state": "SETUP_DONE", "stateStartTime": "2017-07-28T18:59:00.811Z" }, { "state": "RUNNING", "stateStartTime": "2017-07-28T18:59:02.347Z" } ], "driverOutputResourceUri": "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/driveroutput", "driverControlFilesUri": "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/", "hadoopJob": { "mainClass": "org.apache.nutch.crawl.Injector", "args": [ "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls/", "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb/" ], "jarFileUris": [ "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job" ], "loggingConfig": {} } }

But I am now getting the same "Wrong FS" error shown above.

Best Answer

You can fix this by referencing the Google Cloud Storage files with the correct scheme (gs://), and by changing the default file system to Google Cloud Storage.

Step 1:

Replace

https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls

with

gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls

(The https://console.cloud.google.com/... address is a browser URL for the Cloud Storage console, not a filesystem URI, which is why Hadoop rejects it with "Wrong FS".)
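Applied to the Injector job from the question, the corrected submission would look something like this (a sketch via the gcloud CLI; note also that the log output shows Injector treating its first argument as the crawlDb and the second as the urlDir, so the two paths in the original args appear to be swapped):

# Corrected Injector submission (sketch): crawlDb first, then the urls dir
gcloud dataproc jobs submit hadoop \
    --region=global \
    --cluster=first-cluster \
    --class=org.apache.nutch.crawl.Injector \
    --jars=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job \
    -- gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb \
       gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls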

Step 2:

Add the following property to your nutch-site.xml file:

<property> <name>fs.defaultFS</name> <value>gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia</value> <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description> </property>

(In older versions of Hadoop this property was called "fs.default.name".)
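If rebuilding the .job file just to change nutch-site.xml is inconvenient, the same property can often be supplied per job at submission time instead, via gcloud's --properties flag (an untested sketch; whether Nutch picks it up depends on how it builds its configuration):

# Per-job override instead of editing nutch-site.xml (untested sketch):
gcloud dataproc jobs submit hadoop \
    --region=global \
    --cluster=first-cluster \
    --class=org.apache.nutch.crawl.Injector \
    --jars=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job \
    --properties=fs.defaultFS=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia \
    -- gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb \
       gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls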

Regarding "hadoop - How to deploy a nutch job on google dataproc?", see the original question on Stack Overflow: https://stackoverflow.com/questions/45315412/
