hadoop - How to deploy a nutch job on google dataproc?

Tags: hadoop, nutch, google-cloud-dataproc

I have been trying to deploy a Nutch job (with custom plugins) on my Google Dataproc Hadoop cluster, but I keep running into errors (some of them basic mistakes, I suspect).

I need a clear, step-by-step guide on how to do this. The guide should cover how to set up permissions and access files both in the gs bucket and on the local filesystem (Windows 7).

I tried this configuration, without success:

Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job
Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/nutch
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4

I also tried:

Region: global 
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job
Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/crawl
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4

And:

Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job
Main class or jar: org.apache.nutch.crawl.Crawl
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4
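For reference, this last attempt maps onto the gcloud CLI roughly as follows (an untested sketch; --jars plus --class is the CLI counterpart of the "Jar files" and "Main class or jar" form fields, and job arguments follow the bare -- separator):

# Sketch of the same submission via the gcloud CLI (untested):
gcloud dataproc jobs submit hadoop \
    --region=global \
    --cluster=first-cluster \
    --class=org.apache.nutch.crawl.Crawl \
    --jars=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job \
    -- gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt -depth 4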

Follow-up: I have made some progress, but I am now getting this error:

17/07/28 18:59:11 INFO crawl.Injector: Injector: starting at 2017-07-28 18:59:11
17/07/28 18:59:11 INFO crawl.Injector: Injector: crawlDb: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls
17/07/28 18:59:11 INFO crawl.Injector: Injector: urlDir: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb
17/07/28 18:59:11 INFO Configuration.deprecation: mapred.temp.dir is deprecated. Instead, use mapreduce.cluster.temp.dir
17/07/28 18:59:11 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
17/07/28 18:59:11 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
17/07/28 18:59:11 ERROR crawl.Injector: Injector: java.lang.IllegalArgumentException: Wrong FS: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls, expected: hdfs://first-cluster-m
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:648)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:298)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)

I know it has to do with the filesystem. How do I access the GCP (Cloud Storage) filesystem and the Hadoop filesystem?
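A quick way to see what each scheme resolves to is to SSH into the master node and list a path under both filesystems (a sketch, using the bucket and cluster names above):

# On the cluster master (first-cluster-m):
hadoop fs -ls gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy   # Cloud Storage, via the GCS connector
hadoop fs -ls hdfs://first-cluster-m/user                                      # the cluster's own HDFS

If the gs:// listing works, the cluster can already reach the bucket, and only the paths passed to the job need fixing.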

Follow-up: I have made some progress with this config:

{ "reference": { "projectId": "ageless-valor-174413", "jobId": "108a7d43-671a-4f61-8ba8-b87010a8a823" }, "placement": { "clusterName": "first-cluster", "clusterUuid": "f3795563-bd44-4896-bec7-0eb81a3f685a" }, "status": { "state": "ERROR", "details": "Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/driveroutput'.", "stateStartTime": "2017-07-28T18:59:13.518Z" }, "statusHistory": [ { "state": "PENDING", "stateStartTime": "2017-07-28T18:58:57.660Z" }, { "state": "SETUP_DONE", "stateStartTime": "2017-07-28T18:59:00.811Z" }, { "state": "RUNNING", "stateStartTime": "2017-07-28T18:59:02.347Z" } ], "driverOutputResourceUri": "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/driveroutput", "driverControlFilesUri": "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/", "hadoopJob": { "mainClass": "org.apache.nutch.crawl.Injector", "args": [ "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls/", "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb/" ], "jarFileUris": [ "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job" ], "loggingConfig": {} } }

But I am now getting the same "Wrong FS" error shown above.

Best Answer

You can fix this by referencing the Google Cloud Storage files with the correct scheme (gs://), and by changing the default file system to Google Cloud Storage.

Step 1:

Replace

https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls

with

gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls

(The https://console.cloud.google.com/... address is a browser URL for the Cloud Storage console, not a filesystem URI, which is why Hadoop rejects it with "Wrong FS".)
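Applied to the Injector job from the question, the corrected submission would look something like this (a sketch via the gcloud CLI; note also that the log output shows Injector treating its first argument as the crawlDb and the second as the urlDir, so the two paths in the original args appear to be swapped):

# Corrected Injector submission (sketch): crawlDb first, then the urls dir
gcloud dataproc jobs submit hadoop \
    --region=global \
    --cluster=first-cluster \
    --class=org.apache.nutch.crawl.Injector \
    --jars=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job \
    -- gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb \
       gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls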

Step 2:

Add the following property to your nutch-site.xml file:

<property> <name>fs.defaultFS</name> <value>gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia</value> <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description> </property>

(In older versions of Hadoop this property was called "fs.default.name".)
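If rebuilding the .job file just to change nutch-site.xml is inconvenient, the same property can often be supplied per job at submission time instead, via gcloud's --properties flag (an untested sketch; whether Nutch picks it up depends on how it builds its configuration):

# Per-job override instead of editing nutch-site.xml (untested sketch):
gcloud dataproc jobs submit hadoop \
    --region=global \
    --cluster=first-cluster \
    --class=org.apache.nutch.crawl.Injector \
    --jars=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-SNAPSHOT.job \
    --properties=fs.defaultFS=gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia \
    -- gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawlDb \
       gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls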

Regarding "hadoop - How to deploy a nutch job on google dataproc?", see the original question on Stack Overflow: https://stackoverflow.com/questions/45315412/
