java - Installing Apache Nutch on Windows

Tags: java hadoop solr nutch

I am trying to integrate Apache Solr with Apache Nutch 1.14 on Windows 7 (64-bit), but I get an error when I try to run Nutch.

What I have already done:

  • Set the JAVA_HOME env variable to C:\Program Files\Java\jdk1.8.0_25 or C:\Progra~1\Java\jdk1.8.0_25.
  • Downloaded the Hadoop WinUtils files from https://github.com/steveloughran/winutils/tree/master/hadoop-3.0.0/bin, placed them in c:\winutils\bin, set the HADOOP_HOME env variable to c:\winutil, and added the c:\winutil\bin folder to PATH (see the sketch just after this list).

  • (I also tried Hadoop WinUtils 2.7.1, with no success.)
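
    For reference, here is a minimal sketch (Windows cmd) of the environment described above, using the same paths; setx persists the values for newly opened shells, and the PATH entry could equally be added through the System Properties dialog:

    setx JAVA_HOME "C:\Progra~1\Java\jdk1.8.0_25"
    setx HADOOP_HOME "C:\winutil"
    rem c:\winutil\bin was also appended to the PATH variable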

    The error I get:
    $ bin/crawl -i -D http://localhost:8983/solr/ -s urls/ TestCrawl 2
      Injecting seed URLs
      /home/apache-nutch-1.14/bin/nutch inject TestCrawl/crawldb urls/
      Injector: starting at 2018-06-20 07:14:47
      Injector: crawlDb: TestCrawl/crawldb
      Injector: urlDir: urls
      Injector: Converting injected urls to crawl db entries.
      Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
        at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
        at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:187)
        at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115)
        at org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:125)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
        at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:240)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
        at org.apache.nutch.crawl.Injector.run(Injector.java:563)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:528)
      Error running:
        /home/apache-nutch-1.14/bin/nutch inject TestCrawl/crawldb urls/
      Failed with exit value 1.
    

    After downloading hadoop-core-1.1.2.jar from http://www.java2s.com/Code/Jar/h/Downloadhadoopcore121jar.htm and copying it into the NUTCH_HOME/lib folder, I get the following error:
    $ bin/crawl -i -D http://localhost:8983/solr/ -s urls/ TestCrawl 2
      Injecting seed URLs
      /home/apache-nutch-1.14/bin/nutch inject TestCrawl/crawldb urls/
      Injector: starting at 2018-06-20 23:19:49
      Injector: crawlDb: TestCrawl/crawldb
      Injector: urlDir: urls
      Injector: Converting injected urls to crawl db entries.
      Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.Job.getInstance(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/String;)Lorg/apache/hadoop/mapreduce/Job;
        at org.apache.nutch.crawl.Injector.inject(Injector.java:401)
        at org.apache.nutch.crawl.Injector.run(Injector.java:563)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:528)
      Error running:
        /home/apache-nutch-1.14/bin/nutch inject TestCrawl/crawldb urls/
      Failed with exit value 1.
    

    If the HADOOP_HOME variable is not set, I get the following exception:
    Injector: java.io.IOException: (null) entry in command string: null chmod 0644 C:\cygwin64\home\apache-nutch-1.14\TestCrawl\crawldb\.locked
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
        at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
        at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
        at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:854)
        at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1154)
        at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:59)
        at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:81)
        at org.apache.nutch.crawl.CrawlDb.lock(CrawlDb.java:178)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:398)
        at org.apache.nutch.crawl.Injector.run(Injector.java:563)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:528)
    
      Error running:
        /home/apache-nutch-1.14/bin/nutch inject TestCrawl//crawldb urls/
      Failed with exit value 127.
    

    I would appreciate any help I can get!

    Best Answer

    When you do your crawl, just run the following command:

    bin/crawl -s urls/ TestCrawl/ 2
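
    Here, urls/ is the seed directory (expected to contain one or more plain-text files, e.g. a hypothetical urls/seed.txt, listing one URL per line), TestCrawl/ is the crawl directory, and 2 is the number of crawl rounds; a minimal seed file might look like:

    http://nutch.apache.org/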
    

    Then you can run the index step, passing the Solr URL via -D:
    bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/YOURCORE TestCrawl/crawldb/ -linkdb TestCrawl/linkdb/ TestCrawl/segments/* -filter -normalize -deleteGone
    

    Or you can specify it in conf/nutch-site.xml:
    <property>
        <name>solr.server.url</name>
        <value>http://localhost:8983/solr/YOURCORE/</value>
        <description>Defines the Solr URL into which data should be indexed using the indexer-solr plugin.</description>
    </property> 
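
    With that property in place, the same index step can presumably be run without the -D option, for example:

    bin/nutch index TestCrawl/crawldb/ -linkdb TestCrawl/linkdb/ TestCrawl/segments/* -filter -normalize -deleteGone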
    

    Regarding java - Installing Apache Nutch on Windows, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/50956644/
