java - Which Nutch command do I need to run from the command line if I update the URL filter file?

Tags: java mapreduce nutch web-crawler

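Nutch gurus,

If I change files such as robots.txt or regex-urlfilter.txt, or any similar resource, which command do I need to run?

I can't tell from the Nutch usage text. My guess is that this is the parser's job, but I'm not sure.

Karthik

According to the usage text: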

# echo " crawl one-step crawler for intranets"
  echo " inject     inject new urls into the database"
  echo " hostinject     creates or updates an existing host table from a text file"
  echo " generate   generate new batches to fetch from crawl db"
  echo " fetch      fetch URLs marked during generate"
  echo " parse      parse URLs marked during fetch"
  echo " updatedb   update web table after parsing"
  echo " updatehostdb   update host table after parsing"
  echo " readdb     read/dump records from page database"
  echo " readhostdb     display entries from the hostDB"
  echo " elasticindex   run the elasticsearch indexer"
  echo " solrindex  run the solr indexer on parsed batches"
  echo " solrdedup  remove duplicates from solr"
  echo " parsechecker   check the parser for a given url"
  echo " indexchecker   check the indexing filters for a given url"
  echo " plugin     load a plugin and run one of its classes main()"
  echo " nutchserver    run a (local) Nutch server on a user defined port"
  echo " junit          runs the given JUnit test"
  echo " or"
  echo " CLASSNAME  run the class named CLASSNAME"
  echo "Most commands print help when invoked w/o parameters."

Best answer

If you change the regex-urlfilter.txt file, you need to update the Nutch job file. This can be done as follows:

jar -uvf /usr/local/nutch-1.2/nutch-1.2.job <path to regex-urlfilter.txt>
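
One detail worth checking with the command above: `jar` stores an updated file under the path exactly as typed, so passing a full path may add a new entry instead of replacing the existing one. The sketch below uses `-C` to avoid that. It assumes the file sits at the root of the job archive (list the archive with `jar -tf` first to confirm for your build); `update_job_entry` is a hypothetical helper name, not a Nutch command, and the paths in the usage comment are placeholders.

```shell
#!/bin/sh
# Sketch: refresh one config file inside a Nutch .job archive.
# update_job_entry is a hypothetical helper, not part of Nutch.
update_job_entry() {
    job_file=$1   # e.g. /usr/local/nutch-1.2/nutch-1.2.job
    conf_dir=$2   # directory that holds the edited file
    entry=$3      # e.g. regex-urlfilter.txt

    # -C changes into conf_dir first, so the entry is stored under its
    # bare file name rather than the full path given on the command line.
    jar -uf "$job_file" -C "$conf_dir" "$entry" || return 1

    # List the archive and print the entry name if it is present at the root.
    jar -tf "$job_file" | grep -Fx "$entry"
}

# Usage (assumed paths; adjust to your installation):
# update_job_entry /usr/local/nutch-1.2/nutch-1.2.job /usr/local/nutch-1.2/conf regex-urlfilter.txt
```

If the listing shows the file under a different path inside the archive, run the update from a directory layout that reproduces that path.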

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/25538609/
