java - Nutch is not crawling URLs with query string parameters

Tags: java web-crawler nutch

I am using Nutch 1.9 and trying to crawl with the individual commands. As the output below shows, when it reaches the second-round Generator, 0 records are selected. Has anyone run into this? I have been stuck on it for the past two days and have searched every option I could find. Any clue/help would be greatly appreciated.

#######  INJECT   ######
Injector: starting at 2015-04-08 17:36:20
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 1
Injector: Total new urls injected: 1
Injector: finished at 2015-04-08 17:36:21, elapsed: 00:00:01
####  GENERATE  ###
Generator: starting at 2015-04-08 17:36:22
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20150408173625
Generator: finished at 2015-04-08 17:36:26, elapsed: 00:00:03
crawl/segments/20150408173625
#### FETCH  ####
Fetcher: starting at 2015-04-08 17:36:26
Fetcher: segment: crawl/segments/20150408173625
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
fetching https://ifttt.com/recipes/search?q=SmartThings (queue crawl delay=5000ms)
Using queue mode : byHost
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
Thread FetcherThread has no more work available
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2015-04-08 17:36:33, elapsed: 00:00:06
#### PARSE ####
ParseSegment: starting at 2015-04-08 17:36:33
ParseSegment: segment: crawl/segments/20150408173625
ParseSegment: finished at 2015-04-08 17:36:35, elapsed: 00:00:01
########   UPDATEDB   ##########
CrawlDb update: starting at 2015-04-08 17:36:36
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20150408173625]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2015-04-08 17:36:37, elapsed: 00:00:01
#####  GENERATE  ######
Generator: starting at 2015-04-08 17:36:38
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
#######   EXTRACT  #########
crawl/segments/20150408173625
#### Segments #####
20150408173625

Edit: I checked another URL with a query parameter ( http://queue.acm.org/detail.cfm?id=988409 ) and Nutch crawled it just fine...

So that means it is not crawling my original URL: https://ifttt.com/recipes/search?q=SmartThings&ac=true

I also tried crawling a URL on the same ifttt domain without the query string, and Nutch crawled it successfully...

I think the problem is with crawling an https site that has a query string. Any help on this?

Best Answer

By default, links with query parameters are ignored (filtered out). To enable crawling URLs with parameters, open conf/regex-urlfilter.txt and comment out the following rule by adding a # at the start of the line:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
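To see why the default rule drops the URL in question, its effect can be reproduced with a plain java.util.regex sketch (UrlFilterDemo is a hypothetical class name for illustration, not part of Nutch; the real filter applies many rules from the file in order, this shows only the one rule):

```java
import java.util.regex.Pattern;

public class UrlFilterDemo {
    // The rule in conf/regex-urlfilter.txt is written as "-[?*!@=]";
    // the leading '-' means "reject any URL that matches this regex".
    private static final Pattern QUERY_CHARS = Pattern.compile("[?*!@=]");

    // Returns true if the default rule would filter the URL out.
    static boolean rejectedByDefaultRule(String url) {
        return QUERY_CHARS.matcher(url).find();
    }

    public static void main(String[] args) {
        // '?' and '=' in the query string trigger the rule -> rejected
        System.out.println(rejectedByDefaultRule(
                "https://ifttt.com/recipes/search?q=SmartThings&ac=true"));
        // no query-string characters -> kept, which matches the
        // observation that the URL without parameters crawled fine
        System.out.println(rejectedByDefaultRule(
                "https://ifttt.com/recipes"));
    }
}
```

This also explains why the injected seed was fetched (injection happened before filtering rejected the outlinks) but the second-round Generator found 0 records: every discovered link containing ?, =, etc. was filtered out before it could be scheduled.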

Regarding "java - Nutch is not crawling URLs with query string parameters", the original question is on Stack Overflow: https://stackoverflow.com/questions/29514441/
