I am trying to modify the code from the crawler4j Quickstart example.
I want to crawl the following link:
https://www.google.com/search?biw=1366&bih=645&tbm=nws&q=%22obama%22&oq=%22obama%22&gs_l=serp.3..0l5.825041.826084.0.826833.5.5.0.0.0.0.187.572.2j3.5.0....0...1c.1.64.serp..0.3.333...0i13k1.Tmd9nARKIrU
This is a Google News search link for the keyword "obama".
I tried modifying MyCrawler.java:
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.google.com/search?biw=1366&bih=645&tbm=nws&q=%22obama%22&oq=%22obama%22&gs_l=serp.3..0l5.825041.826084.0.826833.5.5.0.0.0.0.187.572.2j3.5.0....0...1c.1.64.serp..0.3.333...0i13k1.Tmd9nARKIrU/");
    }
and Controller.java:
    /*
     * For each crawl, you need to add some seed urls. These are the first
     * URLs that are fetched and then the crawler starts following links
     * which are found in these pages
     */
    // controller.addSeed("http://www.ics.uci.edu/~lopes/");
    // controller.addSeed("http://www.ics.uci.edu/~welling/");
    controller.addSeed("https://www.google.com/search?biw=1366&bih=645&tbm=nws&q=%22obama%22&oq=%22obama%22&gs_l=serp.3..0l5.825041.826084.0.826833.5.5.0.0.0.0.187.572.2j3.5.0....0...1c.1.64.serp..0.3.333...0i13k1.Tmd9nARKIrU");
    /*
     * Start the crawl. This is a blocking operation, meaning that your code
     * will reach the line after this only when crawling is finished.
     */
    controller.start(MyCrawler.class, numberOfCrawlers);
Then it shows this error:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
BUILD SUCCESSFUL (total time: 43 seconds)
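Is my code modification wrong?
Update
I tried URLs other than the Google search link, and it works. My guess is that it cannot crawl Google search links. Any ideas on how to solve this?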
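Best answer
The error you are getting is not related to your code modification. It is caused by an incomplete configuration: a jar is missing.
For SLF4J to perform logging, it needs an SLF4J binding on the classpath. Without one, it falls back to the NOP (no-operation) logger implementation, as you can see in the error message.
To fix this, add an SLF4J binding jar to your project, for example slf4j-simple-<version>.jar.
You can refer to the SLF4J Manual for a more detailed explanation.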
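As a quick way to confirm whether a binding is visible to your project, the following minimal sketch checks the classpath for the class named in the warning. It assumes an SLF4J 1.x setup (the version that looks up org.slf4j.impl.StaticLoggerBinder); the class name Slf4jBindingCheck and the helper isOnClasspath are made up for this illustration and are not part of SLF4J or crawler4j.

```java
public class Slf4jBindingCheck {
    // The class SLF4J 1.x tries to load at startup, as named in the warning.
    static final String BINDER_CLASS = "org.slf4j.impl.StaticLoggerBinder";

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, Slf4jBindingCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isOnClasspath(BINDER_CLASS)) {
            System.out.println("SLF4J binding found on the classpath.");
        } else {
            System.out.println("No SLF4J binding found; add e.g. slf4j-simple-<version>.jar.");
        }
    }
}
```

Run it with the same classpath as the crawler. If it reports that no binding was found, the binding jar is probably not among the project's libraries.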
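Update
I don't think you are allowed to crawl Google search results. Google's robots.txt disallows crawling of URLs on its site under the /search path, and crawling them is also against their TOS: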
Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.
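You could consider using Google's Custom Search API instead, which stays within their terms of service.
Regarding "java - How to adapt the URL I want to crawl in crawler4j", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/39461257/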
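To illustrate the robots.txt point above, here is a simplified sketch of how a Disallow rule such as /search rules out the seed URL. It only does plain prefix matching on the URL path; real robots.txt handling (user-agent groups, Allow rules, wildcards) is more involved, and crawler4j does this through its own RobotstxtServer rather than code like this. The class RobotsPrefixCheck and the rule list are assumptions made for the example.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class RobotsPrefixCheck {
    /** True if the URL's path starts with any of the given Disallow prefixes. */
    static boolean isDisallowed(String url, List<String> disallowPrefixes) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        for (String prefix : disallowPrefixes) {
            // An empty Disallow value means "nothing is disallowed", so skip it.
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed rule, taken from the answer's description of Google's robots.txt.
        List<String> rules = Arrays.asList("/search");
        System.out.println(isDisallowed("https://www.google.com/search?tbm=nws&q=obama", rules)); // true
        System.out.println(isDisallowed("http://www.ics.uci.edu/~lopes/", rules));                // false
    }
}
```

This matches what the update in the question observed: other seed URLs are crawled, while the Google search URL is not.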