java - How to fit the URL I want to crawl in crawler4j

Tags: java parsing web-crawler jsoup crawler4j

I am trying to modify the code from the crawler4j Quickstart example.

I want to crawl the following link:

https://www.google.com/search?biw=1366&bih=645&tbm=nws&q=%22obama%22&oq=%22obama%22&gs_l=serp.3..0l5.825041.826084.0.826833.5.5.0.0.0.0.187.572.2j3.5.0....0...1c.1.64.serp..0.3.333...0i13k1.Tmd9nARKIrU

This is a Google News search link for the keyword "obama".

I tried modifying MyCrawler.java:

 @Override
 public boolean shouldVisit(Page referringPage, WebURL url) {
     String href = url.getURL().toLowerCase();
     return !FILTERS.matcher(href).matches()
            && href.startsWith("https://www.google.com/search?biw=1366&bih=645&tbm=nws&q=%22obama%22&oq=%22obama%22&gs_l=serp.3..0l5.825041.826084.0.826833.5.5.0.0.0.0.187.572.2j3.5.0....0...1c.1.64.serp..0.3.333...0i13k1.Tmd9nARKIrU/");
 }

And Controller.java:

 /*
  * For each crawl, you need to add some seed urls. These are the first
  * URLs that are fetched and then the crawler starts following links
  * which are found in these pages
  */
  //controller.addSeed("http://www.ics.uci.edu/~lopes/");
  // controller.addSeed("http://www.ics.uci.edu/~welling/");
    controller.addSeed("https://www.google.com/search?biw=1366&bih=645&tbm=nws&q=%22obama%22&oq=%22obama%22&gs_l=serp.3..0l5.825041.826084.0.826833.5.5.0.0.0.0.187.572.2j3.5.0....0...1c.1.64.serp..0.3.333...0i13k1.Tmd9nARKIrU");

 /*
  * Start the crawl. This is a blocking operation, meaning that your code
  * will reach the line after this only when crawling is finished.
  */
  controller.start(MyCrawler.class, numberOfCrawlers);
---

Then it shows this error:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
BUILD SUCCESSFUL (total time: 43 seconds)
---

Is my code modification wrong?

---

UPDATE

I tried URLs other than the Google search link, and those worked. I guess crawler4j cannot crawl Google search links. Any ideas how to solve this?

Best Answer

The error you are getting is not related to your code modifications. Instead, it is caused by an incomplete configuration: a missing jar.

For SLF4J to perform any logging, an SLF4J binding must be on the classpath; otherwise it falls back to the no-operation (NOP) logger implementation, exactly as the error message says.

To fix this, add an SLF4J binding jar to your project, e.g. slf4j-simple-<version>.jar.
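For example, with Maven the binding can be declared as a dependency (the version shown is illustrative; pick one that matches the slf4j-api version crawler4j pulls in):

```xml
<!-- SLF4J "simple" binding: routes SLF4J log calls to the console.
     Version is illustrative; match it to your slf4j-api version. -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.36</version>
</dependency>
```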

You can refer to the SLF4J Manual for a more detailed explanation.

UPDATE

I don't think you are allowed to crawl Google search results: Google's robots.txt disallows crawling of any path under /search, and their TOS forbids it as well:

Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.
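The relevant lines of Google's robots.txt (excerpted; the file may change over time) look like this, and since crawler4j honors robots.txt by default via its RobotstxtServer, your seed page is never fetched:

```
User-agent: *
Disallow: /search
Allow: /search/about
```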

You may consider using Google's Custom Search API instead, which complies with their Terms of Service.
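As an illustration, a request to the Custom Search JSON API is a plain HTTPS GET. The sketch below only assembles the request URL; `YOUR_API_KEY` and `YOUR_CX` are placeholders for the credentials and search-engine id you create in the Google Cloud console, and the class name is my own:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/**
 * Sketch: building a Google Custom Search JSON API request URL.
 * YOUR_API_KEY / YOUR_CX below are placeholders, not real credentials.
 */
public class CustomSearchUrl {
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    /** Percent-encodes each parameter and assembles the GET URL. */
    static String buildQueryUrl(String apiKey, String cx, String query) {
        return ENDPOINT
                + "?key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8)
                + "&cx="  + URLEncoder.encode(cx, StandardCharsets.UTF_8)
                + "&q="   + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same "obama" query as the question; fetch the URL with any HTTP client.
        System.out.println(buildQueryUrl("YOUR_API_KEY", "YOUR_CX", "\"obama\""));
    }
}
```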

Regarding "java - How to fit the URL I want to crawl in crawler4j", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39461257/
