java.lang.IllegalArgumentException when using Jsoup in Java

Tags: java jsoup web-crawler

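I wrote code that scrapes the images from a web page and saves them locally. For some reason I get an error that I am not sure how to fix.

I use a method to make sure that every image I index actually exists, so I am not sure why this happens.

Here is my code: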

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.net.*;
import java.awt.Image;
import java.awt.image.RenderedImage;
import java.io.*;

import java.io.IOException;

import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class jsoup {
    public static void main(String[] args) throws IOException {
    crawl("http://www.istockphoto.com/photo");
}

public static void crawl(String crawlurl) throws IOException{
    Document doc = Jsoup.connect(crawlurl).get();
    getImgFromLinks(doc);
}

public static void getImgFromLinks(Document doc) throws IOException{
    Elements links = doc.select("a[href]");
    //System.out.println(links);

    for(int i=0;i<links.size();i++){
        if(exists(links.get(i).attr("href"))){
            System.out.println("crawled: " + links.get(i).attr("href"));
            getImages(doc, links.get(i).attr("href"));
        }else{
            System.out.println("I couldnt crawl: " + links.get(i).attr("href"));
        }
    }
}

public static String smartUrl(String url, String src) {
    if(exists(src)){
        return(src);
    }else{
        return(url + src);
    }
}


public static void getImages(Document doc, String url) throws IOException{



      for(int i=0; i<doc.getElementsByTag("img").size();i++){
            Element image = doc.select("img").get(i);
            String imgsrc = image.attr("src");
            if(imgsrc.toLowerCase().contains("png") || imgsrc.toLowerCase().contains("jpg") || imgsrc.toLowerCase().contains("jpeg") || imgsrc.toLowerCase().contains("gif")){

            int slashIndex = smartUrl(url, imgsrc).lastIndexOf('/');
            String finalUrl = smartUrl(url, imgsrc).substring(slashIndex);

            URL imgurl = new URL(smartUrl(url, imgsrc));

            if(exists(imgurl.toString())){
            Image crawledimg = ImageIO.read(imgurl);


            ImageIO.write((RenderedImage) crawledimg, "gif",new File("/Users/Jonathan/Desktop/crawledimages" + finalUrl));


            System.out.println("I got an image from:" + url + " Image Name: " + finalUrl);
            }

        }
        }


}


public static boolean exists(String URLName) {
    try {
      HttpURLConnection.setFollowRedirects(false);

    //HttpURLConnection.setInstanceFollowRedirects(false);
      HttpURLConnection con =
         (HttpURLConnection) new URL(URLName).openConnection();
      con.setRequestMethod("HEAD");
      return (con.getResponseCode() == HttpURLConnection.HTTP_OK);
    }
    catch (Exception e) {
       return false;
    }
  }
}

Here is the output:

crawled: http://www.istockphoto.com/
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /facebook.png
I got an image from:http://www.istockphoto.com/ Image Name: /twitter.png
I got an image from:http://www.istockphoto.com/ Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/ Image Name: /cartWhite.png
I couldnt crawl: #
I couldnt crawl: http://www.istockphoto.com/sign-in/aHR0cCUzQSUyRiUyRnd3dy5pc3RvY2twaG90by5jb20lMkZwaG90bw==
I couldnt crawl: http://www.istockphoto.com/join/aHR0cCUzQSUyRiUyRnd3dy5pc3RvY2twaG90by5jb20lMkZwaG90bw==
crawled: http://www.istockphoto.com/photo
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
I got an image from:http://www.istockphoto.com/photo Image Name: /facebook.png
I got an image from:http://www.istockphoto.com/photo Image Name: /twitter.png
I got an image from:http://www.istockphoto.com/photo Image Name: /blank.gif
 Exception in thread "main" java.lang.IllegalArgumentException: im == null!
at javax.imageio.ImageIO.write(ImageIO.java:1457)
at javax.imageio.ImageIO.write(ImageIO.java:1527)
at jsoup.getImages(jsoup.java:68)
at jsoup.getImgFromLinks(jsoup.java:34)
at jsoup.crawl(jsoup.java:24)
at jsoup.main(jsoup.java:19)

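Images are saved correctly up to the point where the error occurs.

Does anyone know how to fix this?

Also, for some reason the same images on a page are saved multiple times.

Thanks for your time,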

Jonathan Oren.

Best answer

It looks like null is being passed to ImageIO.write().

The smartUrl function has a flaw that you need to fix: it does not build the expected URL from the image src taken from the web page.

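For example, smartUrl turns /static/images/cartWhite.png into http://www.istockphoto.com/photo/static/images/cartWhite.png. That URL is not an image, but it is not an error page either, so the exists() check presumably passes. ImageIO.read() returns null when it cannot decode what it downloads, so crawledimg is null, and passing it to ImageIO.write() throws the IllegalArgumentException.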
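Independently of the URL fix, the crash itself can be avoided by checking the result of ImageIO.read() before writing. The following is a minimal sketch, not the asker's code: the class name SafeImageSaver and the method saveIfImage are hypothetical, and the method takes an InputStream rather than a URL so that it can be tried without network access.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper: saves the stream only if it really decodes as an image.
class SafeImageSaver {

    /**
     * Decodes the stream as an image and writes it to target in the given format.
     * Returns false (and writes nothing) when the bytes are not a decodable image.
     */
    static boolean saveIfImage(InputStream in, String format, File target) {
        try {
            // ImageIO.read returns null when no registered ImageReader can decode the data.
            BufferedImage img = ImageIO.read(in);
            if (img == null) {
                return false;
            }
            // ImageIO.write returns false when no writer is available for the format.
            return ImageIO.write(img, format, target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the crawler, one could call saveIfImage(imgurl.openStream(), "gif", file) and simply skip the URL when it returns false.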

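A quick workaround is to build, inside getImages(), a base URL that contains only http://www.istockphoto.com, and to prepend that to the image src instead of the full page URL.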
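A more general alternative to hard-coding the host is to resolve the src against the page URL, the way a browser does. The sketch below uses java.net.URI.resolve from the standard library; the class name ImageUrlResolver is hypothetical. Jsoup also offers Element.absUrl("src"), which should give a similar result when the document was loaded with Jsoup.connect(...).get().

```java
import java.net.URI;

// Hypothetical replacement for smartUrl: resolves src relative to the page URL.
class ImageUrlResolver {

    /**
     * Returns the absolute URL for src as seen from pageUrl.
     * A src starting with "/" replaces the whole path of pageUrl,
     * and an already absolute src is returned unchanged.
     * Throws IllegalArgumentException if either string is not a valid URI
     * (for example, if src contains unescaped spaces).
     */
    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }
}
```

With this, resolve("http://www.istockphoto.com/photo", "/static/images/cartWhite.png") yields http://www.istockphoto.com/static/images/cartWhite.png rather than the broken /photo/static/... URL.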

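The images are saved multiple times because every page contains them. You can avoid this by keeping a list of the images you have already saved.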
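One way to keep such a list is a Set of image URLs, since Set.add reports whether the element was new. This is a small sketch under that assumption; the class name SeenImages and the method markIfNew are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers which image URLs were already handled.
class SeenImages {
    private final Set<String> seen = new HashSet<>();

    /** Returns true the first time a URL is offered and false on every repeat. */
    boolean markIfNew(String imageUrl) {
        // Set.add returns false when the element is already present.
        return seen.add(imageUrl);
    }
}
```

The crawler would create one SeenImages instance for the whole run and download an image only when markIfNew returns true for its resolved URL.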

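I also found another showstopper in your code: you will not be able to retrieve any of the other images from the pages you crawl. The images on the site do not end in .jpg, .png and so on, so before going further you need to study the pattern of the image URLs on the site.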
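If the URL pattern turns out to be hard to pin down, a different technique from the one suggested above is to ask the server for the Content-Type instead of inspecting the file extension. The sketch below assumes the server answers HEAD requests with an accurate Content-Type header, which is not guaranteed; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

// Hypothetical helper: decides by Content-Type rather than by file extension.
class ImageTypeCheck {

    /** True for header values such as "image/png" or "image/gif". */
    static boolean isImageContentType(String contentType) {
        return contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("image/");
    }

    /** Sends a HEAD request and reports whether the response is 200 with an image type. */
    static boolean looksLikeImage(String url) {
        try {
            HttpURLConnection con =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            con.setRequestMethod("HEAD");
            try {
                return con.getResponseCode() == HttpURLConnection.HTTP_OK
                        && isImageContentType(con.getContentType());
            } finally {
                con.disconnect();
            }
        } catch (IOException | IllegalArgumentException | ClassCastException e) {
            // Malformed URL, non-HTTP scheme, or network failure: treat as "not an image".
            return false;
        }
    }
}
```

looksLikeImage could stand in for the extension test in getImages(), at the cost of one extra request per candidate URL.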

This question, "java.lang.IllegalArgumentException when using Jsoup in Java", comes from Stack Overflow: https://stackoverflow.com/questions/15032011/
