java - JSOUP java.io.IOException : Input is binary and unsupported

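I have a project that requires web scraping with JSOUP. I can get data from the home page of the site I want to scrape. However, when I loop through its hyperlinks and visit each one to scrape deeper pages, I get the following error: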

java.io.IOException: Input is binary and unsupported
    at org.jsoup.UncheckedIOException.<init>(UncheckedIOException.java:11)
    at org.jsoup.parser.CharacterReader.<init>(CharacterReader.java:38)
    at org.jsoup.parser.CharacterReader.<init>(CharacterReader.java:43)
    at org.jsoup.parser.TreeBuilder.initialiseParse(TreeBuilder.java:38)
    at org.jsoup.parser.HtmlTreeBuilder.initialiseParse(HtmlTreeBuilder.java:65)
    at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:46)
    at org.jsoup.parser.Parser.parseInput(Parser.java:35)
    at org.jsoup.helper.DataUtil.parseInputStream(DataUtil.java:169)
    at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:835)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:285)

When I inspected the site, I found that part of it contains commented-out binary data, which I suspect is causing the problem. I tried this code:

Document docs2 = Jsoup.connect("https://www.kiatravels.co.id/group_tour/index?TOUR_ID=1467&ID=15803").ignoreContentType(true).get();

But it still did not work.

I hope an experienced coder can help!

Best answer

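It looks like you are navigating to the "Download itinerary" link, which opens a PDF file. Before parsing a link with Jsoup, you need to check the content type of the URL's response.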

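// ignoreContentType(true) stops execute() from rejecting non-text responses,
// so the content type can be inspected before deciding whether to parse.
Connection.Response res = Jsoup.connect(url).ignoreContentType(true).execute();
String contentType = res.contentType();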

You will probably want to skip any response whose MIME type is not text/html.
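
The content-type check above could be wrapped in a small helper. The following is only a sketch: the class name ContentTypeFilter and the method isHtml are hypothetical and not part of Jsoup, and treating application/xhtml+xml as parseable is an assumption you may not want.

```java
import java.util.Locale;

/** Hypothetical helper (not a Jsoup class): decides whether a Content-Type value denotes HTML. */
public final class ContentTypeFilter {
    private ContentTypeFilter() {}

    /**
     * Returns true when the header value names an HTML document.
     * Parameters such as "; charset=UTF-8" are ignored, and a missing header counts as non-HTML.
     */
    public static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Keep only the MIME type, dropping any parameters after the first semicolon.
        int semicolon = contentType.indexOf(';');
        String mime = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        // Assumption: XHTML is accepted alongside text/html.
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }
}
```

Inside the crawl loop, one might call res.parse() only when ContentTypeFilter.isHtml(res.contentType()) returns true, and skip the link otherwise, so that PDF downloads never reach the HTML parser.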

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/59676670/
