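I'm new to Jsoup, and I can't figure out why I get a 404 error when trying to fetch a page, even though the page is reachable from a browser and I'm not using any proxy. I tried the following code: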
private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url).get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}
I get this exception:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at ro.pago.ucl2015.UCLWebParser.connect(UCLWebParser.java:27)
at ro.pago.ucl2015.UCLWebParser.main(UCLWebParser.java:16)
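Best answer
The site does not seem to allow bots: it returns a 404 error response when it does not find a User-Agent header. The following works because it sets the User-Agent header: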
private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                .referrer("http://www.google.com")
                .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}
User Agent
The Hypertext Transfer Protocol (HTTP) identifies the client software originating the request, using a "User-Agent" header, even when the client is not operated by a user.
Referrer (I don't think this is necessary)
HTTP referer (originally a misspelling of referrer) is an HTTP header field that identifies the address of the webpage (i.e. the URI or IRI) that linked to the resource being requested.
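To see this kind of server-side check in isolation, here is a minimal sketch that uses only the JDK, with no Jsoup and no external network. It starts a throwaway local server that answers 404 unless the User-Agent starts with "Mozilla/", then requests it with two different User-Agent values. The class name `UserAgentGateDemo`, the "Mozilla/" prefix rule, and the "Java/1.8.0" value are assumptions made for illustration; the actual rule used by transfermarkt.co.uk is not known.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class UserAgentGateDemo {

    /** Starts a local server that returns 404 unless the User-Agent looks browser-like (assumed rule). */
    public static HttpServer startServer() throws IOException {
        // Port 0 lets the OS pick a free port; bind to loopback only.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean browserLike = ua != null && ua.startsWith("Mozilla/");
            byte[] body = (browserLike ? "<html>ok</html>" : "not found")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(browserLike ? 200 : 404, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    /** Sends a GET with the given User-Agent and returns the HTTP status code. */
    public static int statusFor(int port, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                URI.create("http://127.0.0.1:" + port + "/").toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer();
        try {
            int port = server.getAddress().getPort();
            System.out.println("Java-style UA:    " + statusFor(port, "Java/1.8.0"));
            System.out.println("Browser-style UA: " + statusFor(port,
                    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0"));
        } finally {
            server.stop(0);
        }
    }
}
```

Against this toy server, the first request gets 404 and the second gets 200, which mirrors the behavior reported in the question: the same URL succeeds or fails depending only on the User-Agent header.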
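For completeness, I suggest setting a timeout on your request. The default was 3 seconds when this was written (newer Jsoup releases document a 30-second default); if the server takes longer than the timeout, you get an exception. Below is your code with the timeout setter added. Set it to zero to wait indefinitely.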
private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                .referrer("http://www.google.com")
                .timeout(1000 * 5) // it's in milliseconds, so this means 5 seconds.
                .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}
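The timeout behavior can also be tried out locally. The following is a rough sketch using only the JDK's `HttpURLConnection` rather than Jsoup, so it illustrates the general idea and not Jsoup's internals: a local server deliberately delays its reply, and the client either waits (timeout 0, meaning no limit) or gives up with a `java.net.SocketTimeoutException`, which is a subclass of `IOException`. The class name `ReadTimeoutDemo`, the 300 ms delay, and the returned strings are assumptions chosen for the demo.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URI;

public class ReadTimeoutDemo {

    /** Starts a local server that waits delayMillis before sending an empty 200 response. */
    public static HttpServer startSlowServer(long delayMillis) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            try {
                Thread.sleep(delayMillis);
                exchange.sendResponseHeaders(200, -1); // -1 means no response body
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    /** Returns "status NNN" if the server answers in time, or "timed out" otherwise. */
    public static String fetch(int port, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                URI.create("http://127.0.0.1:" + port + "/").toURL().openConnection();
        conn.setConnectTimeout(timeoutMillis); // 0 is treated as "no limit"
        conn.setReadTimeout(timeoutMillis);
        try {
            return "status " + conn.getResponseCode();
        } catch (SocketTimeoutException e) {
            return "timed out";
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startSlowServer(300);
        try {
            int port = server.getAddress().getPort();
            System.out.println("timeout 0 (no limit): " + fetch(port, 0));
            System.out.println("timeout 100 ms:       " + fetch(port, 100));
        } finally {
            server.stop(0);
        }
    }
}
```

In the Jsoup code above, a timeout ends up in the `catch (IOException e)` branch for the same reason; adding a separate `catch (SocketTimeoutException e)` before it is one way to tell a slow server apart from other I/O failures.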
Regarding "java - Jsoup 404 error", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24475816/