On the page https://www.jogossantacasa.pt/web/Placard/placard I am trying to collect the links under Futebol->... . I can get the links, but the for loop below ends up scraping only one page. Thanks, everyone.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class main {
    static List<String> links = new ArrayList<>();
    static List<String> ligas = new ArrayList<>();
    static String url = "https://www.jogossantacasa.pt"; // main link

    public static void main(String[] args) {
        Document doc;
        // Collect the links here
        try {
            doc = Jsoup.connect(url + "/web/Placard/placard").get();
            Elements a = doc.getElementsByClass("width9");
            boolean qwerty = true;
            for (Element ele : a) {
                Elements k = ele.select("li");
                for (Element d : k) {
                    String hj = d.select("a").text();
                    if (hj.contains("Ténis")) qwerty = false; // stop collecting once past the football section
                    if (qwerty) {
                        if (!hj.contains("Futebol")) {
                            links.add(d.select("a").attr("href"));
                            ligas.add(hj);
                        }
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Here I try to scrape each country page; the problem is that only the last page is scraped
        for (int i = 0; i < links.size(); i++) {
            String urlEach = url + links.get(i);
            Document docEach;
            try {
                docEach = Jsoup.connect(urlEach).get();
                System.out.println(docEach.toString());
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Best answer
The first page (/web/Placard/eventos?id=23316) is large, over 3 MB. By default Jsoup downloads only the first 1 MB of the file. To overcome this limit, set a higher maxBodySize on the connection, or 0 to disable the limit entirely.
docEach = Jsoup.connect(urlEach).maxBodySize(10*1024*1024).get(); // 10MB
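The effect of such a cap can be illustrated with plain Java I/O. This is a sketch of the general idea, not Jsoup's actual implementation: a reader that stops after a byte limit silently drops everything past the cap, which is why a large page appears to parse "successfully" but is missing most of its markup.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustration only (not Jsoup internals): a maxBodySize-style limit
// silently truncates whatever exceeds the cap.
public class BodySizeDemo {
    // Read at most maxBytes from the stream; 0 means no limit (like Jsoup's maxBodySize(0)).
    static String readCapped(InputStream in, int maxBytes) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (maxBytes > 0 && sb.length() >= maxBytes) break; // silent cut-off
            sb.append((char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<html>...3MB of markup...</html>".getBytes();
        String capped = readCapped(new ByteArrayInputStream(page), 10);
        String full   = readCapped(new ByteArrayInputStream(page), 0);
        System.out.println(capped.length()); // 10 -- closing tags are gone
        System.out.println(full.length());   // 32 -- whole document kept
    }
}
```

With the cap in place the truncated string no longer contains the closing `</html>`, just as the truncated 1 MB download loses the later event links on the page.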
On "java - why can't I get all the pages", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57183403/