java - Jsoup Reddit Image Scraper Over 18 issue

Tags: java web-scraping jsoup reddit

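I am developing an image scraper that uses Jsoup to scrape the first page of various subreddits. The problem is that when I try to scrape an NSFW subreddit, Reddit redirects to the over-18 verification page, and the scraper scrapes that verification page instead. I'm new to scraping and know this is a beginner question, but any help would be greatly appreciated, as I'm completely lost.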

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class javascraper {
    public static final String USER_AGENT = "<User-Agent: github.com/dabeermasood:v1.2.3 (by /u/swedenotswiss)>";

    public static void main(String[] args) {
        Scanner scan = new Scanner(System.in);
        System.out.println("Where do you want to store the files?");
        String folderpath = scan.next();
        System.out.println("What subreddit do you want to scrape?");
        String subreddit = scan.next();
        String subredditUrl = "http://reddit.com/r/" + subreddit;

        // Create <folderpath>/<subreddit>; use the bare name, not the URL, as the directory name.
        File targetDir = new File(folderpath, subreddit);
        targetDir.mkdirs();

        try {
            // Fetch and parse the subreddit's first page.
            // timeout(0) disables the timeout, so the request may wait indefinitely.
            Document doc = Jsoup.connect(subredditUrl).userAgent(USER_AGENT).timeout(0).get();

            // Print the page title.
            String title = doc.title();
            System.out.println("title : " + title);

            // Collect every anchor that has an href attribute.
            Elements links = doc.select("a[href]");

            for (Element link : links) {
                // Resolve the href against the page URL so relative links become absolute.
                String checkLink = link.absUrl("href");
                if (imgCheck(checkLink)) { // only follow links that look like images
                    System.out.println("link : " + checkLink);
                    downloadImages(checkLink, targetDir.getPath());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }


    // Returns true when the link contains an image extension or "gfycat" anywhere in the URL.
    // "jpeg" has no leading period, so it matches both ".jpeg" and any other occurrence of "jpeg".
    public static boolean imgCheck(String http) {
        return http.contains(".png") || http.contains("gfycat") || http.contains(".jpg")
                || http.contains("jpeg") || http.contains(".gif");
    }



    private static void downloadImages(String src, String folderpath) throws IOException {
        // Drop a trailing slash so the last path segment is the file name.
        if (src.endsWith("/")) {
            src = src.substring(0, src.length() - 1);
        }

        // Extract the image name from the URL (the text after the last '/').
        String name = src.substring(src.lastIndexOf("/") + 1);
        System.out.println(name);

        // Delay between requests to comply with rate limiting.
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        // Open the connection and identify the client.
        URLConnection connection = new URL(src).openConnection();
        connection.setRequestProperty("User-Agent", USER_AGENT);

        // Copy the response body to <folderpath>/<name>; both streams are closed automatically.
        try (InputStream in = connection.getInputStream();
             OutputStream out = new BufferedOutputStream(new FileOutputStream(new File(folderpath, name)))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }



}

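Best answer

I have posted an answer on authenticating to a server with Jsoup at this link. Basically, you need to POST your login ID, password, and any other required data to the server using:

Connection.Response res = Jsoup.connect(url).data(...).method(Method.POST).execute();

and then save the cookies from the server's response to keep your session authenticated.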
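One detail worth noting: the question's downloadImages method uses a plain URLConnection, which does not share cookies with Jsoup. The following is a minimal, JDK-only sketch of the "save the response cookies and send them back" step, not Jsoup's own implementation. The class name SessionCookies, its two methods, and the sample header values are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Jsoup or the original question.
public class SessionCookies {

    // Extracts name=value pairs from raw Set-Cookie header values,
    // ignoring attributes such as Path, Domain, or HttpOnly.
    public static Map<String, String> parseSetCookie(List<String> setCookieHeaders) {
        Map<String, String> cookies = new LinkedHashMap<>();
        if (setCookieHeaders == null) {
            return cookies;
        }
        for (String header : setCookieHeaders) {
            // The cookie itself is everything before the first ';'.
            String pair = header.split(";", 2)[0].trim();
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return cookies;
    }

    // Builds the value of a Cookie request header, e.g. "a=1; b=2".
    public static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Made-up Set-Cookie values, for illustration only.
        List<String> headers = Arrays.asList(
                "session=abc123; Path=/; HttpOnly",
                "theme=dark; Domain=.example.com");
        Map<String, String> cookies = parseSetCookie(headers);
        System.out.println(toCookieHeader(cookies)); // prints "session=abc123; theme=dark"
    }
}
```

With Jsoup itself, res.cookies() returns the saved cookies as a Map<String, String>, and that map can be attached to later requests with Jsoup.connect(subreddit).cookies(cookies). For the URLConnection in downloadImages, the same map could be sent by calling connection.setRequestProperty("Cookie", SessionCookies.toCookieHeader(cookies)) before the input stream is opened.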

Regarding java - Jsoup Reddit Image Scraper Over 18 issue, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33738867/
