java - "java.net.SocketTimeoutException: Read timed out" on one specific page

Tags: java web-scraping jsoup

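I wrote a web scraper that pulls data from a page and stores it in a .csv file. I have run the program against several pages without problems, but one page fails with "java.net.SocketTimeoutException: Read timed out" on the line where I create the jsoup connection. I don't understand why the error occurs only on that particular page. My code and the log are below.
Note: I am using the jsoup HTML parser, Java 1.7, and NetBeans.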

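import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.Locale;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ComOpen_end_fund {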

    boolean writeCSVToConsole = true;
    boolean writeCSVToFile = true;
    boolean sortTheList = true;
    boolean writeToConsole;
    boolean writeToFile;
    public static Document doc = null;
    public static Elements tbodyElements = null;
    public static Elements elements = null;
    public static Elements tdElements = null;
    public static Elements trElement2 = null;
    public static String Dcomma = ",";
    public static String line = "";
    public static ArrayList<Elements> sampleList = new ArrayList<Elements>();

    public static void createConnection() throws IOException {
        System.setProperty("http.proxyHost", "191.1.1.202");
        System.setProperty("http.proxyPort", "8080");
        String tempUrl = "http://mufap.com.pk/nav-report.php?tab=01&fname=&amc=&cat=&strdate=&endate=&submitted=&mnt=&yrs=&s=";
        doc = Jsoup.connect(tempUrl).get(); //this is line number 42
    }

    public static void parsingHTML() throws Exception {
        // Start from a fresh file once, instead of deleting it for every row.
        File csv = new File("C:\\open-end-fund.csv");
        csv.delete();
        FileWriter sb = new FileWriter(csv, true);
        try {
            for (Element table : doc.getElementsByTag("table")) {
                for (Element trElement : table.getElementsByTag("tr")) {
                    tdElements = trElement.getElementsByTag("td");
                    if (trElement.hasClass("tab-data")) {
                        // Write one CSV line per data row, cells separated by commas.
                        for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
                            sb.append(formatData(it.next().text()));
                            if (it.hasNext()) {
                                sb.append(Dcomma);
                            }
                        }
                        sb.append("\r\n");
                    }
                    sampleList.add(tdElements);
                }
            }
        } finally {
            // Close the writer exactly once, after every row has been written.
            sb.close();
        }
    }
    private static final SimpleDateFormat FORMATTER_MMM_d_yyyy = new SimpleDateFormat("MMM d, yyyy", Locale.US);
    // "yyyy" is the calendar year; "YYYY" is the week year and can be off by one around New Year.
    private static final SimpleDateFormat FORMATTER_dd_MMM_yyyy = new SimpleDateFormat("dd-MMM-yyyy", Locale.US);

    public static String formatData(String text) {
        String tmp = null;

        try {
            Date d = FORMATTER_MMM_d_yyyy.parse(text);
            tmp = FORMATTER_dd_MMM_yyyy.format(d);
        } catch (ParseException pe) {
            tmp = text;
        }

        return tmp;
    }

    public static void main(String[] args) throws IOException, Exception {
        createConnection(); //this is line number 100
        parsingHTML();

    }

}

Here is the log:

Exception in thread "main" java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1536)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:516)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:194)
    at com.open_end_fund.ComOpen_end_fund.createConnection(ComOpen_end_fund.java:42)
    at com.open_end_fund.ComOpen_end_fund.main(ComOpen_end_fund.java:100)
C:\Users\talha\AppData\Local\NetBeans\Cache\8.1\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 3 seconds)

When I run the same code against http://www.mufap.com.pk/nav_returns_performance.php?tab=01,
that link works fine.

Best answer

You can try increasing the timeout:

Jsoup.connect(url).timeout(30000).get();

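This sets the timeout to 30 seconds. The default was 3 seconds in the jsoup releases current when this was written (newer releases document a 30-second default). A value of 0 is treated as an infinite timeout.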
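To make the meaning of "Read timed out" concrete, here is a minimal sketch that uses only the JDK's HttpURLConnection, the class the stack trace shows jsoup calling into. It assumes a throwaway local server that accepts connections but never replies; the names ReadTimeoutDemo and timesOut and the 300 ms value are illustrative, not part of jsoup or of the question's code.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URL;

public class ReadTimeoutDemo {

    /** Returns true if the request fails because no response arrives in time. */
    static boolean timesOut(URL url, int readTimeoutMillis) throws IOException {
        // NO_PROXY keeps the local test request away from any configured proxy.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
        conn.setConnectTimeout(2000);           // limit for opening the connection
        conn.setReadTimeout(readTimeoutMillis); // limit for waiting on response data
        try {
            conn.getResponseCode();
            return false;
        } catch (SocketTimeoutException e) {
            return true;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // The OS completes the TCP handshake, but nothing ever answers the request.
        ServerSocket silent = new ServerSocket(0);
        try {
            URL url = new URL("http://127.0.0.1:" + silent.getLocalPort() + "/");
            System.out.println("timed out: " + timesOut(url, 300));
        } finally {
            silent.close();
        }
    }
}
```

The connection itself succeeds here; only the wait for the response exceeds the limit, which is the situation a larger timeout value is meant to cover when a page is slow to generate.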

https://jsoup.org/apidocs/org/jsoup/Connection.html#timeout-int-
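If the page is only slow some of the time, a longer timeout can be combined with a few retries. The following is a sketch under that assumption, not something jsoup provides: TimeoutRetry, withRetry, maxAttempts and pauseMillis are hypothetical names, and the Callable in main is a stand-in for the real jsoup call so the example runs without network access. It uses an anonymous class rather than a lambda to stay compatible with Java 1.7.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

public class TimeoutRetry {

    /**
     * Runs the task and retries only when it fails with SocketTimeoutException.
     * Any other exception is propagated immediately.
     */
    public static <T> T withRetry(Callable<T> task, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (SocketTimeoutException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis); // brief pause before the next attempt
                }
            }
        }
        throw last; // every attempt timed out
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        String result = withRetry(new Callable<String>() {
            @Override
            public String call() throws Exception {
                // Stand-in for: return Jsoup.connect(url).timeout(30000).get();
                if (++calls[0] < 3) {
                    throw new SocketTimeoutException("Read timed out");
                }
                return "page";
            }
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the question's code, the body of call() would return the Document from Jsoup.connect(tempUrl).timeout(30000).get(), and createConnection() would assign the result of withRetry to doc.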

This question, java - "java.net.SocketTimeoutException: Read timed out" on one specific page, comes from Stack Overflow: https://stackoverflow.com/questions/36862665/
