javascript - Crawling through JavaScript redirects

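I am writing a spider in Java and have run into some trouble handling URL redirects. So far I have come across two kinds of URL redirect. The first is a redirect with an HTTP 3xx response code, which I can handle by following this answer.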

In the second kind, however, the server returns HTTP response code 200 and the page contains only some JavaScript code, like this:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script>
function detectmob() { 
    var u=(document.URL);
    if( navigator.userAgent.match(/Android/i) || some other browser...){
        window.location.href="web/mobile/index.php";
    } else {
        window.location.href="web/desktop/index.php";
    }
}

detectmob();
</script>
</head>
<body></body></html>

If the original URL is http://example.com, it automatically redirects to http://example.com/web/desktop/index.php when I use a desktop web browser with JavaScript enabled.

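However, my spider checks HttpURLConnection#getResponseCode() to see whether it has reached the final URL, which it assumes on HTTP response code 200, and uses URLConnection#getHeaderField() to get the Location field when it receives HTTP response code 3xx. Here is the code snippet from my spider: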

public String getFinalUrl(String originalUrl) {
    try {
        HttpURLConnection hCon = (HttpURLConnection) new URL(originalUrl).openConnection();
        // Handle redirects manually instead of letting the connection follow them.
        hCon.setInstanceFollowRedirects(false);
        int code = hCon.getResponseCode();
        if (code == HttpURLConnection.HTTP_MOVED_PERM
                || code == HttpURLConnection.HTTP_MOVED_TEMP) {
            String location = hCon.getHeaderField("Location");
            System.out.println("redirected url: " + location);
            return getFinalUrl(location);
        }
    } catch (IOException ex) {
        System.err.println(ex.toString());
    }

    // Any other status (including 200) is treated as the final URL.
    return originalUrl;
}

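As a result, fetching the page above yields HTTP response code 200, so my spider assumes there are no further redirects and starts parsing a page whose text content is empty.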

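I googled this problem a bit, and apparently javax.script is somewhat related, but I have no idea how to make it work. How can I program my spider so that it gets the correct URL?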

Best answer

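Here is a solution that uses Apache HttpClient to handle the response-code redirects, Jsoup to extract the javascript from the html, and regular expressions to pull the redirect string out of several of the ways a redirect can be written in javascript.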

package com.yourpackage;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import com.google.common.base.Joiner;
import com.google.common.net.HttpHeaders;

public class CrawlHelper {

  /**
   * Get end contents of a urlString. Status code is not checked here because
   * org.apache.http.client.HttpClient effectively handles the 301 redirects.
   * 
   * Javascript is extracted using Jsoup, and checked for references to
   * "window.location.replace".
   * 
   * @param urlString Url. "http" will be prepended if https or http not already there.
   * @return Result after all redirects, including javascript.
   * @throws IOException
   */
  public String getResult(final String urlString) throws IOException {
    String html = getTextFromUrl(urlString);
    Document doc = Jsoup.parse(html);
    for (Element script : doc.select("script")) {
      String potentialURL = getTargetLocationFromScript(urlString, script.html());
      if (potentialURL.indexOf("/") == 0) {
        potentialURL = Joiner.on("").join(urlString, potentialURL);
      }
      if (!StringUtil.isBlank(potentialURL)) {
        return getTextFromUrl(potentialURL);
      }
    }
    return html;
  }

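  /**
   * Finds a redirect target in a script and makes it absolute.
   *
   * @param urlString Will be prepended if the target location doesn't start with "http".
   * @param js Javascript to scan.
   * @return Target that matches window.location.replace or window.location.href assignments,
   *         or an empty string if the script contains no redirect.
   * @throws IOException
   */
  String getTargetLocationFromScript(String urlString, String js) throws IOException {
    String potentialURL = getTargetLocationFromScript(js);
    // No redirect found: return blank so the caller doesn't fetch urlString a second time.
    if (StringUtil.isBlank(potentialURL) || potentialURL.indexOf("http") == 0) {
      return potentialURL;
    }
    // Join with exactly one "/" between the base and the relative target.
    String base = urlString.endsWith("/") ? urlString.substring(0, urlString.length() - 1) : urlString;
    String path = potentialURL.startsWith("/") ? potentialURL : "/" + potentialURL;
    return base + path;
  }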

  String getTargetLocationFromScript(String js) throws IOException {
    int i = js.indexOf("window.location.replace");
    if (i > -1) {
      return getTargetLocationFromLocationReplace(js);
    }
    i = js.indexOf("window.location.href");    
    if (i > -1) {
      return getTargetLocationFromHrefAssign(js);
    }
    return "";
  }

  private String getTargetLocationFromHrefAssign(String js) {
    // [^\"']+ stops at the first closing quote, so two string literals on one line don't merge.
    return findTargetFrom("window\\.location\\.href\\s*=\\s*[\"']([^\"']+)[\"']", js);
  }

  private String getTargetLocationFromLocationReplace(String js) throws IOException {
    return findTargetFrom("window\\.location\\.replace\\(\\s*[\"']([^\"']+)[\"']\\s*\\)", js);
  }

  private String findTargetFrom(String regex, String js) {
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(js);
    while (m.find()) {
      String potentialURL = m.group(1);
      if (!StringUtil.isBlank(potentialURL)) {
        return potentialURL;
      }
    }
    return "";
  }

  private String getTextFromUrl(String urlString) throws IOException {
    if (StringUtil.isBlank(urlString)) {
      throw new IOException("Supplied URL value is empty.");
    }
    String httpUrlString = prependHTTPifNecessary(urlString);
    HttpClient client = HttpClientBuilder.create().build();
    HttpGet request = new HttpGet(httpUrlString);
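    // HttpHeaders.USER_AGENT is the header name ("User-Agent"), not a value; the value below is a placeholder.
    request.addHeader(HttpHeaders.USER_AGENT, "Mozilla/5.0 (compatible; CrawlHelper/1.0)");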
    HttpResponse response = client.execute(request);
    try (BufferedReader rd =
        new BufferedReader(new InputStreamReader(response.getEntity().getContent()))) {
      StringWriter result = new StringWriter();
      String line = "";
      while ((line = rd.readLine()) != null) {
        result.append(line);
      }
      return result.toString();
    }
  }

  private String prependHTTPifNecessary(String urlString) throws IOException {
    if (urlString.indexOf("http") != 0) {
      return Joiner.on("://").join("http", urlString);
    }
    return validateURL(urlString);
  }

  private String validateURL(String urlString) throws IOException {
    try {
      new URL(urlString);
    } catch (MalformedURLException mue) {
      throw new IOException(mue);
    }
    return urlString;
  }
}
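
The class above builds the redirect URL by string concatenation, which is fine for simple cases but will not cope with targets such as "../other/page.php" or with a page URL that already has a path. A possible alternative, sketched here with only the JDK, is to resolve the target against the page URL with java.net.URI. The class name RedirectTargetResolver and its resolve method are hypothetical and not part of CrawlHelper, and the sketch assumes both arguments are syntactically valid URI strings (URI.create throws IllegalArgumentException otherwise).

```java
import java.net.URI;

final class RedirectTargetResolver {

  /**
   * Resolves a redirect target found in a script against the URL of the page
   * that contained it. Hypothetical helper, not part of CrawlHelper.
   */
  static String resolve(String pageUrl, String target) {
    URI base = URI.create(pageUrl);
    // java.net.URI mishandles a base with an empty path ("http://example.com"),
    // so normalize it to "http://example.com/" before resolving.
    if (base.getPath() == null || base.getPath().isEmpty()) {
      base = base.resolve("/");
    }
    // Absolute targets are returned unchanged; relative ones are resolved.
    return base.resolve(target).toString();
  }

  public static void main(String[] args) {
    // The example from the question: a relative target on a bare host.
    System.out.println(resolve("http://example.com", "web/desktop/index.php"));
    // A root-relative target replaces the whole path of the page URL.
    System.out.println(resolve("http://example.com/web/", "/s/index.html"));
  }
}
```

Note that this expects a full URL including the scheme, so a bare "somesite.com" would need "http://" prepended first, as prependHTTPifNecessary does above.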

TDD... modify/enhance to match various scenarios:

package com.yourpackage;

import java.io.IOException;

import org.junit.Assert;
import org.junit.Test;

public class CrawlHelperTest {

  @Test
  public void testRegex() throws IOException {
    String targetLoc = 
    new CrawlHelper().getTargetLocationFromScript("somesite.com", "function goHome() { window.location.replace(\"/s/index.html\")}");
    Assert.assertEquals("somesite.com/s/index.html", targetLoc);
    targetLoc = 
        new CrawlHelper().getTargetLocationFromScript("window.location.href=\"web/mobile/index.php\";");
    Assert.assertEquals("web/mobile/index.php", targetLoc);
  }

  @Test
  public void testCrawl() throws IOException {
    Assert.assertTrue(new CrawlHelper().getResult("somesite.com").indexOf("someExpectedContent") > -1);
  }

}
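
As one example of such an enhancement, the script in the question has two branches, and a crawler that never runs the script cannot know which branch a browser would take. The sketch below, JDK only, collects every string-literal redirect target it can find so the caller can decide which to follow. The class ScriptRedirectScanner and its findTargets method are hypothetical names, and the pattern is an assumption about common redirect spellings (location.href or location assignment, location.replace, location.assign, with single or double quotes). It will not find URLs that are computed at runtime; those would need an actual script engine or a headless browser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptRedirectScanner {

  // Alternative 1: [window.]location[.href] = "target"   (groups 1 and 2)
  // Alternative 2: [window.]location.replace("target") or .assign("target")   (groups 3 and 4)
  // The backreferences \1 and \3 require the closing quote to match the opening one.
  private static final Pattern REDIRECT = Pattern.compile(
      "\\b(?:window\\.)?location(?:\\.href)?\\s*=\\s*([\"'])(.*?)\\1"
      + "|\\b(?:window\\.)?location\\.(?:replace|assign)\\(\\s*([\"'])(.*?)\\3\\s*\\)");

  /** Returns all non-blank string-literal redirect targets, in source order. */
  static List<String> findTargets(String js) {
    List<String> targets = new ArrayList<>();
    Matcher m = REDIRECT.matcher(js);
    while (m.find()) {
      // Exactly one of the two alternatives matched; pick its target group.
      String target = m.group(2) != null ? m.group(2) : m.group(4);
      if (target != null && !target.trim().isEmpty()) {
        targets.add(target);
      }
    }
    return targets;
  }

  public static void main(String[] args) {
    String js = "if (mobile) { window.location.href=\"web/mobile/index.php\"; }"
        + " else { window.location.href=\"web/desktop/index.php\"; }";
    // prints [web/mobile/index.php, web/desktop/index.php]
    System.out.println(findTargets(js));
  }
}
```

For the page in the question, a desktop-oriented crawler could pick the second target, or fetch both and keep whichever returns content.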

Regarding javascript - Crawling through JavaScript redirects, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43262723/
