javascript - Crawling through JavaScript redirects

Tags: javascript java web-crawler redirect

I'm writing a spider in Java and am having trouble handling URL redirects. So far I have run into two kinds. The first returns an HTTP 3xx response code, which I can handle by following this answer.

In the second kind, the server returns HTTP response code 200 and the page contains nothing but some JavaScript, like this:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script>
function detectmob() { 
    var u=(document.URL);
    if( navigator.userAgent.match(/Android/i) || some other browser...){
        window.location.href="web/mobile/index.php";
    } else {
        window.location.href="web/desktop/index.php";
    }
}

detectmob();
</script>
</head>
<body></body></html>

If the original URL is http://example.com, a JavaScript-enabled desktop browser is automatically redirected to http://example.com/web/desktop/index.php.

However, my spider decides whether it has reached the final URL by checking HttpURLConnection#getResponseCode(): a 200 means it has, while on a 3xx it reads the Location field with URLConnection#getHeaderField(). Here is the relevant snippet from my spider:

public String getFinalUrl(String originalUrl) {
        try {
            URLConnection con = new URL(originalUrl).openConnection();
            HttpURLConnection hCon = (HttpURLConnection) con;
            hCon.setInstanceFollowRedirects(false);
            if(hCon.getResponseCode() == HttpURLConnection.HTTP_MOVED_PERM 
                    || hCon.getResponseCode() == HttpURLConnection.HTTP_MOVED_TEMP) {
                System.out.println("redirected url: " + con.getHeaderField("Location"));
                return getFinalUrl(con.getHeaderField("Location"));
            }
        } catch (IOException ex) {
            System.err.println(ex.toString());
        }

        return originalUrl;
    }

So fetching the page above yields HTTP response code 200; my spider assumes there are no further redirects and starts parsing a page whose body text is empty.

I googled the problem, and javax.script seems somewhat relevant, but I don't know how to make it work. How can I program my spider so that it gets the correct URL?

Best answer

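Here is a solution that uses Apache HttpClient to handle the response-code redirects, Jsoup to extract the JavaScript from the HTML, and regular expressions to pull the redirect target out of several of the ways a redirect can be written in JavaScript.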

package com.yourpackage;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import com.google.common.base.Joiner;
import com.google.common.net.HttpHeaders;

public class CrawlHelper {

  /**
   * Get end contents of a urlString. Status code is not checked here because
   * org.apache.http.client.HttpClient effectively handles the 301 redirects.
   * 
   * Javascript is extracted using Jsoup, and checked for references to
   * "window.location.replace".
   * 
   * @param urlString Url. "http" will be prepended if https or http not already there.
   * @return Result after all redirects, including javascript.
   * @throws IOException
   */
  public String getResult(final String urlString) throws IOException {
    String html = getTextFromUrl(urlString);
    Document doc = Jsoup.parse(html);
    for (Element script : doc.select("script")) {
      // Use the single-argument overload here: the two-argument one prepends
      // urlString, so its result is never blank, even when there is no redirect.
      String target = getTargetLocationFromScript(script.html());
      if (StringUtil.isBlank(target)) {
        continue;
      }
      // Resolve relative targets such as "web/desktop/index.php" against the page URL.
      URL resolved = new URL(new URL(prependHTTPifNecessary(urlString)), target);
      return getTextFromUrl(resolved.toString());
    }
    return html;
  }

  /**
   * 
   * @param urlString Will be prepended if the target location doesn't start with "http".
   * @param js Javascript to scan.
   * @return Target that matches window.location.replace or window.location.href assignments.
   * @throws IOException
   */
  String getTargetLocationFromScript(String urlString, String js) throws IOException {
    String potentialURL = getTargetLocationFromScript(js);
    if (potentialURL.indexOf("http") == 0) {
      return potentialURL;
    }
    return Joiner.on("").join(urlString, potentialURL);
  }

  String getTargetLocationFromScript(String js) throws IOException {
    int i = js.indexOf("window.location.replace");
    if (i > -1) {
      return getTargetLocationFromLocationReplace(js);
    }
    i = js.indexOf("window.location.href");    
    if (i > -1) {
      return getTargetLocationFromHrefAssign(js);
    }
    return "";
  }

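  private String getTargetLocationFromHrefAssign(String js) {
    // [^\"]+ stops at the closing quote; a greedy .+ runs on to the last quote,
    // which matters because getTextFromUrl joins the whole page into one line.
    return findTargetFrom("window\\.location\\.href\\s*=\\s*\"([^\"]+)\"", js);
  }

  private String getTargetLocationFromLocationReplace(String js) throws IOException {
    return findTargetFrom("window\\.location\\.replace\\(\\s*\"([^\"]+)\"\\s*\\)", js);
  }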

  private String findTargetFrom(String regex, String js) {
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(js);
    while (m.find()) {
      String potentialURL = m.group(1);
      if (!StringUtil.isBlank(potentialURL)) {
        return potentialURL;
      }
    }
    return "";
  }

  private String getTextFromUrl(String urlString) throws IOException {
    if (StringUtil.isBlank(urlString)) {
      throw new IOException("Supplied URL value is empty.");
    }
    String httpUrlString = prependHTTPifNecessary(urlString);
    HttpClient client = HttpClientBuilder.create().build();
    HttpGet request = new HttpGet(httpUrlString);
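    // HttpHeaders.USER_AGENT is the header *name* ("User-Agent"); the value is a placeholder.
    request.addHeader(HttpHeaders.USER_AGENT, "Mozilla/5.0 (compatible; CrawlHelper)");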
    HttpResponse response = client.execute(request);
    try (BufferedReader rd =
        new BufferedReader(new InputStreamReader(response.getEntity().getContent()))) {
      StringWriter result = new StringWriter();
      String line = "";
      while ((line = rd.readLine()) != null) {
        result.append(line);
      }
      return result.toString();
    }
  }

  private String prependHTTPifNecessary(String urlString) throws IOException {
    if (urlString.indexOf("http") != 0) {
      return Joiner.on("://").join("http", urlString);
    }
    return validateURL(urlString);
  }

  private String validateURL(String urlString) throws IOException {
    try {
      new URL(urlString);
    } catch (MalformedURLException mue) {
      throw new IOException(mue);
    }
    return urlString;
  }
}
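
The class above joins the base URL and the extracted target by plain string concatenation, which only works when the target starts with a slash. As a rough, JDK-only sketch of the same idea, the snippet below extracts the first redirect target from a script and resolves it against the page URL with java.net.URI. The names JsRedirectResolver, findTarget and resolve are hypothetical, and the extra patterns (single quotes, location.assign, a bare location assignment) are assumptions that go beyond what the answer covers. Because this is a static scan, it cannot evaluate a user-agent check like the one in the question; it simply returns the first target in source order.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the answer's CrawlHelper. */
final class JsRedirectResolver {

  // Group 1: window.location.href = "..." or window.location = '...'
  // Group 2: location.replace("...") or location.assign("...")
  private static final Pattern REDIRECT = Pattern.compile(
      "(?:window\\.)?location(?:\\.href)?\\s*=\\s*[\"']([^\"']+)[\"']"
      + "|(?:window\\.)?location\\.(?:replace|assign)\\(\\s*[\"']([^\"']+)[\"']\\s*\\)");

  /** Returns the first redirect target found in the script, if any. */
  static Optional<String> findTarget(String js) {
    Matcher m = REDIRECT.matcher(js);
    if (!m.find()) {
      return Optional.empty();
    }
    return Optional.of(m.group(1) != null ? m.group(1) : m.group(2));
  }

  /** Resolves the first redirect target against an absolute http(s) page URL. */
  static Optional<String> resolve(String pageUrl, String js) {
    URI base = URI.create(pageUrl);
    // URI.resolve drops the separating slash when the base path is empty
    // ("http://example.com"), so normalize the base to ".../" first.
    if (base.getPath() == null || base.getPath().isEmpty()) {
      base = base.resolve("/");
    }
    final URI normalized = base;
    return findTarget(js).map(t -> normalized.resolve(t).toString());
  }
}
```

For the page in the question, resolve("http://example.com", script) yields the mobile URL, because that assignment appears first in the script. A spider that needs the desktop branch would have to either pick among all matches or actually execute the script. URI.create throws IllegalArgumentException for targets that are not valid URI references (for example ones containing spaces), so real input would need that case handled.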

TDD... modify/extend to match various scenarios:

package com.yourpackage;

import java.io.IOException;

import org.junit.Assert;
import org.junit.Test;

public class CrawlHelperTest {

  @Test
  public void testRegex() throws IOException {
    String targetLoc = 
    new CrawlHelper().getTargetLocationFromScript("somesite.com", "function goHome() { window.location.replace(\"/s/index.html\")}");
    Assert.assertEquals("somesite.com/s/index.html", targetLoc);
    targetLoc = 
        new CrawlHelper().getTargetLocationFromScript("window.location.href=\"web/mobile/index.php\";");
    Assert.assertEquals("web/mobile/index.php", targetLoc);
  }

  @Test
  public void testCrawl() throws IOException {
    Assert.assertTrue(new CrawlHelper().getResult("somesite.com").indexOf("someExpectedContent") > -1);
  }

}
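
One scenario the tests above do not cover is a chain of JavaScript redirects, where the redirected page itself redirects again. The sketch below is one possible way to follow such a chain with a hop limit and loop detection. JsRedirectFollower and finalUrl are hypothetical names, and the fetch parameter is an assumed stand-in for whatever HTTP client the spider uses, which also makes the logic testable without a network.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: follows chained JavaScript redirects up to a hop limit. */
final class JsRedirectFollower {

  // Matches window.location.href = "..." and window.location.replace("...").
  private static final Pattern TARGET = Pattern.compile(
      "window\\.location\\.(?:href\\s*=\\s*|replace\\(\\s*)[\"']([^\"']+)[\"']");

  /**
   * @param startUrl absolute URL with a non-empty path, e.g. "http://example.com/"
   * @param fetch    returns the HTML body for a URL (an HTTP client in real use)
   * @param maxHops  upper bound on the number of pages to fetch
   * @return the last URL reached
   */
  static String finalUrl(String startUrl, Function<String, String> fetch, int maxHops) {
    Set<String> seen = new HashSet<>();
    String current = startUrl;
    // seen.add returns false for a URL already visited, which ends the loop.
    for (int hop = 0; hop < maxHops && seen.add(current); hop++) {
      String body = fetch.apply(current);
      Matcher m = TARGET.matcher(body == null ? "" : body);
      if (!m.find()) {
        return current; // no recognizable redirect: this is the final page
      }
      current = URI.create(current).resolve(m.group(1)).toString();
    }
    return current; // hop limit reached or a redirect loop detected
  }
}
```

In a test, a Map of URL to HTML can play the role of the server, for example finalUrl("http://example.com/", pages::get, 5).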

Regarding "javascript - Crawling through JavaScript redirects", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43262723/
