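java - Google search in Java

Tags: java search

The program reads a text file of search queries, queries Google with each one, and writes all the resulting links to another file. It works for a few hundred queries, but then suddenly stops working and reports errors.

(I will edit this post and add which lines of my program the errors come from.)

Any ideas what might be happening?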

import java.io.*;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.Scanner;

public class GoogleSearcher {
  public static void main(String [] args) throws Exception {
    Scanner in = new Scanner (System.in);
    System.out.println("Input list of queries to search:");
    String loc = in.nextLine();
    loc = loc.replace("\\", "");
    System.out.println("Where to write file?");
    String writeLoc = in.nextLine();
    // strip shell-escape backslashes, the same way as for loc above
    writeLoc = writeLoc.replace("\\", "");
    FileInputStream fstream = new FileInputStream(loc);
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    String line;
    PrintWriter pw = new PrintWriter(new FileWriter(writeLoc + "Google Search Results.txt"));
    while ((line = br.readLine()) != null) {
      System.out.println("Searching: \"" + line + "\"");
      ArrayList<String> t = googleSearch(line);
      if (t != null){
        for (int a = 0; a < t.size(); a++){
          pw.write(t.get(a) + System.lineSeparator());
        }
      }
    }
    br.close();
    pw.close();
  }
  public static ArrayList<String> googleSearch(String search) throws Exception {
    try {
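      // URL-encode the whole query so characters such as '&', '#' and '+' survive, not just spaces
      String query = "https://www.google.com/search?q=" + java.net.URLEncoder.encode(search, "UTF-8");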
      String page = getSearchContent(query);
      ArrayList<String> links = parseLinks(page);
      return formatLinks(links);
    } catch (Exception e) { 
      e.printStackTrace();
      System.out.println("Error... Trying next search");
      return null;
    } 
  }
  public static ArrayList<String> formatLinks(ArrayList<String> a){
    ArrayList<String> formatted = new ArrayList<String>();
    for (int i = 0; i < a.size(); i++){
      String t = a.get(i);
      t = t.replace("%3F", "?");
      t = t.replace("%3D", "=");
      formatted.add(t);
    }
    return formatted;
  }
  public static String getString(InputStream is) {
    StringBuilder sb = new StringBuilder();
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String line;
    try {
      while ((line = br.readLine()) != null) {
        sb.append(line);
      }
    } catch (IOException e) {
      e.printStackTrace();
    } finally {
      if (br != null) {
        try {
          br.close();
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    }
    return sb.toString();
  }
  public static String getSearchContent(String path) throws Exception {
    final String agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
    URL url = new URL(path);
    final URLConnection connection = url.openConnection();
    connection.setRequestProperty("User-Agent", agent);
    final InputStream stream = connection.getInputStream();
    return getString(stream);
  }
  public static ArrayList<String> parseLinks(final String html) throws Exception {
    ArrayList<String> result = new ArrayList<String>();
    String pattern1 = "<h3 class=\"r\"><a href=\"/url?q=";
    String pattern2 = "\">";
    Pattern p = Pattern.compile(Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2));
    Matcher m = p.matcher(html);
    while (m.find()) {
      String domainName = m.group(0).trim();
      // remove unwanted text
      domainName = domainName.substring(domainName.indexOf("/url?q=") + 7);
      // keep the whole value if no "&amp;" separator follows the URL
      int end = domainName.indexOf("&amp;");
      if (end >= 0) {
        domainName = domainName.substring(0, end);
      }
      result.add(domainName);
    }
    return result;
  }
}

Best answer

Well, after running your program for a few rounds, I got the following error.

Error... Trying next search
Searching: "autoradiograph"
java.io.IOException: Server returned HTTP response code: 503 for URL: https://ipv4.google.com/sorry/index?continue=https://www.google.com/search%3Fq%3Daustria&q=EgTLe7ahGOKSrcMFIhkA8aeDSylzciRE9l0cz9fUg6u2MeGh-muxMgNyY24
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1876)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
    at application.GoogleSearcher.getSearchContent(GoogleSearcher.java:90)
    at application.GoogleSearcher.googleSearch(GoogleSearcher.java:45)
    at application.GoogleSearcher.main(GoogleSearcher.java:32)
java.io.IOException: Server returned HTTP response code: 503 for URL: https://ipv4.google.com/sorry/index?continue=https://www.google.com/search%3Fq%3Dautoradiograph&q=EgTLe7ahGOKSrcMFIhkA8aeDS_cQehdQreptc4cInLKEPYpprweeMgNyY24

This happens because Google blocks automated searches to protect its servers against denial-of-service attacks.
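
The stack trace above shows what the block looks like from code: the request is redirected to a /sorry/ page and answered with HTTP 503. As a minimal sketch (the class `BlockDetector` and its methods are hypothetical names, not part of the question's program), the batch could check for that condition explicitly:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of the original program.
final class BlockDetector {

    // True when the response looks like the block seen in the trace:
    // HTTP 503, or a final URL that points at a /sorry/ page.
    static boolean isBlocked(int status, String finalUrl) {
        return status == 503 || finalUrl.contains("/sorry/");
    }

    // Sends one request and reports whether it appears to be blocked.
    static boolean probe(String path, String agent) throws IOException {
        HttpURLConnection c = (HttpURLConnection) new URL(path).openConnection();
        c.setRequestProperty("User-Agent", agent);
        // getResponseCode() returns 503 instead of throwing,
        // unlike getInputStream() in the question's code.
        int status = c.getResponseCode();
        // After redirects have been followed, getURL() is the final URL.
        String finalUrl = c.getURL().toString();
        c.disconnect();
        return isBlocked(status, finalUrl);
    }
}
```

In the question's loop, a `true` result from `probe` would be a reason to stop the run rather than print "Trying next search", since the following queries are likely to hit the same block.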

(Image: Google CAPTCHA page)

Google may not allow you to run automated searches at all. Here is an excerpt from their support page.

Automated queries

Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google. Sending automated queries consumes resources and includes using any software (such as WebPosition Gold) to send automated queries to Google to determine how a website or webpage ranks in Google search results for various queries. In addition to rank checking, other types of automated access to Google without permission are also a violation of our Webmaster Guidelines and Terms of Service.
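
The answer does not name an alternative. One route that may count as permitted access is Google's Custom Search JSON API, which takes an API key and a search engine ID. The sketch below only builds the request URL. `CustomSearchUrl`, `buildUrl`, and the key and ID values passed to it are assumptions for illustration, and it needs Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original program or answer.
final class CustomSearchUrl {
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    // Builds a request URL; all three values are URL-encoded.
    static String buildUrl(String apiKey, String engineId, String query) {
        return ENDPOINT
            + "?key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8)
            + "&cx=" + URLEncoder.encode(engineId, StandardCharsets.UTF_8)
            + "&q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }
}
```

The response is JSON, with results listed under `items` and each result's URL in its `link` field. The API is subject to its own quota and terms.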

Regarding java - Google search in Java, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41437689/
