Java screen scraping using sockets?

Tags: java sockets networking screen-scraping

I am trying to collect the HTML from this page: http://movies.about.com/od/actorsalphalist/Actors_Detailed_Movie_News_Interviews_Websites.htm

I open a socket and try to read and print each line of the HTML page. When I run it, all I get is "EOF is false" followed by "1", and then nothing more.

I have no idea what is going wrong, since I know this approach worked in another example... Any help is greatly appreciated!

import java.net.*;
import java.io.*;
import java.util.*;

public class Twitter {

    static final int DEFAULT_PORT = 80;

    protected DataInputStream reply = null;
    protected PrintStream send = null;
    protected Socket sock = null;

    // ***********************************************************
    // *** The constructors create the socket and set up the input
    // *** and output channels on that socket.

    public Twitter() throws UnknownHostException, IOException {
        this(DEFAULT_PORT);
    }

    public Twitter(int port) throws UnknownHostException, IOException {
        sock = new Socket("movies.about.com", port);
        System.out.println(sock);
        reply = new DataInputStream(sock.getInputStream());
        System.out.println();
        send = new PrintStream(sock.getOutputStream());
    }

    // ***********************************************************
    // *** forecast uses the socket that has already been created
    // *** to carry on a conversation with the Web server that
    // *** was contacted through that socket.

    public void forecast() {
        int i;
        String HTMLline;
        boolean eof, gotone;

        // *** This issues the same query that a Web browser would issue
        // *** to the Web server.

        try {
            send.println("GET /od/actorsalphalist/Actors_Detailed_Movie_News_Interviews_Websites.htm HTTP/1.1");
        } catch (Exception e) {
            System.out.println("about.com server is down.");
        }

        // *** This section parses the response from the Web server.
        // *** NOTE THAT "real" EOF does not occur until the Web server
        // *** has closed the connection.

        eof = false;
        gotone = false;
        while (!eof) {
            System.out.println("EOF is false");
            try {
                System.out.println("1");
                HTMLline = reply.readLine();
                System.out.println("2");
                System.out.println(HTMLline);
                System.out.println("Here?");
                if (HTMLline != null) {
                    System.out.println("its not null");
                }
                if (HTMLline == null) {
                    System.out.println("WTFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF");
                } else {
                    eof = true;
                    System.out.println("is it?");
                }
            } catch (Exception e) {
                System.out.println("this exception happend");
                e.printStackTrace();
                eof = true;
            }
        }
    }

    // ***********************************************************
    // *** We need to close the socket when this class is destroyed.

    protected void finalize() throws Throwable {
        sock.close();
    }

    // ***********************************************************
    // *** The main program creates a new Twitter class and
    // *** sends that class the command line args (via findNumber).

    public static void main(String[] args) {
        Twitter aboutCom;
        DataInputStream cin = new DataInputStream(System.in);

        try {
            aboutCom = new Twitter();
            aboutCom.forecast();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Best answer

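You have not sent a valid HTTP request, so the server is still waiting for you to finish it. The GET line must end with \r\n, and it must be followed by another, empty line that terminates the request headers.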
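To make that concrete, here is a minimal sketch of a well-formed request sent over a plain socket. It is not the asker's code: the class name RawHttpGet and its helper methods are made up for illustration. Two things go beyond what is stated above and are additions here: HTTP/1.1 also requires a Host header, and "Connection: close" asks the server to close the connection after the response, so that readLine() eventually returns null. The request is written with explicit \r\n instead of println(), because println() uses the platform line separator, which is not \r\n on every system. Whether the site still serves this page over plain HTTP on port 80 is not something this sketch can promise.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class RawHttpGet {

    // Builds a minimal HTTP/1.1 GET request. Every line ends with \r\n,
    // and a final empty line terminates the header section.
    static String buildGetRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"   // required by HTTP/1.1 (addition, see lead-in)
             + "Connection: close\r\n"    // ask the server to close after responding
             + "\r\n";                    // blank line: end of headers
    }

    // Sends the request and prints every response line (status line,
    // headers, body) until the server closes the connection.
    static void fetch(String host, int port, String path) throws IOException {
        try (Socket sock = new Socket(host, port)) {
            OutputStream out = sock.getOutputStream();
            out.write(buildGetRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // ISO-8859-1 is an assumption; it maps every byte to a character,
            // so decoding never fails even if the page uses another charset.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream(), StandardCharsets.ISO_8859_1));
            String line;
            while ((line = in.readLine()) != null) {   // null means real EOF
                System.out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        fetch("movies.about.com", 80,
              "/od/actorsalphalist/Actors_Detailed_Movie_News_Interviews_Websites.htm");
    }
}
```

Note that the loop keeps reading until readLine() returns null. The loop in the question sets eof to true after the first non-null line, so even with a valid request it would print only the status line.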

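However, you should be using URL, openConnection(), getInputStream(), etc. for this, rather than needlessly trying to reimplement HTTP yourself. Doing it your way, all you gain is the opportunity to make mistakes.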
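A rough sketch of that approach follows, using only java.net.URL, openConnection() and getInputStream(). The class name UrlFetch and the helper readAll are hypothetical, and decoding the page as UTF-8 is an assumption; a more careful version would take the charset from the response's Content-Type header.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class UrlFetch {

    // Reads the whole stream as text, normalizing line endings to '\n'.
    // UTF-8 is assumed here (see lead-in).
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Lets the JDK build and send the HTTP request and parse the response
    // headers; getInputStream() yields only the response body.
    static String fetch(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.print(fetch(
            "http://movies.about.com/od/actorsalphalist/Actors_Detailed_Movie_News_Interviews_Websites.htm"));
    }
}
```

With this version there is no request line or header to get wrong, and the stream returned by getInputStream() contains only the HTML, not the status line and headers.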

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/14432262/
