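My goal is to connect to an OWA page (Microsoft Office Outlook Web Access, basically an email client), log in, then read the page that loads next and find the inbox count.
To log in, I need to fill in the username and password fields and call a certain JavaScript function whose name and signature I know.
How do I:
- get the page's DOM?
- update the DOM to fill in the text input fields?
- call that JavaScript function?
- get the new URL of the page I am redirected to?
So far, I have been able to connect to the web page and load its page source with the following Java code: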
// open the connection to the welcome page
callback.status("Opening connection...");
URLConnection connection = null;
try
{
connection = url.openConnection();
}
catch(IOException ex)
{
throw new Exception("I/O Problem while attempting URL connection");
}
connection.setDoInput(true);
// open input stream to read website
callback.status("Opening data stream...");
InputStream input = null;
try
{
input = connection.getInputStream();
}
catch(IOException ex)
{
throw new Exception("I/O Problem while opening data stream");
}
// read website contents
callback.status("Reading site...");
StringBuilder content = new StringBuilder();
byte[] buffer = new byte[4096];
int bytesRead = 0;
try
{
while((bytesRead = input.read(buffer)) != -1)
{
// caveat: decoding chunk by chunk with the platform charset can split a multi-byte character
content.append(new String(buffer, 0, bytesRead));
}
}
catch(IOException ex)
{
throw new Exception("I/O Problem while reading website");
}
System.out.println(content);
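The result is the entire page source printed to the console, which is great. I also tried parsing the page to get a DOM object that I could then walk to find my username and password fields: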
XMLParserConfiguration config = new XML11DTDConfiguration();
DOMParser parser = new DOMParser(config);
InputSource inputSource = new InputSource(input);
inputSource.setByteStream(input);
try
{
parser.parse(inputSource);
}
catch(SAXParseException ex)
{
}
Document document = parser.getDocument();
visitNode(document, 0);
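But I get a SAXParseException: :6:62: White spaces are required between publicId and systemId.
This line appears to be the culprit:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
So I probably need to change the DOMParser's configuration somehow to make it lenient enough to "forgive" the whitespace requirement.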
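The parser's complaint follows from the XML grammar: in XML, a PUBLIC doctype must give a system identifier after the public identifier, whereas HTML 4 allows it to be omitted. The sketch below reproduces this with the JDK's built-in parser rather than the Xerces DOMParser from the snippet above. The class and method names are made up for illustration, and the entity resolver is there only so that no DTD is fetched over the network. Note that adding a system identifier only gets past the DOCTYPE; it does not make real-world HTML parse as XML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original code.
class DoctypeCheck {

    /** Returns null if the text parses as XML, otherwise the parser's error message. */
    static String parseError(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Resolve every external entity (including the DTD) to empty content,
            // so nothing is downloaded while parsing.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            // DefaultHandler ignores warnings and recoverable errors and rethrows fatal ones.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXException ex) {
            return ex.getMessage();
        } catch (ParserConfigurationException | IOException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String publicId = "\"-//W3C//DTD HTML 4.0 Transitional//EN\"";
        String body = "<HTML><BODY></BODY></HTML>";
        // HTML-style doctype: public identifier only.
        String htmlStyle = "<!DOCTYPE HTML PUBLIC " + publicId + ">" + body;
        // XML-style doctype: public identifier followed by a system identifier.
        String xmlStyle = "<!DOCTYPE HTML PUBLIC " + publicId
                + " \"http://www.w3.org/TR/REC-html40/loose.dtd\">" + body;
        System.out.println("without system identifier: " + parseError(htmlStyle));
        System.out.println("with system identifier:    " + parseError(xmlStyle));
    }
}
```

The first document should be rejected and the second accepted, which suggests the problem is the HTML-style DOCTYPE itself rather than a missing space somewhere.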
Best answer
So you want to programmatically act like a GUI-less web browser? Use HtmlUnit, which is exactly what it advertises itself as.
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.
It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.
It is typically used for testing purposes or to retrieve information from web sites.
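As a rough sketch of how the four steps from the question could look with HtmlUnit: this assumes HtmlUnit 3.x or later on the classpath (package org.htmlunit; older 2.x releases used com.gargoylesoftware.htmlunit). The field names "username" and "password", the script call doLogin() and the URL are placeholders, not taken from a real OWA page, so substitute whatever the actual login page uses.

```java
import java.io.IOException;
import java.net.URL;
import org.htmlunit.Page;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlInput;
import org.htmlunit.html.HtmlPage;

// Hypothetical sketch: field names, the script call and the URL are placeholders.
class OwaLoginSketch {

    /** Loads the login page, fills both fields, runs the login script, returns the resulting page. */
    static Page login(WebClient webClient, String loginUrl, String user, String password,
                      String loginScript) throws IOException {
        // 1. Load the page; HtmlPage is HtmlUnit's DOM for the document.
        HtmlPage loginPage = webClient.getPage(loginUrl);

        // 2. Fill in the input fields, looked up by their "name" attribute (assumed names).
        HtmlInput userField = loginPage.getElementByName("username");
        HtmlInput passwordField = loginPage.getElementByName("password");
        userField.type(user);
        passwordField.type(password);

        // 3. Run the page's own JavaScript function in the context of the page.
        loginPage.executeJavaScript(loginScript);

        // 4. If the script navigated, the same window now holds the new page.
        return loginPage.getEnclosingWindow().getEnclosedPage();
    }

    public static void main(String[] args) throws IOException {
        try (WebClient webClient = new WebClient()) {
            // Real-world pages often contain scripts HtmlUnit cannot run; do not abort on those.
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            Page inbox = login(webClient, "https://mail.example.com/owa/", "user", "secret", "doLogin()");
            URL newUrl = inbox.getUrl();
            System.out.println("Now at: " + newUrl);
        }
    }
}
```

If the login function navigates asynchronously, a call to webClient.waitForBackgroundJavaScript(...) before reading the enclosed page may be needed. The inbox count would then come from querying the returned page's DOM (for example with getByXPath), but the exact markup depends on the OWA version.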
Source: "java - Calling JavaScript on a web page from Java", a similar question found on Stack Overflow: https://stackoverflow.com/questions/3282764/