java - How to scrape source code using HtmlUnit

Tags: java web-scraping htmlunit

I'm trying to write a program with HtmlUnit that scrapes the source code of a website and returns it. My code so far is:

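import java.io.IOException;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Htmlunitscraper {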
  private static String s = "website";

  public static HtmlPage scrapeWebsite() throws IOException {
    final WebClient webClient = new WebClient();
    final HtmlPage page = webClient.getPage(s);

    return page.getPage();
  }
}

I assumed the getPage method would return the source code, but I keep getting errors and only the URL comes back. The output is:

Oct 16, 2013 4:07:59 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Oct 16, 2013 4:08:00 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/Scripts/jquery.js] line=[2] lineSource=[null] lineOffset=[0]
Oct 16, 2013 4:08:00 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Oct 16, 2013 4:08:00 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Oct 16, 2013 4:08:00 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Oct 16, 2013 4:08:01 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Oct 16, 2013 4:08:01 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/ScriptResource.axd?d=0XCJGMnW_16F7h4EC7avEaQ_Ma7RLZvTA2-XkhkFcfSnWFOkCRjbat77Yi12o3uS3yGC-YMdXQ_w3i5MHWALH-xBqxutgCryrSWcT8prtHkRngrJRiKTP-EYEm1QJ6zB0&t=ffffffff823b7694] line=[2] lineSource=[null] lineOffset=[0]
Oct 16, 2013 4:08:01 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
HtmlPage(http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/results.aspx?q=searchType%3dSaleDate%26searchString%3d10%2f21%2f2013%26foreclosureType%3d%27NONT%27%2c+%27PAR%27%2c+%27COMM%27%2c+%27TXLN%27)@1134201154

Am I using the wrong method to return the source code? I haven't been able to find a good example of how to do this.

Best answer

To see the content of the page, do the following:

System.out.println(page.asXml());

This prints the page as well-formatted XML.
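Applied to the class from the question, the change amounts to returning a String built with asXml() instead of returning the HtmlPage itself. The following is a minimal sketch, not the asker's final code: the class name HtmlunitScraper, the url parameter, and the try-with-resources block are assumptions, and the latter requires an HtmlUnit 2.x release in which WebClient implements AutoCloseable.

```java
import java.io.IOException;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlunitScraper {

    // Loads the page and returns its DOM serialized as indented XML.
    // Assumes the URL serves HTML; other content types would not be an HtmlPage.
    public static String scrapeWebsite(String url) throws IOException {
        // Assumption: WebClient is AutoCloseable in the HtmlUnit version in use.
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage(url);
            return page.asXml();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the URL to scrape as the first command-line argument.
        System.out.println(scrapeWebsite(args[0]));
    }
}
```

Because asXml() serializes the DOM after HtmlUnit has parsed the page and run its scripts, the result can differ from the bytes the server sent.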

Everything else you are seeing is warnings and JavaScript errors raised by the page you are fetching.

If you need the unformatted page source, see this answer:
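One way to get the source as the server sent it is to read the body of the underlying web response rather than serializing the DOM. This is a sketch under assumptions and not a reproduction of the linked answer: the class name RawSourceScraper and the url parameter are hypothetical, and turning JavaScript off is a design choice made here because only the raw response is needed.

```java
import java.io.IOException;

import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;

public class RawSourceScraper {

    // Returns the response body as a String, without DOM re-serialization.
    public static String rawSource(String url) throws IOException {
        // Assumption: WebClient is AutoCloseable in the HtmlUnit version in use.
        try (WebClient webClient = new WebClient()) {
            // The raw response does not depend on scripts, so skip running them.
            webClient.getOptions().setJavaScriptEnabled(false);
            Page page = webClient.getPage(url);
            return page.getWebResponse().getContentAsString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the URL to fetch as the first command-line argument.
        System.out.println(rawSource(args[0]));
    }
}
```

With JavaScript disabled, the runtimeError entries from the page's scripts should no longer appear, at the cost of not seeing any content that those scripts would have generated.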

To turn off these warnings, check this answer:
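The log lines in the question have the default java.util.logging format, which suggests HtmlUnit's messages are being routed there. Under that assumption, a common approach is to raise the level of the logger for the com.gargoylesoftware package prefix seen in the output. The sketch below uses only the JDK; the class name QuietHtmlUnitLogging is hypothetical, and if another logging backend such as Log4j is on the classpath, the level would have to be configured in that backend instead.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogging {

    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // level set on an otherwise unreferenced logger can be lost after GC.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware");

    // Silences every logger whose name starts with "com.gargoylesoftware"
    // and that has no level of its own.
    public static void silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        // Child loggers inherit the effective level of their nearest parent.
        Logger child = Logger.getLogger(
                "com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl");
        System.out.println(child.isLoggable(Level.WARNING)); // false once the parent is OFF
    }
}
```

Calling QuietHtmlUnitLogging.silence() before creating the WebClient would suppress both the WARNING and SEVERE entries shown in the question. Use Level.SEVERE instead of Level.OFF to keep the script errors visible while hiding the content-type warnings.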

Regarding "java - How to scrape source code using HtmlUnit", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19412842/
