java - How can I use Java to extract URLs from an HTML file stored on my computer?

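I need to find all the URLs in an HTML file stored on my own computer, extract the links, and store them in a variable. I am using the code below to scan the file and read its contents, but I am having trouble extracting only the links. I would appreciate any help.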

    Scanner htmlScanner = new Scanner(new File(args[0]));
    PrintWriter output = new PrintWriter(new FileWriter(args[1]));
    while(htmlScanner.hasNext()){
        output.print(htmlScanner.next());

    }
    System.out.println("\nDone");
    htmlScanner.close();
    output.close();

Best answer

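You can actually do this with the Swing HTML parser. Although the Swing parser only understands HTML 3.2, tags introduced in later versions of HTML are simply treated as unknown, and all you really want are the links anyway.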

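    import java.io.IOException;
    import java.io.Reader;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collection;
    // javax.activation is not part of the JDK from Java 11 onward;
    // there it has to be added as a separate dependency.
    import javax.activation.MimeType;
    import javax.activation.MimeTypeParseException;
    import javax.swing.text.BadLocationException;
    import javax.swing.text.ChangedCharSetException;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLDocument;
    import javax.swing.text.html.HTMLEditorKit;

    static Collection<String> getLinks(Path file)
    throws IOException,
           MimeTypeParseException,
           BadLocationException {

        HTMLEditorKit htmlKit = new HTMLEditorKit();

        HTMLDocument htmlDoc;
        try {
            // First pass: read as ISO-8859-1 until the document declares a charset.
            htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
            try (Reader reader =
                Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {

                htmlKit.read(reader, htmlDoc, 0);
            }
        } catch (ChangedCharSetException e) {
            // The document declared its own charset: extract it from the
            // content-type spec and re-read the file from the start.
            MimeType mimeType = new MimeType(e.getCharSetSpec());
            String charset = mimeType.getParameter("charset");

            htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
            // Prevent the same exception from being thrown on the second pass.
            htmlDoc.putProperty("IgnoreCharsetDirective", true);
            try (Reader reader =
                Files.newBufferedReader(file, Charset.forName(charset))) {

                htmlKit.read(reader, htmlDoc, 0);
            }
        }

        Collection<String> links = new ArrayList<>();

        // Collect the href attribute of every <link> and <a> element.
        for (HTML.Tag tag : Arrays.asList(HTML.Tag.LINK, HTML.Tag.A)) {
            HTMLDocument.Iterator it = htmlDoc.getIterator(tag);
            while (it.isValid()) {
                String link = (String)
                    it.getAttributes().getAttribute(HTML.Attribute.HREF);

                if (link != null) {
                    links.add(link);
                }

                it.next();
            }
        }

        return links;
    }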
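
To try the same approach without a file on disk, the sketch below parses an HTML string instead. It is only an illustration, not part of the original answer: the class name `LinkDemo`, the helper `extractLinks`, and the sample markup are hypothetical. It assumes the input is already decoded text, so the charset retry is skipped, and it looks only at `<a>` elements. The sample wraps one link in a `<section>` tag, which is newer than HTML 3.2, to show that unknown tags should not stop the links from being found.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class LinkDemo {

    // Hypothetical helper: returns the href values of all <a> elements,
    // in the order the iterator visits them.
    static List<String> extractLinks(String html)
            throws IOException, BadLocationException {

        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();

        // The input is already a decoded String, so ignore any charset
        // declaration instead of retrying with a different encoding.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        try (Reader reader = new StringReader(html)) {
            kit.read(reader, doc, 0);
        }

        List<String> links = new ArrayList<>();
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
                it.isValid(); it.next()) {

            Object href = it.getAttributes().getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample markup; <section> is unknown to an HTML 3.2 parser.
        String html = "<html><body>"
            + "<section><a href=\"https://example.com/a\">A</a></section>"
            + "<p><a href=\"b.html\">B</a></p>"
            + "</body></html>";

        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Note that href values are returned exactly as written in the markup, so relative links such as `b.html` stay relative. If absolute URLs are needed, they could be resolved against the file's location, for example with `file.toUri().resolve(link)`.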

Regarding "java - How can I use Java to extract URLs from an HTML file stored on my computer?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23372009/
