Java regular expression to retrieve links from text

Tags: java regex string url text

My input String is:

String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";

I want to convert this text to:

Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it

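So here:

1) I want to replace the link tag with a plain link. If the tag contains a label, the label should go in parentheses after the URL.

2) If the URL is relative, I want to prefix it with the base URL (http://www.google.com).

3) I want to append a parameter to the URL (&myParam=pqr).

I am having trouble retrieving the tag, which contains both the URL and the label, and replacing it.

I wrote something like this: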

public static void main(String[] args) {
    String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";
    text = text.replaceAll("&lt;", "<");
    text = text.replaceAll("&gt;", ">");
    text = text.replaceAll("&amp;", "&");

    // this is not working
    Pattern p = Pattern.compile("href=\"(.*?)\"");
    Matcher m = p.matcher(text);
    String url = null;
    if (m.find()) {
        url = m.group(1);

    }
}

// helper method to append new query params once I have the url
public static URI appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
    URI oldUri = new URI(uriToUpdate);
    String newQueryParams = oldUri.getQuery();
    if (newQueryParams == null) {
        newQueryParams = queryParamsToAppend;
    } else {
        newQueryParams += "&" + queryParamsToAppend;  
    }
    URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
            oldUri.getPath(), newQueryParams, oldUri.getFragment());
    return newUri;
}

Edit 1:

Pattern p = Pattern.compile("HREF=\"(.*?)\"");

This works, but I want it to be case-insensitive. Href, HRef, href, hrEF and so on should all match.

Also, how do I handle the case where my text contains multiple URLs?
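Both points can be covered with a case-insensitive pattern and a loop over `Matcher.find()`. The following is only a minimal sketch: the class name `HrefFinder` and the method `findHrefs` are made up for illustration, and it assumes every href value is wrapped in double quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
final class HrefFinder {
    // (?i) is the inline form of the Pattern.CASE_INSENSITIVE flag.
    private static final Pattern HREF = Pattern.compile("(?i)href=\"(.*?)\"");

    // Collects the value of every double-quoted href attribute in the text.
    static List<String> findHrefs(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(text);
        while (m.find()) { // each call to find() moves on to the next match
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "<A HREF=\"/a.cgi\">A</A> and <a hRef=\"/b.cgi\">B</a>";
        System.out.println(findHrefs(text)); // prints [/a.cgi, /b.cgi]
    }
}
```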

Edit 2:

Some progress.

Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
  url = m.group(1);
  System.out.println(url);
}

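This handles the case of multiple URLs.

The last open question is how to get the label and replace the link tag in the original text with the URL and the label.

Edit 3:

By the multiple-URL case, I mean that the given text contains more than one URL.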

String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
 url = m.group(1); // this variable should contain the link URL
 url = appendBaseURI(url);
 url = appendQueryParams(url, "license=ABCXYZ").toString(); // appendQueryParams returns a URI
 System.out.println(url);
}
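The snippet above calls `appendBaseURI`, which is not shown anywhere in the question. One possible shape for it, offered only as a sketch, is to let `java.net.URI.resolve` do the prefixing. The class name `BaseUriSketch` and the `BASE` constant are assumptions; the base URL is the one from the question's example.

```java
import java.net.URI;

// Hypothetical stand-in for the appendBaseURI call used in the question.
final class BaseUriSketch {
    // Assumed base URL. The trailing slash matters: with an empty base path,
    // resolving a relative path such as "fruit.cgi" can come out malformed.
    private static final URI BASE = URI.create("http://www.google.com/");

    // Relative URLs are resolved against BASE; absolute URLs come back unchanged.
    // URI.resolve(String) throws IllegalArgumentException if the URL is not a valid URI.
    static String appendBaseURI(String url) {
        return BASE.resolve(url).toString();
    }

    public static void main(String[] args) {
        System.out.println(appendBaseURI("/relative-path/fruit.cgi?param1=abc&param2=xyz"));
        System.out.println(appendBaseURI("https://example.com/x"));
    }
}
```

Compared with checking `startsWith("/")` by hand, this also leaves URLs that already have a scheme untouched.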

Best answer

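// Needs java.net.URISyntaxException, java.util.regex.Matcher and java.util.regex.Pattern,
// plus StringEscapeUtils from Apache Commons (for example org.apache.commons.text.StringEscapeUtils).
public static void main(String[] args) throws URISyntaxException {
    String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";
    // Turn &lt;, &gt; and &amp; back into <, > and &.
    text = StringEscapeUtils.unescapeHtml4(text);
    // Group 1 captures the URL, group 2 captures the label.
    Pattern p = Pattern.compile("<a href=\"(.*?)\">(.*?)</a>", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(text);
    while (m.find()) {
        // The matcher keeps scanning the original string, so reassigning text here is safe.
        text = text.replace(m.group(0), cleanUrlPart(m.group(1), m.group(2)));
    }
    System.out.println(text);
}

// appendQueryParams (from the question) throws URISyntaxException, so it is declared here too.
private static String cleanUrlPart(String url, String label) throws URISyntaxException {
    if (!url.startsWith("http") && !url.startsWith("www")) {
        if (url.startsWith("/")) {
            url = "http://www.google.com" + url;
        } else {
            url = "http://www.google.com/" + url;
        }
    }
    url = appendQueryParams(url, "myParam=pqr").toString();
    if (label != null && !label.isEmpty()) url += " (" + label + ")";
    return url;
}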

Output

Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it and another link http://www.google.com/relative-path/vegetables.cgi?param1=abc&param2=xyz&myParam=pqr (URL2 Label) and some more text
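As a variation on the answer, the replacement can also be done in a single pass with `Matcher.replaceAll(Function)`, which is available from Java 9. This is only a sketch under a few assumptions: the text has already been HTML-unescaped, href values are double-quoted, and the names `LinkFlattener`, `flatten` and `addParam` are invented for illustration. The pattern is a little more tolerant than the one above, since it allows other attributes and extra whitespace inside the tag.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; a single-pass variant of the answer's approach.
final class LinkFlattener {
    // Assumed base URL from the question's example.
    private static final URI BASE = URI.create("http://www.google.com/");

    // Group 1 is the href value, group 2 is the label between <a ...> and </a>.
    private static final Pattern LINK = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Replaces every link with "absoluteUrl (label)", adding extraParam to the query.
    static String flatten(String text, String extraParam) {
        return LINK.matcher(text).replaceAll(m -> {
            String url = addParam(BASE.resolve(m.group(1)), extraParam);
            String label = m.group(2).trim();
            String out = label.isEmpty() ? url : url + " (" + label + ")";
            // Escape '$' and '\' so they are not read as group references.
            return Matcher.quoteReplacement(out);
        });
    }

    // Same idea as appendQueryParams in the question, with the checked exception wrapped.
    static String addParam(URI uri, String param) {
        String query = uri.getQuery() == null ? param : uri.getQuery() + "&" + param;
        try {
            return new URI(uri.getScheme(), uri.getAuthority(), uri.getPath(),
                    query, uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Some content which contains link as "
                + "<A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A>"
                + " and some text after it";
        System.out.println(flatten(text, "myParam=pqr"));
    }
}
```

Because the replacement is built from the match itself, identical link tags that occur more than once are each handled exactly once, and no intermediate strings are created per match.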

On retrieving links from text with a Java regular expression, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53423132/
