javascript - Regex that extracts all subdomains + domain from a URL and is compatible with RFC 3490

I'm looking for a regular expression that can extract all subdomains plus the domain from a URL.

I found this one here:

/([a-z0-9|-]+\.)*[a-z0-9|-]+\.[a-z]+/

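It extracts the subdomains and domain, but unfortunately it does not reject a leading - in front of a subdomain or domain label, and it does not support non-ASCII characters as specified in RFC 3490.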
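To make the two shortcomings concrete, here is a small check of the quoted pattern. It is only a sketch: the `firstMatch` helper and the `-foo.example.com` host are made up for illustration, while the second input is one of the URLs listed in this question.

```javascript
// The pattern quoted in the question, unchanged.
const original = /([a-z0-9|-]+\.)*[a-z0-9|-]+\.[a-z]+/;

// Hypothetical helper: returns the matched text, or null when nothing matches.
function firstMatch(pattern, input) {
  const match = pattern.exec(input);
  return match ? match[0] : null;
}

// 1) A label with a leading hyphen is accepted as-is.
const leadingHyphen = firstMatch(original, 'http://-foo.example.com/');

// 2) Matching stops at the first non-ASCII character (here: ß).
const nonAscii = firstMatch(original, 'https://www.fußballspiel.de/');

console.log(leadingHyphen); // "-foo.example.com"
console.log(nonAscii);      // "www.fu"
```

The first result shows the hyphen problem, and the second shows the host being cut off at the first character outside a-z, 0-9, | and -.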

Here are some examples I would like to capture:

http://www.例如.中国/
http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html
https://www.fußballspiel.de/
http://www.simulateur-prêt.fr

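Best answer

I put together the following regular expression and commented it heavily, which I hope makes it clearer what is going on. It matches labels made of ASCII as well as non-ASCII characters, and it extracts the desired information from your examples.

Regex example: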

const regexp = new RegExp(
  "^" +                     // Ensures a match is found only if it starts at the beginning of a string (or of a line, because of the "m" flag).
  "(?:^\\w+:\\/\\/)?" +     // Matches a protocol at the beginning of the string, which is optional.
  "(" +                     // The beginning of our capture group.
  "(?:" +                   // The beginning of our sub-domain non-capturing group.
  "(?!-)" +                 // Fails this alternative if the sub-domain begins with a hyphen.
  "[\\w-]+" +               // Matches one or more word characters (letters, digits, underscore) or hyphens.
  "|" +                     // OR
  "[^\\x00-\\x7F]+-*" +     // Matches one or more characters outside the ASCII character set, followed by zero or more hyphens.
  ")+" +                    // The end of our sub-domain non-capturing group, requiring at least one match.
  "\\." +                   // An escaped dot that serves as the separator after our sub-domain.
  "(?:" +                   // The beginning of our domain non-capturing group including the dot separator.
  "(?:" +                   // The beginning of our domain non-capturing group excluding the dot separator.
  "(?!-)" +                 // Fails this alternative if the domain begins with a hyphen.
  "[\\w-]+" +               // Matches one or more word characters (letters, digits, underscore) or hyphens.
  "|" +                     // OR
  "[^\\x00-\\x7F]+-*" +     // Matches one or more characters outside the ASCII character set, followed by zero or more hyphens.
  ")+" +                    // The end of our domain non-capturing group excluding the dot separator, requiring at least one match.
  "\\." +                   // An escaped dot that serves as the separator after our domain.
  ")*" +                    // The end of our domain non-capturing group including the dot separator, allowing zero or more matches.
  "(?:" +                   // The beginning of our top-level domain non-capturing group.
  "(?!-)" +                 // Fails this alternative if the top-level domain begins with a hyphen.
  "[\\w-]+" +               // Matches one or more word characters (letters, digits, underscore) or hyphens.
  "|" +                     // OR
  "[^\\x00-\\x7F]+-*" +     // Matches one or more characters outside the ASCII character set, followed by zero or more hyphens.
  ")*" +                    // The end of our top-level domain non-capturing group, allowing zero or more matches.
  ")", "im");               // The end of our capture group, and the end of our regex! Phew! The "im" flags make the expression case-insensitive and multiline.

const urls = [
  'http://www.例如.中国/',
  'http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html',
  'https://www.fußballspiel.de/',
  'http://www.simulateur-prêt.fr'
];

const hostnames = urls.map(url => {
  const match = regexp.exec(url);   // exec returns null when nothing matches
  return match ? match[1] : null;
});

hostnames.forEach((hostname, index) => {
  console.log('Input:', urls[index], '\nOutput:', hostname);
})
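If a regex is not a hard requirement, the `URL` class built into browsers and Node.js can isolate the host as well. The sketch below assumes such an environment; `extractHostname` is a hypothetical helper name. Note that `hostname` comes back in its ASCII-compatible `xn--` form rather than in the original Unicode spelling, so displaying it as typed would need an extra conversion step.

```javascript
// Sketch: let the built-in URL parser isolate the host instead of a regex.
// extractHostname is a made-up name for illustration.
function extractHostname(input) {
  try {
    // hostname excludes the protocol, port and path.
    return new URL(input).hostname;
  } catch (err) {
    // new URL() throws a TypeError for strings that are not absolute URLs.
    return null;
  }
}

const sample = 'http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html';
const host = extractHostname(sample);

// Prints the host with its non-ASCII labels encoded as "xn--..." labels.
console.log(host);
```

The trade-off is that the parser decides what counts as a valid host, which is less flexible than a hand-written pattern but avoids maintaining one.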

Hope this helps! All the best!

A similar question about "javascript - Regex that extracts all subdomains + domain from a URL and is compatible with RFC 3490" can be found on Stack Overflow: https://stackoverflow.com/questions/57066708/
