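javascript - A regular expression that extracts all subdomains + the domain from a URL and is compatible with RFC 3490

Tags: javascript node.js regex

I'm looking for a regular expression that can extract all subdomains plus the domain from a URL.

I have already found this one from here: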

/([a-z0-9|-]+\.)*[a-z0-9|-]+\.[a-z]+/

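It extracts the subdomains + domain, but unfortunately it does not care about a leading - in front of a subdomain or domain, and it does not support non-ASCII characters as specified in RFC 3490.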
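
To make the two shortcomings concrete, here is a minimal sketch that runs the quoted pattern against one internationalized URL and one hostname with a leading hyphen. The helper name `extractWithOriginal` and the `-foo.example.com` input are illustrative assumptions, not part of the original post.

```javascript
// The pattern quoted in the question, unchanged.
// Note that "|" inside a character class is a literal pipe, not an alternation.
const original = /([a-z0-9|-]+\.)*[a-z0-9|-]+\.[a-z]+/;

// Hypothetical helper: returns the first match, or null if nothing matches.
function extractWithOriginal(url) {
  const match = original.exec(url);
  return match ? match[0] : null;
}

// Non-ASCII: the match is cut short, because "ê" is outside every class in the pattern.
const idn = extractWithOriginal('http://www.simulateur-prêt.fr');
console.log(idn === 'www.simulateur-prêt.fr'); // false

// Leading hyphen: the pattern accepts it as part of the first label.
console.log(extractWithOriginal('http://-foo.example.com/')); // "-foo.example.com"
```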

Here are some examples I would like to capture:

http://www.例如.中国/
http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html
https://www.fußballspiel.de/
http://www.simulateur-prêt.fr

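Best answer

I put together the following regular expression and commented it heavily, hoping to better describe what is going on. It matches both ASCII and non-ASCII characters and successfully extracts the required information from your examples.

Regular expression example: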

const regexp = new RegExp(
  "^" +                     // Anchors the match at the start of the string (with the "m" flag, at the start of any line).
  "(?:^\\w+:\\/\\/)?" +     // Optionally matches a protocol such as "http://" (the inner "^" is redundant).
  "(" +                     // Start of capture group 1: the full hostname.
  "(?:" +                   // Start of the first label (non-capturing group).
  "(?!-)" +                 // Negative lookahead: this ASCII run must not start with a hyphen.
  "[\\w-]+" +               // One or more ASCII word characters (letters, digits, underscore) or hyphens.
  "|" +                     // OR
  "[^\\x00-\\x7F]+-*" +     // One or more non-ASCII characters, followed by zero or more hyphens.
  ")+" +                    // End of the first label; the alternatives repeat one or more times.
  "\\." +                   // An escaped dot: the separator after the first label.
  "(?:" +                   // Start of the group for any further "label." pairs.
  "(?:" +                   // Start of a label (non-capturing group), without its trailing dot.
  "(?!-)" +                 // Negative lookahead: this ASCII run must not start with a hyphen.
  "[\\w-]+" +               // One or more ASCII word characters or hyphens.
  "|" +                     // OR
  "[^\\x00-\\x7F]+-*" +     // One or more non-ASCII characters, followed by zero or more hyphens.
  ")+" +                    // End of the label; the alternatives repeat one or more times.
  "\\." +                   // An escaped dot: the separator after this label.
  ")*" +                    // End of the "label." group, which may occur zero or more times.
  "(?:" +                   // Start of the final label (the top-level domain).
  "(?!-)" +                 // Negative lookahead: this ASCII run must not start with a hyphen.
  "[\\w-]+" +               // One or more ASCII word characters or hyphens.
  "|" +                     // OR
  "[^\\x00-\\x7F]+-*" +     // One or more non-ASCII characters, followed by zero or more hyphens.
  ")*" +                    // End of the final label; the alternatives repeat zero or more times.
  ")", "im");               // End of capture group 1. Flags: "i" is case-insensitive, "m" is multiline.

const urls = [
  'http://www.例如.中国/',
  'http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html',
  'https://www.fußballspiel.de/',
  'http://www.simulateur-prêt.fr'
];

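const hostnames = urls.map(url => {
  const match = regexp.exec(url);   // exec() returns null when the URL does not match
  return match ? match[1] : null;   // group 1 holds the hostname
});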

hostnames.forEach((hostname, index) => {
  console.log('Input:', urls[index], '\nOutput:', hostname);
})
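
As a cross-check on the regular expression, the same hostnames can also be obtained from Node.js's built-in WHATWG `URL` parser, which returns the ASCII (Punycode) form, and `url.domainToUnicode`, which converts it back. This is only a sketch of an alternative technique, not part of the original answer; `extractHostname` is a hypothetical helper name, and it assumes a Node.js build with ICU support (the default).

```javascript
const { domainToUnicode } = require('node:url');

// Hypothetical helper: parses the input with the WHATWG URL parser and
// returns both forms of the hostname, or null if the input is not a valid URL.
function extractHostname(input) {
  let parsed;
  try {
    parsed = new URL(input);
  } catch {
    return null; // new URL() throws a TypeError on invalid input
  }
  return {
    ascii: parsed.hostname,                   // Punycode form, port excluded
    unicode: domainToUnicode(parsed.hostname) // human-readable form
  };
}

const samples = [
  'http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html',
  'http://www.simulateur-prêt.fr'
];

for (const sample of samples) {
  console.log(sample, '->', extractHostname(sample));
}
```

Unlike the regular expression, this approach requires a scheme such as `http://` on the input, so it is not a drop-in replacement for matching bare hostnames.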

Hope this helps! All the best!

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/57066708/
