I'm looking for a regular expression that can extract all subdomains plus the domain from a URL.
I found this one here:
/([a-z0-9|-]+\.)*[a-z0-9|-]+\.[a-z]+/
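It is able to extract the subdomains plus the domain, but unfortunately it does not reject a - at the start of a subdomain/domain, and it does not support non-ASCII characters as specified in RFC 3490.
Here are some examples I would like to capture: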
http://www.例如.中国/
http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html
https://www.fußballspiel.de/
http://www.simulateur-prêt.fr
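To make the two shortcomings concrete, here is a minimal sketch that runs the quoted pattern as-is. The input `-foo.example.com` and the variable names are illustrative assumptions, not part of the original post.

```javascript
// The pattern quoted in the question, unchanged.
const found = /([a-z0-9|-]+\.)*[a-z0-9|-]+\.[a-z]+/;

// A leading hyphen is accepted, because "-" is inside the character class
// and nothing forbids it at the start of a label.
const leadingHyphen = found.exec('-foo.example.com')[0];

// Matching stops at the first non-ASCII character ("ü"), so only a
// fragment of the host name is returned.
const nonAscii = found.exec(
  'http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html'
)[0];

console.log(leadingHyphen); // "-foo.example.com"
console.log(nonAscii);      // "www.w"
```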
Best answer
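I put together the following regular expression and commented it heavily, hoping that describes what is going on more clearly. It handles labels made of both ASCII and non-ASCII characters, and it successfully extracts the required information from your examples.
Regex example: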
const regexp = new RegExp(
  "^" + // Anchors the match at the start of the string (or of a line, given the "m" flag).
  "(?:^\\w+:\\/\\/)?" + // Optionally matches a protocol such as "http://" at the start.
  "(" + // The beginning of our capture group, which holds the full host name.
  "(?:" + // The beginning of our sub-domain non-capturing group.
  "(?!-)" + // Rejects this alternative if the sub-domain begins with a hyphen.
  "[\\w-]+" + // Matches one or more word characters (ASCII letters, digits, underscore) or hyphens.
  "|" + // OR
  "[^\\x00-\\x7F]+-*" + // Matches one or more non-ASCII characters, followed by zero or more hyphens.
  ")+" + // The end of our sub-domain non-capturing group, requiring at least one match.
  "\\." + // An escaped dot that serves as the separator after the sub-domain.
  "(?:" + // The beginning of our domain non-capturing group, including the dot separator.
  "(?:" + // The beginning of our domain non-capturing group, excluding the dot separator.
  "(?!-)" + // Rejects this alternative if the domain begins with a hyphen.
  "[\\w-]+" + // Matches one or more word characters or hyphens.
  "|" + // OR
  "[^\\x00-\\x7F]+-*" + // Matches one or more non-ASCII characters, followed by zero or more hyphens.
  ")+" + // The end of our domain non-capturing group excluding the dot separator, requiring at least one match.
  "\\." + // An escaped dot that serves as the separator after the domain.
  ")*" + // The end of our domain non-capturing group including the dot separator, allowing zero or more matches.
  "(?:" + // The beginning of our top-level domain non-capturing group.
  "(?!-)" + // Rejects this alternative if the top-level domain begins with a hyphen.
  "[\\w-]+" + // Matches one or more word characters or hyphens.
  "|" + // OR
  "[^\\x00-\\x7F]+-*" + // Matches one or more non-ASCII characters, followed by zero or more hyphens.
  ")*" + // The end of our top-level domain non-capturing group, allowing zero or more matches.
  ")", "im"); // The end of our capture group, and the end of our regex! Phew! The "im" flags make the expression case-insensitive and let "^" match at the start of each line.
const urls = [
'http://www.例如.中国/',
'http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html',
'https://www.fußballspiel.de/',
'http://www.simulateur-prêt.fr'
];
const hostnames = urls.map(url => {
  const match = regexp.exec(url); // exec returns null when the URL does not match.
  return match ? match[1] : null;
});
hostnames.forEach((hostname, index) => {
  console.log('Input:', urls[index], '\nOutput:', hostname);
});
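As a compact way to try further inputs, the sketch below writes the same pattern as a regex literal and wraps it in a small helper. The names `hostPattern` and `extractHost` and the `-bad.example.com` input are hypothetical, added only for illustration.

```javascript
// The pattern from the answer, written as a single regex literal
// (intended to be equivalent to the string-built version).
const hostPattern = /^(?:^\w+:\/\/)?((?:(?!-)[\w-]+|[^\x00-\x7F]+-*)+\.(?:(?:(?!-)[\w-]+|[^\x00-\x7F]+-*)+\.)*(?:(?!-)[\w-]+|[^\x00-\x7F]+-*)*)/im;

// Hypothetical helper: returns the captured host name,
// or null when the input does not match at all.
function extractHost(url) {
  const match = hostPattern.exec(url);
  return match ? match[1] : null;
}

// A sample from the question: non-ASCII labels are kept intact.
console.log(extractHost('https://www.fußballspiel.de/')); // "www.fußballspiel.de"

// A first label that starts with a hyphen is rejected by the (?!-) lookahead.
console.log(extractHost('http://-bad.example.com/')); // null
```

Note that the lookahead only guards the start of a label; a label ending in a hyphen is not rejected by this pattern.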
Hope this helps! All the best!
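As a cross-check rather than a replacement for the regex, the host can also be read with the built-in `URL` parser. This is a different technique from the one in the answer: it assumes a runtime that provides the WHATWG `URL` global (modern browsers, Node.js), and it returns the host in its ASCII-compatible (`xn--`) form instead of the original Unicode text. The helper name `hostnameViaUrl` is hypothetical.

```javascript
// Hypothetical helper: parses the input with the built-in URL class and
// returns its hostname, or null when the input is not an absolute URL.
function hostnameViaUrl(input) {
  try {
    return new URL(input).hostname; // the port and path are not part of hostname
  } catch {
    return null; // new URL() throws a TypeError on unparsable input
  }
}

// Non-ASCII labels come back Punycode-encoded, each starting with "xn--".
console.log(hostnameViaUrl('http://www.例如.中国/'));
console.log(hostnameViaUrl('http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html'));
```

Comparing this output with the regex result requires converting one side between the Unicode and ASCII forms first.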
Regarding "javascript - Regular expression that extracts all subdomains + domain from a URL and is compatible with RFC 3490", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57066708/