html - 如何使用正则表达式来匹配HTML中的charset字符串?

标签 html regex

HTML 代码示例:

<meta http-equiv="Content-type" content="text/html;charset=utf-8" />

我想使用 RegEx 提取字符集信息(即这里是“utf-8”)

(我正在使用 C#)

最佳答案

我的回答提供了@Floyd 的更健壮的版本,并且在可能的程度上解决了@You 的破损测试用例,其中使用了负面前瞻来避免它。我真的只能想到一个相关案例(@You 示例的变体),它会给出误报,但我认为这种情况很少见。表达式应使用不区分大小写的标志运行,并使用 java.util.regex 进行测试。和 JRegex .

捕获组会自动修剪,并且从不包含引号,也不包含其他标记字符,如“/”或“>”。在第二个表达式中,有 2 个捕获组;第一个是内容类型值,它可能是空的(即,当使用字符集属性时),第二个是字符集值,它总是非空的(除非字符集值由于某些奇怪的原因而被字面上留空).

仅用于匹配/分组字符集值的正则表达式 - 修剪,跳过引号

<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([^\s"'/>]*)

与上面相同,但也匹配/分组内容类型(可选)和字符集(必需)值,修剪,跳过引号。小警告 - 未匹配独立内容类型值,即“text/html”

<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s"']*)?([^>]*?)[\s"';]*charset\s*=[\s"']*([^\s"'/>]*)

测试用例(除最后一个外全部通过)...

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" />
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'/>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' />
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1/>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 />
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" >
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' >
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 >

<meta http-equiv="Content-Type" content="text/html;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";charset="iso-8859-1"'>

<meta http-equiv="Content-Type" content="text/html;;;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;;;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;;;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';;;charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;;;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;;;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;;;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";;;charset="iso-8859-1"'>

<meta  http-equiv  =  "  Content-Type  "  content  =  "  '  text/html  '  ;  ;;  '  ;  '  '  ;  '  ;  ' ;;  ;  charset  =  '  iso-8859-1  '  "  >
<meta  content  =  "  '  text/html  '  ;  ;;  '  ;  '  '  ;  '  ;  ' ;;  ;  charset  =  '  iso-8859-1  '  "  http-equiv  =  "  Content-Type  "  >
<meta  http-equiv  =  Content-Type  content  =  text/html;charset=iso-8859-1  >
<meta  content  =  text/html;charset=iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;;;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;;;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;  ;  ;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;  ;  ;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >

<meta charset="utf-8"/>
<meta charset="utf-8" />
<meta charset='utf-8'/>
<meta charset='utf-8' />
<meta charset=utf-8/>
<meta charset=utf-8 />
<meta charset="utf-8">
<meta charset="utf-8" >
<meta charset='utf-8'>
<meta charset='utf-8' >
<meta charset=utf-8>
<meta charset=utf-8 >

<meta  charset  =  "  utf-8  "  >
<meta  charset  =  '  utf-8  '  >
<meta  charset  =  "  utf-8  '  >
<meta  charset  =  '  utf-8  "  >
<meta  charset  =  "  utf-8     >
<meta  charset  =  '  utf-8     >
<meta  charset  =     utf-8  '  >
<meta  charset  =     utf-8  "  >
<meta  charset  =     utf-8     >
<meta  charset  =     utf-8    />

<meta name="title" value="charset=utf-8 — is it really useful (yep)?">
<meta value="charset=utf-8 — is it really useful (yep)?" name="title">
<meta name="title" content="charset=utf-8 — is it really useful (yep)?">
<meta name="charset=utf-8" content="charset=utf-8 — is it really useful (yep)?">

<meta content="charset=utf-8 — is it really useful (nope, not here, but gotta admit pretty robust otherwise)?" name="title">

关于html - 如何使用正则表达式来匹配HTML中的charset字符串?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/3458217/

相关文章:

javascript - 访问在angularjs中不同 Controller 中声明的路由页面 Controller 中的变量

html - 一些 div 在 Bootstrap V3 中没有正确响应

regex - 为什么正则表达式中的转义字符不匹配?

python - 如何使用正则表达式匹配段落

regex -/bb|[^b]{2}/它是如何工作的?

html - 有没有办法在 CSS 2D 光栅化中消除子像素的锯齿?

javascript - 图像像素化的 CSS 变换

Javascript 缓动而不是动画

php - 将 Javascript 正则表达式转换为 PHP (PCRE) 表达式

regex - 正则表达式最小长度