I need to locate and extract image sources from an HTML file. For example, it might contain:

<image class="logo" src="http://example.site/logo.jpg">

or

<img src="http://another.example/picture.png">

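I am using Python and do not want to use any third-party packages. I can use the re module, though. The program should:

Scan through everything
Find the img or image tags
Find src and get the attribute value (without the double quotes)

Is this possible? If so, how would I do it? We can assume I don't need internet access for this (I have a file named website.html that contains all the HTML code).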

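Edit: my current regular expressions are

r'<img[^>]*\ssrc="(.*?)"'

and

r'<image[^>]*\ssrc="(.*?)"'

The main problem is that these expressions pick up anything that starts with img or image. For example, given <imagesomethingrandom src="website">, the second one still treats the tag as an image (because the tag name begins with "image") and adds its source.

Thanks in advance.

Rob.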

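Best answer

Description

This expression will:

Find all image and img tags that have a src attribute
Ignore tags that are not image or img, such as imagesomethingrandom
Capture the value of the src attribute
Correctly handle single-quoted, double-quoted, and unquoted attribute values
Avoid most of the tricky edge cases that tend to trip up regular expressions when matching HTML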

<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
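The one-line pattern is dense, so here is a sketch of the same expression spread out with re.VERBOSE and annotated piece by piece. The names IMG_SRC_PATTERN, sample and sources are only illustrative and do not come from the original answer. One thing the breakdown makes visible: `ima?ge?` also accepts the non-existent tag names `imag` and `imge`, so `(?:img|image)` would be a stricter choice if that matters.

```python
import re

# Same expression as above, split across lines; whitespace and comments are
# ignored by the regex engine because of re.VERBOSE.
IMG_SRC_PATTERN = re.compile(r"""
    <ima?ge?                 # "<img" or "<image" (also "<imag" and "<imge")
    (?=\s|>)                 # the tag name must end here
    (?=                      # lookahead: locate the real src attribute
        (?:                  # lazily step over whatever precedes it:
            [^>=]            #   any character that is not ">" or "="
          | ='[^']*'         #   a whole single-quoted attribute value
          | ="[^"]*"         #   a whole double-quoted attribute value
          | =[^'"][^\s>]*    #   a whole unquoted attribute value
        )*?
        \ssrc=               # whitespace, then the attribute name
        (['"]?)              # group 1: the opening quote, if any
        (.*?)                # group 2: the attribute value
        \1(?:\s|>)           # the same quote again, then whitespace or ">"
    )
    (?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*   # now consume the attributes
    >                        # end of the tag
""", re.VERBOSE | re.IGNORECASE | re.DOTALL)

sample = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<img src=http://another.example/NotQuoted.png>
"""

sources = [match.group(2) for match in IMG_SRC_PATTERN.finditer(sample)]
print(sources)
```

Because quoted attribute values are skipped as whole units, a `src=` that appears inside another attribute's value is never mistaken for the real attribute.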


Example

Sample text

Note the rather difficult edge case in the first line

<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
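To make the edge case concrete, the following sketch runs the two patterns from the question against the first two sample lines. The variable names are made up for illustration. The first pattern stops at the `src="..."` text buried inside the onmouseover value, and the second accepts a tag that merely starts with "image".

```python
import re

# The two patterns quoted in the question.
naive_img = r'<img[^>]*\ssrc="(.*?)"'
naive_image = r'<image[^>]*\ssrc="(.*?)"'

edge_case = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">"""
not_an_image = '<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">'

# Picks up the decoy inside the onmouseover value instead of the real src.
wrong_src = re.findall(naive_img, edge_case)

# Matches even though the tag is neither img nor image.
false_positive = re.findall(naive_image, not_an_image)

print(wrong_src)
print(false_positive)
```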

Python code

#!/usr/bin/env python3
import re

# Sample HTML; the first line is the hard edge case.
string = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
"""

regex = r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>"""

intCount = 0

# re.I ignores case and re.S lets "." match newlines; re.M is kept from the
# original but has no effect here because the pattern uses no ^ or $ anchors.
for matchObj in re.finditer(regex, string, re.M | re.I | re.S):
    print(" ")
    print("[", intCount, "][ 0 ] : ", matchObj.group(0))
    print("[", intCount, "][ 1 ] : ", matchObj.group(1))
    print("[", intCount, "][ 2 ] : ", matchObj.group(2))
    intCount += 1

Capture groups

Group 0 gets the entire image or img tag
Group 1 gets the quote surrounding the src attribute value, if there is one
Group 2 gets the src attribute value

[ 0 ][ 0 ] :  <img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
[ 0 ][ 1 ] :  "
[ 0 ][ 2 ] :  http://another.example/picture.png

[ 1 ][ 0 ] :  <image class="logo" src="http://example.site/logo.jpg">
[ 1 ][ 1 ] :  "
[ 1 ][ 2 ] :  http://example.site/logo.jpg

[ 2 ][ 0 ] :  <img src="http://another.example/DoubleQuoted.png">
[ 2 ][ 1 ] :  "
[ 2 ][ 2 ] :  http://another.example/DoubleQuoted.png

[ 3 ][ 0 ] :  <image src='http://another.example/SingleQuoted.png'>
[ 3 ][ 1 ] :  '
[ 3 ][ 2 ] :  http://another.example/SingleQuoted.png

[ 4 ][ 0 ] :  <img src=http://another.example/NotQuoted.png>
[ 4 ][ 1 ] :  
[ 4 ][ 2 ] :  http://another.example/NotQuoted.png
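Since only group 2 is needed to answer the question, the loop can be reduced to a small helper. This is a minimal sketch rather than part of the original answer: the function names extract_image_sources and extract_from_file are hypothetical, and UTF-8 is an assumed encoding for website.html.

```python
import re

# The pattern from the answer, unchanged.
IMG_TAG = re.compile(
    r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>""",
    re.I | re.S,
)


def extract_image_sources(html):
    """Return the src value of every img/image tag in html, in document order."""
    return [match.group(2) for match in IMG_TAG.finditer(html)]


def extract_from_file(path, encoding="utf-8"):
    """Read a local HTML file and return its image sources (encoding is assumed)."""
    with open(path, encoding=encoding) as handle:
        return extract_image_sources(handle.read())


# Small in-memory demonstration; no file or network access needed.
demo = '<img src="a.png"><image src=\'b.jpg\'><imagesomethingrandom src="c.gif">'
print(extract_image_sources(demo))
```

For the file from the question, the call would be extract_from_file("website.html").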

Regarding "html - Python 3.3.2 - Finding image sources in HTML", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18284454/
