Python regex to find and replace HTML tags with a specific attribute value

Tags: python, regex

I'm trying to write a regular expression in Python that finds every img tag whose src attribute equals a specific value. I tried the following:

   import re

   # thm equals /public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82
   # note: thm is interpolated into the pattern as-is, without escaping
   p = re.compile(r'<img.*?%s.*?>' % thm)
   print(p.pattern)
   print(p.sub(linked_image, c))

Here is the output I get (the pattern, followed by the unchanged HTML):

<img.*?/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82.*?>

<p><img src="/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82" alt=""></p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf </p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 

Best answer

Solution with lxml

To compare the regex and lxml solutions, I created another post.
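For the regex side of that comparison, here is a minimal sketch of why the pattern in the question leaves the HTML unchanged and one way it might be repaired. In the unescaped pattern, `.` matches any character and `?` makes the preceding `g` optional, so the literal `?` in the URL is never matched. The values of `c` and `linked_image` below are assumptions made up for illustration; the question does not show them.

```python
import re

# Values assumed for illustration; only thm comes from the question.
thm = "/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82"
c = '<p><img src="%s" alt=""></p><p>some text</p>' % thm
linked_image = '<a href="/image/82/"><img src="%s" alt=""></a>' % thm

# The original pattern: "g?" means "optional g", so the literal "?" never matches.
unescaped = re.compile(r'<img.*?%s.*?>' % thm)
print(unescaped.search(c))  # None

# re.escape() makes "." and "?" literal; [^>]* keeps the match inside one tag.
p = re.compile(r'<img\b[^>]*\bsrc="%s"[^>]*>' % re.escape(thm))

# A function replacement avoids backslash processing of the replacement text.
result = p.sub(lambda m: linked_image, c)
print(result)
```

This sketch assumes the src value is wrapped in double quotes and contains no `>`; markup that deviates from that is where a regex starts to become fragile.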

A simpler and more robust solution is to use lxml with its etree module. With that approach you access specific DOM elements and edit them directly.

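Parse the HTML string and select the elements with a suitable XPath expression, e.g. .//img. The xpath method returns a list of all matching elements, on which you can get and set the src attribute. The function etree.tostring(tree) returns the edited markup as a string: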

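from lxml import etree

# Parse the HTML string into an element tree
tree = etree.HTML('''<html>
                     <body>
                        <h1>Title</h1>
                        <img src="/media/old/another_logo.png" alt="" />
                        <p>Lorem Ipsum</p>
                        <p><img src="/media/old/logo.png" alt=""/></p>
                     </body>
                  </html>''')

# xpath() returns a list of all matching <img> elements
imgs = tree.xpath('.//img')

for img in imgs:
    print('OLD_SOURCE', img.get('src'))
    img.set('src', '/media/new/python.jpg')

# encoding='unicode' makes tostring() return str instead of bytes on Python 3
print(etree.tostring(tree, encoding='unicode'))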

Output

OLD_SOURCE /media/old/another_logo.png
OLD_SOURCE /media/old/logo.png

<html>
    <body>
        <h1>Title</h1>
            <img src="/media/new/python.jpg" alt=""/>
            <p>Lorem Ipsum</p>
            <p><img src="/media/new/python.jpg" alt=""/></p>
    </body>
</html>
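The example above rewrites every img element, whereas the question only targets images with one particular src. A minimal sketch of how that could look with lxml follows, assuming the goal is to wrap the matching image in a link. The link target `/image/82/` and the sample `html` string are made up for illustration.

```python
from lxml import etree

# Values assumed for illustration; only thm comes from the question.
thm = "/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82"
html = '<p><img src="%s" alt=""></p><p><img src="/other.jpg" alt=""></p>' % thm

tree = etree.HTML(html)

# Passing the value as an XPath variable avoids any quoting or escaping.
for img in tree.xpath('.//img[@src=$src]', src=thm):
    link = etree.Element('a', href='/image/82/')
    # Keep any text that followed the image outside the new link.
    link.tail, img.tail = img.tail, None
    img.getparent().replace(img, link)  # put <a> where the <img> was
    link.append(img)                    # then move the <img> inside it

result = etree.tostring(tree, encoding='unicode', method='html')
print(result)
```

Because the comparison is done on the parsed attribute value, characters such as `.` and `?` in the URL need no special treatment here.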

Source: a similar question about using a Python regex to find and replace HTML tags with a specific attribute value can be found on Stack Overflow: https://stackoverflow.com/questions/20595735/
