python - How to print certain lines of a text file and certain parts of those lines

Tags: python sed awk

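I am trying to extract information from an HTML file (exported Google Chrome bookmarks).

It contains text in the format shown below. I want to extract only the website address, that is, the part that comes after <DT><A HREF= and before ADD_DATE=.

I am considering sed and awk, or Python, so answers in any of the three are welcome.

So far I only know how to print the lines containing <DT><A HREF= with awk: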

awk '/<DT><A HREF="*"/' favorit.html

I think I should combine it with sed.

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
     It will be read and overwritten.
     DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3 ADD_DATE="0" LAST_MODIFIED="1309451494" PERSONAL_TOOLBAR_FOLDER="true">Barre de favoris</H3>
    <DL><p>
        <DT><H3 ADD_DATE="1281455379" LAST_MODIFIED="1309422816">brain</H3>
        <DL><p>
            <DT><A HREF="http://gmazars.info/conf/index.html" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABVklEQVQ4jcWSoVcjQQyHP3i4VM/ojiZ+sRTHva7u2nqulit2/4Haes5XM7WsH/SuntUz+k5wb6AtjxMIoiYvk1+SLzn7cXv7hy/Y+VeSPxRwzqGqiEjxjbUlLiKoavEv3gfWDw9YY0g5Y63l1/0917MZqsrPuzsAFk1z4BeBeV0DsFwuy8epc+y9p65rRIScM6rK3vvTDlSVEEIJ/H58LO8xRqqrK0IIOOdo2/ZU4NiMteSUyDnz3HVUVcVEhL7vGWM8hei952Y2w1j7ymO9ZtE0AOy9R1W5PGr/oAP/9IQxhs1mg4jQdV0Zo+97ckpUVcV2uz0QOPv2QyojrFYrAHa7HSJSiMd/wGQyKUnWmLKxgy3EcSTlXODN65oYIxMRps4x9D1d17FoGtq2ZYzxTSC8vLyCtPatqgjDMJBTYhgGUs6EELh8dy+fQjw+ro/sU4j/Swb4C6whlU0nCEWKAAAAAElFTkSuQmCC">Computer Vision Resource</A>
            <DT><A HREF="http://research.google.com/" ADD_DATE="1281455379">Google Research</A>
            <DT><A HREF="http://research.microsoft.com/en-us/" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABl0lEQVQ4jaWSTWuTQRRGz52575uYhaYF0VZbk0ChyaJWCvpv3Opv0607QfzYuBKtgktTsjAUTFuSkmTembku0uzaEtoDsxlmnjk8d2QWs3ELdB4zN00QQEPO3NTBCWiVbhkQopGXO3aRJAIYZgvNqwMEnVUzYjKcczjnEBGqEBYHvOPSgkTADO8EPRmN+PLxA831dbzzpJSYjMc86XTY6faIMSIimNnCRoSFHzgztKiVPG61ePf2Dd29p+wdHPCoKCkKZTT6hxMhhEC9XieEQIwVXz9/4tnzF2xsbKLJMjFGBkd9ZvMZv75/Q7XgbrPJ8fAvm1vbjM9OqULAeWW73WLQ7/P75yEvX73GhSqQyfT293HeE1Pk/sMHDI7+YJapqsDxcEi2zE6vy3Q6ZavdprO7y+R8grz/cWjzqsKJEGMkzOcAlLUaKUUAUkyUtZKiLDkdnXCn0UDVU5Ylmi0vRgZ4VRpFwbJ6LXRZO4aRUuLe2hpm+WLqhpoZy7UKSysAcw630q1rcKu+fBlmhpaqiFz3Ya+m8J7/gTTD19UuBM8AAAAASUVORK5CYII=">Microsoft Research - Turning Ideas into Reality</A>
            <DT><A HREF="http://techresearch.intel.com/articles/index.html" ADD_DATE="1281455379">Intel Labs</A>
            <DT><A HREF="http://www.ibm.com/developerworks/" ADD_DATE="1309092502" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAWUlEQVQ4jWNkSJ3xn4ECwESJ5uFkwP9Z6XCMjQ8TQwYwPhMDAwMDY9pMBmQaxmZMm4nVEGQxFlxOgylCNhSnF7ABbC6A8ZENhYcButNgYUDIBYyjKXEQGAAAPiUyGXrLJGMAAAAASUVORK5CYII=">IBM developerWorks : IBM&#39;s resource for developers and IT professionals</A>
            <DT><A HREF="http://www.siam.org/" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAWklEQVQ4jd2RwQrAIAxDU7+8fvnbYR6s2sMUxlguoYGGVyrJkWDfj5YdfY8AeJmATj3BKo9ZKwmDnGzO/BHBXADFTDKruhV9zkfV1XW5RgIAa7/cVjlZ/knBBdfjeU7uim0TAAAAAElFTkSuQmCC">SIAM: Society for Industrial and Applied Mathematics</A>
            <DT><A HREF="http://www.wolfram.com/" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAC7UlEQVQ4jYWTX0hTcRzFz3Y37931zv2x5dzUiWZz2TQFkUVYVEpiVFCziB4CoyAYQUEGQmi9S0QREb3ZQ2aEoVQP1bJSg7SmpqKmU6bOOV3debfddb23p8XUovP2+f3O98eB7/kR+I88N1sOGfc6ptxut/S3e3kyPL7SUOnZV7szwQONjZbC9s6uw/OgEmdtTpehraVFlWBF8gM+glDaFkLvArsPNLMG6rP2ScctbnxEjtiqq83hfMCHZa6MZcG4yLKXEjOyjZHuq8uk89wPSA5ZNKgSVYsCoOyJYUxfOMP4CYulsvBoQfed539NAABzKUK4h9eqrVazSqjIQOoKh4mVUUwPRiwnwUCg+YVkvwwAWk9cvGp6OnidyhFSQ3QckVEa+jwGYq4Sy1wEA0OrKNBTOCdGgFg8+K3Y4qBEmleAZ+QAMFdjuWcu1wUrFCyKi7Kxq0SJ0h06pOtSQGrWYDDGkG9TAFV6CCuLW4rcfROGn3Oz0QLTMQUANNQ3hKvPnq6SB7XjSsUatDY1mD0lILcYIPnGIIo9EM06hG0p4DI5iGtUgDdqqm0P73r+rFEd4JvHPUuY6f+OeKYeqwYaZLoGuepM5DBpYPu8WJrmQaQo4Cste533qtMDAMRtl4u8orS+CLwfONoXWgIoEiRDIaoiEedXsdU7j1/Dk/AOs/CNxqEJk0gLsvaay/UfW7s/TBG128v3+yd9W32IunO1wvfsckuJCBmEeT/IsB/E9DQENg5Br8TEzzC+cDKEBAqpoR/HD5+u82/qwYi9aFm1w6yfGZsFHViA0Z6H7OJ8oN+L/qFfeLst69rX8eC9PC1jNWenra/BXaeT6YBVemOxS48MJvYZ0iXOapciNrMUAi0tKR3S26YmKnlm3V+Q4prKwMHSC1MmUzpLph2hqUIh7KPRKdeSOkRkfiK2rGNMzo2p/6mX5acGXtaeqU1wW5ZD1XWjxZ7s2VTlZHGVxe38vPdTgut8vVFc7x1K9vwGs2EnwS3T7R0AAAAASUVORK5CYII=">Wolfram Research: Mathematica, Technical and Scientific Software</A>
            <DT><A HREF="http://www.mathworks.com/" ADD_DATE="1281455379">The MathWorks - MATLAB and Simulink for Technical Computing</A>
            <DT><A HREF="http://www.youtube.com/user/GoogleTechTalks" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABS0lEQVQ4jZ2SP0hCURTGf+clQdAQ2tKsUHNbQ9SgCI5BU/sbpSnaWrPRxamlyUJsaKlcWqIaoigKk+ckNKVOBYF6Gu57r/fyT9a3nO+ce893zr3noKr6VM7qTmZaVVWTdkkPLhs6LvAJaPvQ2L/Av/18nFXAr36eTyugdu7GLxC0fQLvr0cKaLXT0w+XewnVTm+ogMWY2N+M8dLVvnjEIyJiiCpTc+uc5dOICHbumvkJobuywULEImmXQgLitvVvRKjVILPk6QFj6ilweoVoPKo4ze+DRCzYIDhvP2JhmE9MzA5I9sqMhhW6uLgMTtO4TtP4uLxQNLxQhNs6lC+AwBRC8CbiIREzIlu7kEq5wcYIgd9a958qAwQqFUKfendv/JMT2NsGS4z/8Ahrq2YK7c+W3/mwrQjsGWKKMzMZNVzj0UDasF3oj0u9JV8pACCxycNu1gAAAABJRU5ErkJggg==">YouTube - Chaîne de GoogleTechTalks</A>
            <DT><A HREF="http://groups.csail.mit.edu/vision/welcome/" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAB0klEQVQ4jY2T30tTYRjHP+mcp8mWzBFKJVk4CyVZI+wH1Z1FQtFdV3Uxr4bX/gkS4eXoquxGvOgmuhg0jWRRwhgrBks7Z81AZzuy49Kj+6Fm62LsXafZYd+r9+F9v5/neb+875FyuVzGRJlPIZR3MwDslQoADNwZpeviLQCazMwAywuv0bNpAKySjZ8/kuiZpNi3mJl3dY3cqswV3wQu9xAA4clHFAuFwwFyMIC6FBV1tbO982zNILUZmgjArq4RDz6jxzuM88wFceDYiV5aHS5R2xzt/D74VQ/IfYtx1N7BJd8Ts1vVSYSoJsI4T/U1ZDrIb9YD1lMJOvuvNgTYK+6ItQVAUyLsl/Kc9N5uCFDYzKIpkRpgay1Ji9RmCKsqTYmQjofJylFK2zmK2xsAfH45iVWyVQDp2BzHz182GOVgAOX9KwAku5Nuz00cXb2oiTAAnocTlQlS89OsL8fpv+sHIDU/zeKbFwC4b9zn9PUHhsn0TBJdXalloH5ZoMc7jMs9RPT5ON9jswyOjNI3MvbfDAwhXht7CsDHgJ/cqsy9x28PzeLfEAUAKj9ubbEyyVJoyvDS/lZTswU18YHmllYjoKqNla+mnQH2S3k6us+J+g88sKj7zLO6HwAAAABJRU5ErkJggg==">MIT CSAIL Computer Vision Research Group</A>
            <DT><A HREF="http://www.youtube.com/watch?v=9K8X__I2O2A&feature=related" ADD_DATE="1281455379" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABS0lEQVQ4jZ2SP0hCURTGf+clQdAQ2tKsUHNbQ9SgCI5BU/sbpSnaWrPRxamlyUJsaKlcWqIaoigKk+ckNKVOBYF6Gu57r/fyT9a3nO+ce893zr3noKr6VM7qTmZaVVWTdkkPLhs6LvAJaPvQ2L/Av/18nFXAr36eTyugdu7GLxC0fQLvr0cKaLXT0w+XewnVTm+ogMWY2N+M8dLVvnjEIyJiiCpTc+uc5dOICHbumvkJobuywULEImmXQgLitvVvRKjVILPk6QFj6ilweoVoPKo4ze+DRCzYIDhvP2JhmE9MzA5I9sqMhhW6uLgMTtO4TtP4uLxQNLxQhNs6lC+AwBRC8CbiIREzIlu7kEq5wcYIgd9a958qAwQqFUKfendv/JMT2NsGS4z/8Ahrq2YK7c+W3/mwrQjsGWKKMzMZNVzj0UDasF3oj0u9JV8pACCxycNu1gAAAABJRU5ErkJggg==">YouTube - Hello World through custom UART to HyperTerminal</A>

Best answer

A very quick solution is to use a regular expression in Python.

Assuming the variable s contains your HTML string:

import re

s = '''   <DT><A HREF="http://gmazars.info/conf/index.html" 
            <DT><A HREF="http://research.google.com/" 
            <DT><A HREF="http://research.microsoft.com/en-us/" 
            <DT><A HREF="http://techresearch.intel.com/articles/index.html" 
'''

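# Non-greedy capture of everything between HREF=" and the next double quote.
# print() with parentheses runs on Python 3 (and on Python 2 as well).
print(re.findall(r'HREF="(.*?)"', s))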
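
The snippet above works on a string that is already in memory. To tie it back to the question, here is a minimal sketch that restricts the match to bookmark entries (`<DT><A HREF=`) so that folder lines such as `<DT><H3 ...>` are skipped, and that can read the exported file directly. The names `LINK_RE`, `extract_links` and `extract_links_from_file` are made up for illustration; the UTF-8 default is an assumption based on the META tag in the sample file.

```python
import re

# Match only bookmark entries: <DT><A HREF="...">.
# Folder headings use <DT><H3 ...> and are therefore ignored.
LINK_RE = re.compile(r'<DT><A HREF="([^"]*)"', re.IGNORECASE)


def extract_links(text):
    """Return every bookmark URL found in the exported HTML text."""
    return LINK_RE.findall(text)


def extract_links_from_file(path, encoding="utf-8"):
    """Read an exported bookmarks file and return the URLs it contains."""
    with open(path, encoding=encoding) as handle:
        return extract_links(handle.read())


# Small in-memory sample taken from the question's file format.
sample = '''
    <DT><H3 ADD_DATE="1281455379" LAST_MODIFIED="1309422816">brain</H3>
    <DT><A HREF="http://research.google.com/" ADD_DATE="1281455379">Google Research</A>
    <DT><A HREF="http://www.mathworks.com/" ADD_DATE="1281455379">The MathWorks</A>
'''

for url in extract_links(sample):
    print(url)
```

For the file from the question, the call would look like `extract_links_from_file("favorit.html")`. A regular expression is enough here because the export puts each bookmark on a single line in a fixed layout; for arbitrary HTML, a real parser is usually the safer choice.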

Regarding "python - How to print certain lines of a text file and certain parts of those lines", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/6551025/
