python - 使用python将HTML页面中的所有链接替换为本地地址

标签 python html regex beautifulsoup

我刚开始使用 python 处理 html 页面，我正在尝试创建一个非常简单的网络爬虫。我已经设法下载了我正在使用的网站上的所有链接，但要完全离线，我需要将网站上的所有 URL 替换为本地地址，例如: 我已将页面“www.domain.com/news”保存在下一个路径下:“myfile/sub0/0” 我如何使用 python 替换我下载到地址的每个 html 页面中的 URL？我已经使用此正则表达式获得了链接列表:

urls = re.findall('href=[\'"]?(http://[^\'" >]+)', htmlSource)

最佳答案

从目录中读取每个 html 文件，并执行如下所示的正则表达式替换，将 url 更改为您的 url。这是更改 href 链接的示例。

response = """
<a class="abc" href="http://www.example.com/abc.py">link a</a>
<a class="xyz" href="/xyz.py">link x</a>
"""
response = re.sub("(<a [^>]*href\s*=\s*['\"])(https?://www\.example\.com)?/?", "\\1myfile/sub0/0/", response)
print response;

输出:

<a class="abc" href="myfile/sub0/0/abc.py">link a</a>
<a class="xyz" href="myfile/sub0/0/xyz.py">link x</a>

请注意，您可能需要调整正则表达式以满足您的需要。

关于python - 使用python将HTML页面中的所有链接替换为本地地址，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/20924492/

上一篇：R glmnet : segmentation fault when using multinomial and pmax

下一篇：google-translate - 如何让谷歌翻译显示原始语言名称

相关文章：

python - 为什么 pickle.dump 不写入文件？

python - 正则表达式:如何查找括号内的字符 *except*

python - 在python中实现COM接口(interface)类型库

python - 处理 scikit-learn tree.decisiontreeclassifier 中的维度

python - 在 stackless python 中，通过 channel 发送的数据是不可变的吗？

css - float div定位

html - 导航不会右对齐

html - 无法将 html 元素样式设置为透明

regex - 使用正则表达式从字符串中提取变量

javascript - 匹配 101 - 999 的正则表达式模式