python - Removing base URLs

Tags: python web-scraping beautifulsoup

I wrote a Python script to extract the href value of every link on a given web page:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://kteq.in/services")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

When I run the code above, I get the following output, which includes both external and internal links:

index
index
#
solutions#internet-of-things
solutions#online-billing-and-payment-solutions
solutions#customer-relationship-management
solutions#enterprise-mobility
solutions#enterprise-content-management
solutions#artificial-intelligence
solutions#b2b-and-b2c-web-portals
solutions#robotics
solutions#augement-reality-virtual-reality
solutions#azure
solutions#omnichannel-commerce
solutions#document-management
solutions#enterprise-extranets-and-intranets
solutions#business-intelligence
solutions#enterprise-resource-planning
services
clients
contact
#
#
#
https://www.facebook.com/KTeqSolutions/
#
#
#
#
#contactform
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
index
services
#
contact
#
iOSDevelopmentServices
AndroidAppDevelopment
WindowsAppDevelopment
HybridSoftwareSolutions
CloudServices
HTML5Development
iPadAppDevelopment
services
services
services
services
services
services
contact
contact
contact
contact
contact
None
https://www.facebook.com/KTeqSolutions/
#
#
#
#

I want to remove the external links that have a full URL, such as https://www.facebook.com/KTeqSolutions/, while keeping links like solutions#internet-of-things. How can I do this efficiently?

Best answer

If I understand you correctly, you can try something like this:

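l = []
for link in soup.findAll('a'):
    href = link.get('href')
    print href
    l.append(href)
# Drop None and empty values first (an <a> tag without href yields None, and
# '"www" not in None' raises TypeError); testing for 'https' works the same way.
l = [x for x in l if x and "www" not in x]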
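
The substring test above depends on the external links happening to contain "www". A slightly more general check is sketched below, assuming Python 3 and assuming that any href carrying its own host counts as external. The helper names is_external and internal_links are hypothetical, and the sample list is copied from the output in the question rather than fetched from the site.

```python
from urllib.parse import urlparse


def is_external(href):
    """Return True when href carries its own host, e.g. https://example.com/x."""
    # urlparse fills in netloc only for absolute ("https://host/...") and
    # protocol-relative ("//host/...") URLs; relative paths and bare
    # fragments such as "services" or "#contactform" leave it empty.
    return bool(urlparse(href).netloc)


def internal_links(hrefs):
    """Keep relative links, dropping None, empty strings and external URLs."""
    return [h for h in hrefs if h and not is_external(h)]


# Sample values copied from the output in the question.
sample = [
    "index",
    "#",
    "solutions#internet-of-things",
    "https://www.facebook.com/KTeqSolutions/",
    None,
    "#contactform",
]

print(internal_links(sample))
```

In the original script, the list passed to internal_links would be built from link.get('href') inside the loop. On Python 2 the import would be from urlparse import urlparse instead. Note that this sketch also drops absolute URLs that point back to the same site, and keeps schemes without a host such as mailto:, so it may need adjusting for other pages.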

Regarding python - Removing base URLs, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51836884/
