html - Extracted website name contains HTML code in Python 2.7

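I have a problem with a Python script that downloads business-directory data such as company name, address, location, and website URL.

When the script extracts a company's website name (for example, www.example.com), it captures the surrounding HTML markup instead of just the name, and that HTML is what gets stored in the website field on the MySQL server.

I am using the following Python libraries: BeautifulSoup, lxml, html, hashlib, and urllib2. This is the website value, HTML included, that ends up stored on the MySQL server: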

<input><tr><td>www.example.com</td></tr></input>

I want to strip these HTML tags and store only the company's website address (such as www.example.com) on the MySQL server.

Here is my code:

for hit in soup2.findAll(attrs={'id' : 'webSite_0'}):
    web = str(hit).replace('<input type="hidden" value="', '')
    web = web.replace('" id="webSite_0" />', '')
if web == "":
    flog.write("\nWebsite extraction... Failed")
    print "None"
else:
    flog.write("\nWebsite extraction... OK")
    print web
    companyObj.setWeb(web)

Any solution or suggestion on how to fix this would be appreciated.

Best answer

You have (at least) two options: use re or use BeautifulSoup.

Using re

import re
cleanse_url = re.compile(r'<[^>]*>')  # matches any HTML tag

web = ""  # stays empty if no matching element is found
for hit in soup2.findAll(attrs={'id' : 'webSite_0'}):
    web = str(hit).replace('<input type="hidden" value="', '')
    web = web.replace('" id="webSite_0" />', '')
if web == "":
    flog.write("\nWebsite extraction... Failed")
    print "None"
else:
    web = cleanse_url.sub('', web)  # strip any remaining HTML tags
    flog.write("\nWebsite extraction... OK")
    print web
    companyObj.setWeb(web)
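
To see what the regular expression does in isolation, here is a minimal standalone sketch that needs only the standard library. The helper name `strip_tags` and the sample string are assumptions made for illustration; they are not part of the original script.

```python
import re

# Matches anything that looks like an HTML tag: '<', then any characters
# other than '>', then '>'.
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags(fragment):
    """Return `fragment` with tag-like substrings removed and whitespace trimmed."""
    return TAG_RE.sub('', fragment).strip()


# Sample value modelled on the markup shown in the question (assumed input).
sample = '<input><tr><td>www.example.com</td></tr></input>'
print(strip_tags(sample))
```

This pattern is a blunt tool: it does not understand HTML, so a literal `>` inside an attribute value would cut a tag short. For a short, predictable fragment like the one above that is usually acceptable.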

Using BeautifulSoup.Tag.text

I think this option is better, because tag.text drops the tags and their attributes and returns only the text content.

web = ""  # stays empty if no matching element is found
for hit in soup2.findAll(attrs={'id' : 'webSite_0'}):
    web = hit.text  # let BeautifulSoup return only the text content
if web == "":
    flog.write("\nWebsite extraction... Failed")
    print "None"
else:
    flog.write("\nWebsite extraction... OK")
    print web
    companyObj.setWeb(web)
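
A hidden input normally keeps its data in the value attribute rather than in element text, so tag.text alone may come back empty for that kind of markup. The sketch below reads the attribute first and falls back to the element text. It assumes the bs4 package (BeautifulSoup 4) with the built-in "html.parser" backend, which may differ from the BeautifulSoup version used in the original script; the function name `extract_website` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_website(page_html, element_id="webSite_0"):
    """Return the plain-text website stored in the element with the given id.

    Returns an empty string when no such element exists.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    hit = soup.find(attrs={"id": element_id})
    if hit is None:
        return ""
    # Prefer the value attribute (hidden inputs carry their data there),
    # otherwise use whatever text the element contains.
    raw = hit.get("value") or hit.get_text()
    # The stored value may itself contain markup, so parse it once more
    # and keep only the text.
    return BeautifulSoup(raw, "html.parser").get_text().strip()


# Assumed sample: the value attribute holds entity-encoded table markup.
html_doc = (
    '<input type="hidden" '
    'value="&lt;tr&gt;&lt;td&gt;www.example.com&lt;/td&gt;&lt;/tr&gt;" '
    'id="webSite_0" />'
)
print(extract_website(html_doc))
```

Returning an empty string for a missing element keeps the existing `if web == ""` check in the script usable without changes.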

Regarding "html - Extracted website name contains HTML code in Python 2.7", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31233407/
