python - Scraping web pages from a directory of HTML files with BS4 and Python

Tags: python, linux, web-scraping, beautifulsoup

I have a website where each person's details are stored in a separate .html file. So the details of 100 people in total are stored in 100 different .html files, but they all share the same HTML structure.

Here is the site link: http://www.coimbatore.com/doctors/home.htm

So if you look at this site, you'll see there are many categories, and all the doctors' .html files live in the same directory.

http://www.coimbatore.com/doctors/cardiology.htm

there are links to 5 doctors. If I click any doctor's name, it takes me to

http://www.coimbatore.com/doctors/<that doctor's name>.htm. So if I'm not mistaken, all the files are in the same directory, /doctors/. How can I scrape each doctor's details?

I was planning to wget all the file URLs under http://www.coimbatore.com/doctors/, save them locally, and merge them into a single whole.html file using join on Linux. Is there a better way?
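For what it's worth, if the pages are already mirrored locally (e.g. via `wget -r`), there is no need to join them into one whole.html: you can iterate over the directory and parse each file on its own. A minimal Python 3 sketch (the directory layout and file contents below are stand-ins, not the real site):

```python
import glob
import os
import tempfile

def collect_pages(directory):
    """Return {filename: html} for every .htm file in `directory`."""
    pages = {}
    for path in sorted(glob.glob(os.path.join(directory, '*.htm'))):
        with open(path) as f:
            pages[os.path.basename(path)] = f.read()
    return pages

# Demo with a throwaway directory standing in for the wget mirror.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, 'cardiology.htm'), 'w') as f:
    f.write('<html>demo</html>')

print(collect_pages(tmp))  # {'cardiology.htm': '<html>demo</html>'}
```

Each entry can then be fed to a parser (e.g. BeautifulSoup) individually, which avoids producing one giant concatenated file.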

Update

import urllib2

letters = ['doctor1', 'doctor2']  # ... one entry per doctor page name
for letter in letters:
    try:
        page = urllib2.urlopen("http://www.coimbatore.com/doctors/{}.htm".format(letter))
    except urllib2.HTTPError:
        continue
    else:
        html = page.read()  # parse this, e.g. with BeautifulSoup

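To build that `letters` list without hardcoding names, one option is to scan the category pages for `.htm` links first. A standard-library sketch (Python 3; the assumption that doctor links are plain `<a href="name.htm">` anchors comes from the site description above):

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect every href ending in .htm from an HTML document."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.endswith('.htm'):
                    self.hrefs.append(value)

# Inline sample standing in for a downloaded category page.
sample = '<body><a href="cardiology.htm">Cardiology</a><a href="mailto:x">mail</a></body>'
collector = HrefCollector()
collector.feed(sample)
print(collector.hrefs)  # -> ['cardiology.htm']
```

The collected hrefs can then drive the download loop above instead of a manual list.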
Best Answer

One approach is to use Scrapy:

Create the project:

scrapy startproject doctors && cd doctors

Define the data you want to load (items.py):

from scrapy.item import Item, Field

class DoctorsItem(Item):
    doctor_name = Field()
    qualification = Field()
    membership = Field()
    visiting_hospitals = Field()
    phone = Field()
    consulting_hours = Field()
    specialist_in = Field()

Create the spider. The basic template seems enough for the task:

scrapy genspider -t basic doctors_spider 'coimbatore.com'

Modify it so it keeps yielding Request objects until it reaches the pages that contain each doctor's information:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from doctors.items import DoctorsItem
from scrapy.http import Request
from urlparse import urljoin

class DoctorsSpiderSpider(BaseSpider):
    name = "doctors_spider"
    allowed_domains = ["coimbatore.com"]
    start_urls = [ 
        'http://www.coimbatore.com/doctors/home.htm'
    ]   


    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        for row in hxs.select('/html/body/center[1]/table[@cellpadding = 0]'):
            i = DoctorsItem()
            i['doctor_name'] = '|'.join(row.select('./tr[1]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
            i['qualification'] = '|'.join(row.select('./tr[2]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
            i['membership'] = '|'.join(row.select('./tr[3]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
            i['visiting_hospitals'] = '|'.join(row.select('./tr[4]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
            i['phone'] = '|'.join(row.select('./tr[5]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
            i['consulting_hours'] = '|'.join(row.select('./tr[6]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
            i['specialist_in'] = '|'.join(row.select('./tr[7]/td[2]//font[@size = -1]/text()').extract()).replace('\n', ' ')
            yield i

        for url in hxs.select('/html/body/center[3]//a/@href').extract():
            yield Request(urljoin(response.url, url), callback=self.parse)

        for url in hxs.select('/html/body//a/@href').extract():
            yield Request(urljoin(response.url, url), callback=self.parse)

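The urljoin calls in the spider turn the relative hrefs found on each page into absolute URLs before they are requested. A quick standalone illustration (Python 3 import shown; under Python 2, as in the spider above, it is `from urlparse import urljoin`):

```python
# urljoin resolves an href relative to the page it was found on.
from urllib.parse import urljoin

base = 'http://www.coimbatore.com/doctors/home.htm'
print(urljoin(base, 'cardiology.htm'))    # http://www.coimbatore.com/doctors/cardiology.htm
print(urljoin(base, '/doctors/ent.htm'))  # http://www.coimbatore.com/doctors/ent.htm
print(urljoin(base, 'http://other.example/x.htm'))  # absolute URLs pass through unchanged
```

Note that Scrapy's scheduler deduplicates repeated requests, so yielding the same resolved URL from several pages does not cause it to be fetched twice.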
Run it like this:

scrapy crawl doctors_spider -o doctors.csv -t csv

This creates a CSV file like:

phone,membership,visiting_hospitals,qualification,specialist_in,consulting_hours,doctor_name
(H)00966 4 6222245|(R)00966 4 6230143 ,,Domat Al Jandal Hospital|Al Jouf |Kingdom Of Saudi Arabia ,"MBBS, MS, MCh ( Cardio-Thoracic)",Cardio Thoracic Surgery,,Dr. N. Rajaratnam
210075,FRCS(Edinburgh) FIACS,"SRI RAMAKRISHNA HOSPITAL|CHEST CLINIC,COWLEY BROWN ROAD,R.S.PURAM,CBE-2","MD.,DPPR.,FACP",PULMONOLOGY/ RESPIRATORY MEDICINE,"9-1, 5-8",DR.T.MOHAN KUMAR
+91-422-827784-827790,Member -IAPMR,"Kovai Medical Center & Hospital, Avanashi Road,|Coimbatore-641 014","M.B.B.S., Dip.in. Physical Medicine & Rehabilitation","Neck and Back pain, Joint pain, Amputee Rehabilitation,|Spinal cord Injuries & Stroke",9.00am to 5.00pm (Except Sundays),Dr.Edmund M.D'Couto
+91-422-303352,*********,"206, Puliakulam Road, Coimbatore-641 045","M.B.B.S., M.D., D.V.",Sexually Transonitted Diseases.,5.00pm - 7.00pm,Dr.M.Govindaswamy
...
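Since the spider joins multi-value fields with `|`, here is a small sketch of reading the CSV back and splitting those fields into lists (Python 3; the inline sample stands in for doctors.csv):

```python
import csv
import io

def split_hospitals(csv_text):
    """Yield (doctor_name, [hospitals]) pairs from CSV text with '|'-joined fields."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield row['doctor_name'], row['visiting_hospitals'].split('|')

sample = 'doctor_name,visiting_hospitals\n"Dr. A","Hospital One|Hospital Two"\n'
for name, hospitals in split_hospitals(sample):
    print(name, hospitals)  # Dr. A ['Hospital One', 'Hospital Two']
```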

A similar question about python - scraping web pages from a directory of HTML files with BS4 and Python can be found on Stack Overflow: https://stackoverflow.com/questions/20146230/
