python - Screen scraping with BeautifulSoup in Python

标签 python web-scraping beautifulsoup screen-scraping

<div class="members_box_second">
                    <div class="members_box0">
                        <p>1</p>
                    </div>
                    <div class="members_box1">
                        <p class="clear"><b>Name:</b><span>Mr.Jagadhesan.S</span></p>
                        <p class="clear"><b>Designation:</b><span>Proprietor</span></p>
                        <p class="clear"><b>CODISSIA - Designation:</b><span>(Founder President, CODISSIA)</span></p>
                        <p class="clear"><b>Name of the Industry:</b><span>Govardhana Engineering Industries</span></p>
                        <p class="clear"><b>Specification:</b><span>LIFE</span></p>
                        <p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
                    </div>
                    <div class="members_box2">
                        <p>Ukkadam South</p>
                        <p class="clear"><b>Phone:</b><span>2320085, 2320067</span></p>
                        <p class="clear"><b>Email:</b><span><a href="mailto:jagadhesan@infognana.com">jagadhesan@infognana.com</a></span></p>                       
                    </div>
</div>
<div class="members_box">
                    <div class="members_box0">
                        <p>2</p>
                    </div>
                    <div class="members_box1">
                        <p class="clear"><b>Name:</b><span>Mr.Somasundaram.A</span></p>
                        <p class="clear"><b>Designation:</b><span>Proprietor</span></p>

                        <p class="clear"><b>Name of the Industry:</b><span>Everest Engineering Works</span></p>
                        <p class="clear"><b>Specification:</b><span>LIFE</span></p>
                        <p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
                    </div>
                    <div class="members_box2">
                        <p>Alagar Nivas, 284 NSR Road</p>
                        <p class="clear"><b>Phone:</b><span>2435674</span></p>      
                        <h4>Factory Address</h4>
                        Coimbatore - 641 027
                        <p class="clear"><b>Phone:</b><span>2435674</span></p>
                    </div>
</div>

I have the structure above. From it, I am trying to scrape only the text inside the divs with classes members_box1 and members_box2.

I have the following script, which gets data only from members_box1:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('div',{'class':'members_box1'}):
    data =  [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print ','.join(data)

This is how I tried to get the data from both boxes:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
soup = BeautifulSoup(page.read())
eachbox2 = soup.findAll('div ', {'class':'members_box2'})
for eachuniversity in soup.findAll('div',{'class':'members_box1'}):
    data =  eachbox2 + [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print data

But I get the same result as I did from members_box1 alone.

Update

I want the output of each iteration to be a single line, like this:

Name:,Mr.Srinivasan.N,Designation:,Proprietor,CODISSIA - Designation:,(Past President, CODISSIA),Name of the Industry:,Arian Soap Manufacturing Co,Specification:,LIFE,Date of Admission:,19.12.1969, "Parijaat" 26/1Shanker Mutt Road, Basavana Gudi,Phone:,2313861

But instead I am getting the following:

Name:,Mr.Srinivasan.N,Designation:,Proprietor,CODISSIA - Designation:,(Past President, CODISSIA),Name of the Industry:,Arian Soap Manufacturing Co,Specification:,LIFE,Date of Admission:,19.12.1969
"Parijaat" 26/1Shanker Mutt Road, Basavana Gudi,Phone:,2313861

Best Answer

You can use a regex to match both members_box1 and members_box2:

import re
eachbox = soup.findAll('div', {'class':re.compile(r'members_box[12]')})
for eachuniversity in eachbox:
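As a quick sanity check, the pattern can be tested against the class names that appear in the markup above (a stdlib-only sketch, independent of BeautifulSoup):

```python
import re

pattern = re.compile(r'members_box[12]')

# every class name that occurs in the sample HTML
classes = ['members_box_second', 'members_box0', 'members_box1',
           'members_box2', 'members_box']
matched = [c for c in classes if pattern.match(c)]
print(matched)  # ['members_box1', 'members_box2']
```

The character class [12] rejects members_box0 and the outer container divs, so only the two boxes of interest are selected.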

For example:

import bs4 as bs
import urllib2
import re
import csv

page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
content = page.read()
soup = bs.BeautifulSoup(content)

with open('/tmp/ccc.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',', lineterminator='\n')
    # select every members_box1 and members_box2 div, in document order
    eachbox = soup.find_all('div', {'class': re.compile(r'members_box[12]')})
    # pair consecutive divs (box1, box2) and write one CSV row per member
    for pair in zip(*[iter(eachbox)]*2):
        writer.writerow([text.strip() for item in pair for text in item.stripped_strings])
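The pairing-and-writing step can be seen in isolation with plain lists standing in for the stripped_strings of each div (a stdlib-only sketch with invented sample data, no network access needed):

```python
import csv
import io

# flat list alternating members_box1 / members_box2 contents,
# in the document order that find_all would return them
eachbox = [
    ['Name:', 'Mr.Jagadhesan.S'],            # member 1, box1
    ['Ukkadam South', 'Phone:', '2320085'],  # member 1, box2
    ['Name:', 'Mr.Somasundaram.A'],          # member 2, box1
    ['Alagar Nivas', 'Phone:', '2435674'],   # member 2, box2
]

buf = io.StringIO()
writer = csv.writer(buf, lineterminator='\n')
for pair in zip(*[iter(eachbox)] * 2):
    # flatten the (box1, box2) pair into one CSV row
    writer.writerow([text for item in pair for text in item])

print(buf.getvalue())
# Name:,Mr.Jagadhesan.S,Ukkadam South,Phone:,2320085
# Name:,Mr.Somasundaram.A,Alagar Nivas,Phone:,2435674
```

(Python 3 syntax here; writing into io.StringIO just makes the result easy to inspect instead of going to a file.)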

Note that you have to remove the stray space after div in

soup.findAll('div ')

for BeautifulSoup to find any <div> tags at all.


The code above uses the very handy grouper idiom:

zip(*[iter(iterable)]*n)

This expression collects n items at a time from iterable and groups them into a tuple, so it lets you iterate over the iterable in blocks of n items. How the grouper idiom works is explained in more detail here.
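A minimal illustration of the idiom on its own (Python 3 syntax):

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']

# the single iterator object is repeated twice, so each zip step
# pulls two consecutive items into one tuple
pairs = list(zip(*[iter(letters)] * 2))
print(pairs)  # [('a', 'b'), ('c', 'd'), ('e', 'f')]
```

This is why the pairing in the answer works: find_all returns the members_box1 and members_box2 divs interleaved in document order, and the grouper collects them two at a time, one member per tuple.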

Regarding python - screen scraping with BeautifulSoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19977798/
