python - Scraping local HTML files with Urllib and BeautifulSoup

Tags: python loops web-scraping beautifulsoup urllib

I am very new to Python and have spent the past two weeks writing the code below from scratch to scrape local files. I've put in close to a hundred hours learning as much as I can about Python, versioning, and importing packages (e.g. lxml, bs4, requests, urllib, os, glob, etc.).

I am hopelessly stuck on the first part: taking 12,000 oddly named HTML files in one directory and loading and parsing each of them with BeautifulSoup. I want to get all of this data into a CSV file, or at least printed to output so I can copy it into a file via the clipboard.

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

#THIS LOCAL FILE WORKS PERFECTLY. I HAVE 12,000 HTML FILES IN THIS DIRECTORY TO PROCESS.  HOW?
#my_url = 'file://127.0.0.1/C:\\My Web Sites\\BioFachURLS\\www.organic-bio.com\\en\\company\\1-SUNRISE-FARMS.html'
my_url = 'http://www.organic-bio.com/en/company/23694-MARTHOMI-ALLERGY-FREE-FOODS-GMBH'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

# grabs each field
contactname = page_soup.findAll("td", {"itemprop": "name"})
contactstreetaddress = page_soup.findAll("td", {"itemprop": "streetAddress"})
contactpostalcode = page_soup.findAll("td", {"itemprop": "postalCode"})
contactaddressregion = page_soup.findAll("td", {"itemprop": "addressRegion"})
contactaddresscountry = page_soup.findAll("td", {"itemprop": "addressCountry"})
contactfax = page_soup.findAll("td", {"itemprop": "faxNumber"})
contactemail = page_soup.findAll("td", {"itemprop": "email"})
contactphone = page_soup.findAll("td", {"itemprop": "telephone"})
contacturl = page_soup.findAll("a", {"itemprop": "url"})

#Outputs as text without tags
Company = contactname[0].text
Address = contactstreetaddress[0].text
Zip = contactpostalcode[0].text
Region = contactaddressregion[0].text
Country = contactaddresscountry[0].text
Fax = contactfax[0].text
Email = contactemail[0].text
Phone = contactphone[0].text
URL = contacturl[0].text

#Prints with comma delimiters

print(Company + ', ' + Address + ', ' + Zip + ', ' + Region + ', ' + Country + ', ' + Fax + ', ' + Email + ', ' + Phone + ', ' + URL)
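For local files you don't need urllib at all: BeautifulSoup can parse an open file handle or a plain string directly, which sidesteps the commented-out file:// URL entirely. A minimal sketch, using a made-up inline HTML snippet in place of one of the real company pages (the markup below is a stand-in, not the actual organic-bio.com structure):

```python
from bs4 import BeautifulSoup

# A made-up stand-in for the contents of one local company page
sample_html = """
<table>
  <tr><td itemprop="name">Sunrise Farms</td></tr>
  <tr><td itemprop="telephone">+1 555 0100</td></tr>
</table>
"""

# Parsing a string -- an open file handle from open('page.html') works the same way
page_soup = BeautifulSoup(sample_html, "html.parser")

company = page_soup.find("td", {"itemprop": "name"}).text
phone = page_soup.find("td", {"itemprop": "telephone"}).text
print(company + ', ' + phone)  # → Sunrise Farms, +1 555 0100
```

The same `find`/`findAll` calls from the code above work unchanged, because the parser does not care whether the markup came from a network response or from disk.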

Best Answer

I have worked with folders containing large numbers of files before, so I can offer some help.

We will start with a for loop over the files in the folder:

import os
from bs4 import BeautifulSoup

phones = []  # A list to store all the phone numbers
path = 'yourpath'  # This is the folder that stores all your html files
# Be careful: you might need a full path such as C:\Users\Niche\Desktop\htmlfolder

# os.walk visits every subfolder too, so nested directories are handled
for dirpath, dirnames, filenames in os.walk(path):
    for filename in filenames:
        # Getting the full path of a particular html file
        fullpath = os.path.join(dirpath, filename)
        # Skip anything that is not an html file
        if not fullpath.endswith('.html'):
            continue
        # Then we will run BeautifulSoup to extract the contents
        with open(fullpath, encoding='utf-8') as f:
            page_soup = BeautifulSoup(f, 'html.parser')
        # Then run your code
        # grabs each field
        contactname = page_soup.findAll("td", {"itemprop": "name"})
        contactstreetaddress = page_soup.findAll("td", {"itemprop": "streetAddress"})
        contactpostalcode = page_soup.findAll("td", {"itemprop": "postalCode"})
        contactaddressregion = page_soup.findAll("td", {"itemprop": "addressRegion"})
        contactaddresscountry = page_soup.findAll("td", {"itemprop": "addressCountry"})
        contactfax = page_soup.findAll("td", {"itemprop": "faxNumber"})
        contactemail = page_soup.findAll("td", {"itemprop": "email"})
        contactphone = page_soup.findAll("td", {"itemprop": "telephone"})
        contacturl = page_soup.findAll("a", {"itemprop": "url"})

        # Outputs as text without tags
        Company = contactname[0].text
        Address = contactstreetaddress[0].text
        Zip = contactpostalcode[0].text
        Region = contactaddressregion[0].text
        Country = contactaddresscountry[0].text
        Fax = contactfax[0].text
        Email = contactemail[0].text
        Phone = contactphone[0].text
        URL = contacturl[0].text
        # Here you might want to use a dictionary or a list --
        # for example, append Phone to the list called phones
        phones.append(Phone)

The code is a bit messy, but it walks through every possible folder (even if your main folder contains other folders), looks for files ending in .html, and then opens them.
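If all 12,000 files sit flat in a single directory with no nesting, the glob module the question already imports can replace the directory walk and filter on the extension in one step. A sketch, using a throwaway temp directory with made-up file names:

```python
import glob
import os
import tempfile

# Build a throwaway directory containing made-up example files
demo_dir = tempfile.mkdtemp()
for name in ("1-SUNRISE-FARMS.html", "notes.txt"):
    open(os.path.join(demo_dir, name), "w").close()

# The *.html pattern matches only the HTML files, skipping everything else
html_files = sorted(glob.glob(os.path.join(demo_dir, "*.html")))
print([os.path.basename(p) for p in html_files])  # → ['1-SUNRISE-FARMS.html']
```

Each path in `html_files` can then be fed straight into the `open(...)`/BeautifulSoup step above.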

I would suggest a dictionary keyed by company, assuming the company names are unique. A set of parallel lists also works well, since your values will stay in matching order. I'm not very good with dictionaries, so I can't give you more advice there. I hope this answers your question.
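The CSV output the question asks for can be produced with the standard library's csv module. A sketch, assuming you collect one dict per parsed page (the two rows below are made-up examples, and the field names are the ones used in the code above, trimmed to three columns for brevity):

```python
import csv

# One dict per parsed company page -- these rows are invented sample data
rows = [
    {"Company": "Sunrise Farms", "Phone": "+1 555 0100", "Email": "info@example.com"},
    {"Company": "Marthomi Allergy Free Foods GmbH", "Phone": "+49 555 0199", "Email": "kontakt@example.com"},
]

with open("companies.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Company", "Phone", "Email"])
    writer.writeheader()    # first line: the column names
    writer.writerows(rows)  # one CSV line per company
```

Using DictWriter rather than hand-joining strings with `', '` also takes care of quoting fields that themselves contain commas, which addresses in company names and regions often do.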

P.S. Sorry the code is messy.

Edit: replaced lxml with html.parser.

Regarding python - Scraping local HTML files with Urllib and BeautifulSoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43148784/
