I'm very new to Python and have spent the last two weeks writing the code below from scratch to scrape local files. I've put in close to a hundred hours learning as much as I can about Python, versioning, and importing packages (e.g. lxml, bs4, requests, urllib, os, glob, etc.).
I'm hopelessly stuck on the first part: taking 12,000 oddly named HTML files in one directory and loading and parsing each of them with BeautifulSoup. I want to get all of that data into a CSV file, or just printed to output so I can copy it into a file via the clipboard.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
#THIS LOCAL FILE WORKS PERFECTLY. I HAVE 12,000 HTML FILES IN THIS DIRECTORY TO PROCESS. HOW?
#my_url = 'file://127.0.0.1/C:\\My Web Sites\\BioFachURLS\\www.organic-bio.com\\en\\company\\1-SUNRISE-FARMS.html'
my_url = 'http://www.organic-bio.com/en/company/23694-MARTHOMI-ALLERGY-FREE-FOODS-GMBH'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grabs each field
contactname = page_soup.findAll("td", {"itemprop": "name"})
contactstreetaddress = page_soup.findAll("td", {"itemprop": "streetAddress"})
contactpostalcode = page_soup.findAll("td", {"itemprop": "postalCode"})
contactaddressregion = page_soup.findAll("td", {"itemprop": "addressRegion"})
contactaddresscountry = page_soup.findAll("td", {"itemprop": "addressCountry"})
contactfax = page_soup.findAll("td", {"itemprop": "faxNumber"})
contactemail = page_soup.findAll("td", {"itemprop": "email"})
contactphone = page_soup.findAll("td", {"itemprop": "telephone"})
contacturl = page_soup.findAll("a", {"itemprop": "url"})
#Outputs as text without tags
Company = contactname[0].text
Address = contactstreetaddress[0].text
Zip = contactpostalcode[0].text
Region = contactaddressregion[0].text
Country = contactaddresscountry[0].text
Fax = contactfax[0].text
Email = contactemail[0].text
Phone = contactphone[0].text
URL = contacturl[0].text
#Prints with comma delimiters
print(Company + ', ' + Address + ', ' + Zip + ', ' + Region + ', ' + Country + ', ' + Fax + ', ' + Email + ', ' + Phone + ', ' + URL)
Best answer
I've worked with folders containing large numbers of files before, so I can offer some help.
We'll start with a for loop over the files in the folder.
import os
from bs4 import BeautifulSoup

phones = []  # A list to store all the phone numbers
path = 'yourpath'  # The folder that stores all your HTML files
# Be careful: you might need a full path such as C:\Users\Niche\Desktop\htmlfolder

for root, dirs, files in os.walk(path):  # Walks the folder and any subfolders
    for filename in files:
        # Build the full path of this particular HTML file
        fullpath = os.path.join(root, filename)
        # Skip anything that is not an HTML file
        if not fullpath.endswith('.html'):
            continue
        # Run BeautifulSoup to parse the contents
        with open(fullpath) as f:
            page_soup = BeautifulSoup(f, 'html.parser')
        # Then run your code
        # grabs each field
        contactname = page_soup.findAll("td", {"itemprop": "name"})
        contactstreetaddress = page_soup.findAll("td", {"itemprop": "streetAddress"})
        contactpostalcode = page_soup.findAll("td", {"itemprop": "postalCode"})
        contactaddressregion = page_soup.findAll("td", {"itemprop": "addressRegion"})
        contactaddresscountry = page_soup.findAll("td", {"itemprop": "addressCountry"})
        contactfax = page_soup.findAll("td", {"itemprop": "faxNumber"})
        contactemail = page_soup.findAll("td", {"itemprop": "email"})
        contactphone = page_soup.findAll("td", {"itemprop": "telephone"})
        contacturl = page_soup.findAll("a", {"itemprop": "url"})
        # Outputs as text without tags
        Company = contactname[0].text
        Address = contactstreetaddress[0].text
        Zip = contactpostalcode[0].text
        Region = contactaddressregion[0].text
        Country = contactaddresscountry[0].text
        Fax = contactfax[0].text
        Email = contactemail[0].text
        Phone = contactphone[0].text
        URL = contacturl[0].text
        # Here you might want to use a dictionary or a list;
        # for example, append Phone to the list called phones
        phones.append(Phone)
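Since the original goal was a CSV file, here is a minimal sketch (not part of the answer above) of writing the collected fields with Python's built-in csv module. The filename output.csv and the sample row are hypothetical; in practice you would append one row per parsed HTML file inside the loop.

```python
import csv

# Hypothetical sample data: one row of the fields grabbed per file
rows = [
    ['Sunrise Farms', '123 Main St', '94105', 'CA', 'USA',
     '555-0100', 'info@example.com', '555-0101', 'example.com'],
]

with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # Header row matching the fields grabbed above
    writer.writerow(['Company', 'Address', 'Zip', 'Region', 'Country',
                     'Fax', 'Email', 'Phone', 'URL'])
    writer.writerows(rows)
```

Using csv.writer handles quoting for you, so commas inside an address won't break the columns the way plain string concatenation with ', ' does.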
The code is a bit messy, but it walks every possible folder (even if your main folder contains other folders), checks for the .html extension, and then opens the file.
I'd suggest using a dictionary keyed by company, assuming the company names are all different. A set of parallel lists also works well, since your values will stay in matching order. I'm not very good with dictionaries, so I can't give you more advice there. I hope this answers your question.
P.S. Sorry the code is messy.
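As a sketch of the dictionary suggestion (the field names and sample values here are made up for illustration), each company name can map to a small dict of that company's fields:

```python
# Minimal sketch of the dictionary idea: key on company name
# (assumes company names are unique). Sample values are hypothetical.
companies = {}

def add_company(name, phone, email):
    """Store one company's fields under its name."""
    companies[name] = {'phone': phone, 'email': email}

add_company('Sunrise Farms', '555-0100', 'info@example.com')
print(companies['Sunrise Farms']['phone'])  # prints 555-0100
```

Inside the file loop you would call add_company once per parsed file, which keeps every field for a company together instead of spread across parallel lists.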
Edit: fixed by replacing lxml with html.parser
Regarding python - local HTML file scraping with Urllib and BeautifulSoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43148784/