python - BeautifulSoup: filling in missing information with "N/A" does not work

Tags: python csv beautifulsoup

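I am practicing my web scraping skills on the following website: "http://web.californiacraftbeer.com/Brewery-Member"

My code so far is below. I seem to be getting the correct company count, but I am getting duplicate rows in the CSV file, which I think happens whenever a company is missing information. In several parts of my code I try to detect missing information and replace it with the text "N/A", but it isn't working. I suspect the problem may be related to the zip() function, but I'm not sure how to fix it.

Any help is greatly appreciated!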

"""
Grabs brewery name, contact person, phone number, website address, and email address 
for each brewery listed on the website.
"""

import requests, csv
from bs4 import BeautifulSoup

url = "http://web.californiacraftbeer.com/Brewery-Member"
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
each_company = soup.find_all("div", {"class": "ListingResults_All_CONTAINER ListingResults_Level3_CONTAINER"})
error_msg = "N/A" 

def scraper():
    """Grabs information and writes to CSV"""
    print("Running...")
    results = []
    count = 0

    for info in each_company:
        try:
            company_name = info.find_all("span", itemprop="name")
        except Exception as e:
            company_name = "N/A"
        try:
            contact_name = info.find_all("div", {"class": "ListingResults_Level3_MAINCONTACT"})
        except Exception as e:
            contact_name = "N/A"
        try:
            phone_number = info.find_all("div", {"class": "ListingResults_Level3_PHONE1"})
        except Exception as e:
            phone_number = "N/A"
        try:
            website = info.find_all("span", {"class": "ListingResults_Level3_VISITSITE"})
        except Exception as e:
            website = "N/A"

        for company, contact, phone, site in zip(company_name, contact_name, phone_number, website):
            count += 1
            print("Grabbing {0} ({1})...".format(company.text, count))
            newrow = []
            try:
                newrow.append(company.text)
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append(contact.text)
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append(phone.text)
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append(site.find('a')['href'])
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append("info@" + company.text.replace(" ", "").lower() + ".com")
            except Exception as e:
                newrow.append(error_msg)
        results.append(newrow)

    print("Done")
    outFile = open("brewery.csv", "w")
    out = csv.writer(outFile, delimiter=',',quoting=csv.QUOTE_ALL, lineterminator='\n')
    out.writerows(results)
    outFile.close()

def main():
    """Runs web scraper"""
    scraper()

if __name__ == '__main__':
    main()

Best answer

From the bs4 docs:

"If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None"

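So, for example, when company_name = info.find_all("span", itemprop="name") runs but matches nothing, it does not throw an exception, the except branch is never reached, and "N/A" is never set.

In that case, you need to check whether company_name is an empty list: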
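
The difference is easy to check on a tiny fragment. The snippet below is a minimal sketch: the HTML string and the "listing" class are made up for illustration and are not taken from the brewery site, and it uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up HTML: a listing that has a name but no phone number.
html = '<div class="listing"><span itemprop="name">Example Brewing</span></div>'
listing = BeautifulSoup(html, "html.parser").find("div", class_="listing")

names = listing.find_all("span", itemprop="name")
phones = listing.find_all("div", {"class": "ListingResults_Level3_PHONE1"})
phone = listing.find("div", {"class": "ListingResults_Level3_PHONE1"})

print(len(names))      # number of name matches
print(phones == [])    # True: find_all() gives an empty list, no exception
print(phone is None)   # True: find() gives None, no exception
```

Neither call raises, so a try/except wrapped around them never reaches its except branch.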

if not company_name:
    company_name = "N/A"
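
Going a step further, one option is to look up each field once per listing with find() and fall back to "N/A" when it returns None. This is only a sketch under assumptions: the helper names text_or_na and build_rows and the sample HTML are hypothetical, and this is not the code from the original answer. It builds each row directly instead of going through zip(), because zip() stops at its shortest argument and therefore yields nothing when any one of the lists is empty.

```python
from bs4 import BeautifulSoup

ERROR_MSG = "N/A"

# Made-up sample data: the second listing has no phone number.
SAMPLE_HTML = """
<div class="ListingResults_All_CONTAINER ListingResults_Level3_CONTAINER">
  <span itemprop="name">Example Brewing</span>
  <div class="ListingResults_Level3_PHONE1">555-0100</div>
</div>
<div class="ListingResults_All_CONTAINER ListingResults_Level3_CONTAINER">
  <span itemprop="name">Sample Ales</span>
</div>
"""


def text_or_na(tag):
    """Return the tag's stripped text, or "N/A" if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else ERROR_MSG


def build_rows(soup):
    """Build one [name, phone] row per listing container (hypothetical helper)."""
    rows = []
    for info in soup.find_all("div", class_="ListingResults_Level3_CONTAINER"):
        name = info.find("span", itemprop="name")
        phone = info.find("div", class_="ListingResults_Level3_PHONE1")
        # Append inside the loop so every listing yields exactly one row.
        rows.append([text_or_na(name), text_or_na(phone)])
    return rows


rows = build_rows(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(rows)
```

In the question's code, results.append(newrow) sits outside the inner for loop, so when zip() yields nothing the row left over from the previous listing appears to be appended again, which would explain the duplicated rows. Appending once per listing, as above, avoids that.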

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/42267706/
