python - Is there a better way to scrape this data?

Tags: python csv web-scraping beautifulsoup

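For work, I was asked to create a spreadsheet containing the names and addresses of all allopathic medical schools in the United States. Being new to Python, I thought this would be a perfect occasion to try web scraping. Although I eventually wrote a program that returns the data I need, I know there is a better way to do it, because I had to go into Excel and manually delete some extraneous characters (for example: ", ], [). I'm just wondering whether there is a better way to write this code so that I get what I need, minus the extraneous characters.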

Edit: I have also attached an image of the CSV file that was created, to show the extraneous characters I mean.

from bs4 import BeautifulSoup
import requests
import csv  

link = "https://members.aamc.org/eweb/DynamicPage.aspx?site=AAMC&webcode=AAMCOrgSearchResult&orgtype=Medical%20School" # noqa
# link to the site we want to scrape from

page_response = requests.get(link)
# fetching the content using the requests library

soup = BeautifulSoup(page_response.text, "html.parser")
# Calling BeautifulSoup in order to parse our document

data = []
# Empty list for the first scrape. We only get one column with many rows. 
# We still have the line break tags here </br>
for tr in soup.find_all('tr', {'valign': 'top'}):
    values = [td.get_text('</b>', strip=True) for td in tr.find_all('td')]
    data.append(values)

data2 = []
# New list that we'll use to have name on index i, address on index i+1
for i in data:
    test = list(str(i).split('</b>'))
    # Using the line breaks to our advantage. 
    name = test[0].strip("['")
    '''Here we are saying that the name of the school is the first element
       before the first line break'''

    addy = test[1:]
    # The address is what comes after this first line break
    data2.append(name)
    data2.append(addy)
    # Append the name of the school and address to our new list.

school_name = data2[::2]
# Making a new list that consists of the school name
school_address = data2[1::2]
# Another list that consists of the school's address.

with open("Medschooltest.csv", 'w', encoding='utf-8') as toWrite:
    writer = csv.writer(toWrite)
    writer.writerows(zip(school_name, school_address))
    '''Zip the two together, making a 2-column table with the school's name
       and its address'''

print("CSV Completed!")

(Image: the created CSV file)

Best answer

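It seems that applying conditional statements and string manipulation can do the trick. I think the following script will get you really close to what you want.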

from bs4 import BeautifulSoup
import requests
import csv

link = "https://members.aamc.org/eweb/DynamicPage.aspx?site=AAMC&webcode=AAMCOrgSearchResult&orgtype=Medical%20School" # noqa

res = requests.get(link)
soup = BeautifulSoup(res.text, "html.parser")

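with open("membersInfo.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Address"])

    for table in soup.find_all('table', class_='bodyTXT'):
        # Join the text children of the first cell, skipping bare newlines
        # and tags such as <br> whose .string is None.
        items = ', '.join(item.string for item in table.select_one('td')
                          if item.string is not None and item.string != "\n")
        # The school name is everything before the first comma;
        # the address is whatever follows it.
        name, _, address = items.partition(",")
        writer.writerow([name.strip(), address.strip(", ")])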
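
The filtering and splitting done in the loop above can be tried without network access. Below is a minimal sketch, assuming the cell's children yield values like the ones in `sample` (text, bare newlines, and None for tags such as <br>). The sample values and the helper name `split_name_address` are made up for illustration and are not part of the original answer.

```python
def split_name_address(fragments):
    """Join non-empty text fragments and split off the first one as the name.

    `fragments` stands in for the `.string` values of a cell's children:
    real text, bare newlines, or None for tags without a single string.
    """
    # Conditional part: drop None and whitespace-only fragments.
    kept = [f.strip() for f in fragments if f is not None and f.strip()]
    if not kept:
        return "", ""
    # String part: first fragment is the name, the rest form the address.
    return kept[0], ", ".join(kept[1:])


# Hypothetical fragments, not real scraped data.
sample = ["\n", "Example Medical College", None,
          "1 Example Street", None, "Springfield, XX 00000", "\n"]
name, address = split_name_address(sample)
print(name)
print(address)
```

Taking the first fragment as the name, rather than cutting at the first comma, keeps a name intact if it happens to contain a comma itself. Whether the name really arrives as its own fragment depends on the page markup, so this is worth checking against the live HTML.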

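As for where the stray quotes and brackets in the question's output come from: the question's code calls `str(i)` on a whole list, which yields the list's printed form, brackets and quotes included, and then splits that string. The sketch below reproduces this with a made-up value (not real scraped data) and shows that splitting the string inside the list avoids the debris.

```python
import csv
import io

SEP = "</b>"  # same separator string the question passes to get_text()

# Hypothetical cell text, made up for illustration.
values = [SEP.join(["Example Medical College",
                    "1 Example Street",
                    "Springfield, XX 00000"])]

# What the question does: stringify the whole list, then split.
# The list's brackets and quotes end up inside the first and last pieces.
parts = str(values).split(SEP)
print(parts)

# Splitting the string itself leaves nothing to clean up afterwards.
clean_parts = values[0].split(SEP)
name = clean_parts[0]
address = ", ".join(clean_parts[1:])

# Write one row to an in-memory buffer instead of a file.
buffer = io.StringIO()
csv.writer(buffer).writerow([name, address])
row = buffer.getvalue()
print(row)
```

Joining the address lines into one string before writing also matters: if a list is passed as a single field, `csv.writer` converts it with `str()`, which puts the brackets back.
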
For more on "python - Is there a better way to scrape this data?", see the similar question on Stack Overflow: https://stackoverflow.com/questions/53756440/
