python - Removing duplicate entries from a JSON file - BeautifulSoup

Tags: python json beautifulsoup

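I am running a script that scrapes a website for textbook information, and the script itself works. However, when it writes to the JSON file, I get duplicate results. I am trying to figure out how to remove the duplicates from the JSON file. Here is my code: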

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = ['https://open.bccampus.ca/find-open-textbooks/', 
'https://open.bccampus.ca/find-open-textbooks/?start=10']

data = []
#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.findAll("h4")

    for container in containers:
        item = {}
        item['type'] = "Textbook"
        item['title'] = container.parent.a.text
        item['author'] = container.nextSibling.findNextSibling(text=True)
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + container.parent.a["href"]
        item['source'] = "BC Campus"
        data.append(item) # add the item to the list

with open("./json/bc.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

Here is a sample of the JSON output:

{
"type": "Textbook",
"title": "Exploring Movie Construction and Production",
"author": " John Reich, SUNY Genesee Community College",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
"source": "BC Campus"
}, {
"type": "Textbook",
"title": "Exploring Movie Construction and Production",
"author": " John Reich, SUNY Genesee Community College",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
"source": "BC Campus"
}, {
"type": "Textbook",
"title": "Project Management",
"author": " Adrienne Watt",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
"source": "BC Campus"
}, {
"type": "Textbook",
"title": "Project Management",
"author": " Adrienne Watt",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
"source": "BC Campus"
}

Best answer

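Figured it out. Here is the solution in case anyone else runs into this problem. It builds a new list and appends each item only if an equal dictionary is not already in it: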

textbook_list = []
for item in data:
    if item not in textbook_list:
        textbook_list.append(item)

with open("./json/bc.json", "w") as writeJSON:
    json.dump(textbook_list, writeJSON, ensure_ascii=False)
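
The loop above tests each item against a growing list, which works well for a few pages of results but rescans the whole list on every `in` check. As a rough sketch of an alternative, a set of already-seen markers can do the same job in a single pass. This assumes the `link` field uniquely identifies a textbook, which the original post does not state. The helper names `dedupe_by_key` and `dedupe_exact` and the sample records are made up for illustration.

```python
import json


def dedupe_by_key(items, key="link"):
    """Return items without duplicates, keeping the first occurrence.

    Two items count as duplicates when they share the same value for `key`
    (assumption: that value identifies a textbook).
    """
    seen = set()
    unique = []
    for item in items:
        marker = item[key]
        if marker not in seen:
            seen.add(marker)
            unique.append(item)
    return unique


def dedupe_exact(items):
    """Remove items identical in every field, keeping the first occurrence."""
    seen = set()
    unique = []
    for item in items:
        # Dicts are not hashable, so a canonical JSON string serves as the marker.
        marker = json.dumps(item, sort_keys=True)
        if marker not in seen:
            seen.add(marker)
            unique.append(item)
    return unique


# Hypothetical sample records, shaped like the scraped items.
sample = [
    {"title": "Project Management", "link": "https://example.com/book/1"},
    {"title": "Project Management", "link": "https://example.com/book/1"},
    {"title": "Exploring Movie Construction and Production", "link": "https://example.com/book/2"},
]

print(len(dedupe_by_key(sample)))  # 2
print(len(dedupe_exact(sample)))   # 2
```

In the script, passing `dedupe_by_key(data)` to `json.dump` in place of `textbook_list` would be one way to use it. `dedupe_exact` matches the behavior of the loop above more closely, since it only drops records that are equal in every field.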

For "python - Removing duplicate entries from a JSON file - BeautifulSoup", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50160675/
