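I'm running a script that scrapes a website for textbook information, and the script itself works. However, when it writes to the JSON file, I get duplicate results. I'm trying to figure out how to remove the duplicates from the JSON file. Here's my code: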
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = ['https://open.bccampus.ca/find-open-textbooks/',
        'https://open.bccampus.ca/find-open-textbooks/?start=10']

data = []

# opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    # html parsing
    page_soup = soup(page_html, "html.parser")

    # grabs info for each textbook
    containers = page_soup.findAll("h4")

    for container in containers:
        item = {}
        item['type'] = "Textbook"
        item['title'] = container.parent.a.text
        item['author'] = container.nextSibling.findNextSibling(text=True)
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + container.parent.a["href"]
        item['source'] = "BC Campus"
        data.append(item)  # add the item to the list

with open("./json/bc.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
Here is a sample of the JSON output:
{
    "type": "Textbook",
    "title": "Exploring Movie Construction and Production",
    "author": " John Reich, SUNY Genesee Community College",
    "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
    "source": "BC Campus"
}, {
    "type": "Textbook",
    "title": "Exploring Movie Construction and Production",
    "author": " John Reich, SUNY Genesee Community College",
    "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
    "source": "BC Campus"
}, {
    "type": "Textbook",
    "title": "Project Management",
    "author": " Adrienne Watt",
    "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
    "source": "BC Campus"
}, {
    "type": "Textbook",
    "title": "Project Management",
    "author": " Adrienne Watt",
    "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
    "source": "BC Campus"
}
Best answer
Figured it out. Here's the solution in case anyone else runs into this issue:
textbook_list = []

# keep only the first occurrence of each item
for item in data:
    if item not in textbook_list:
        textbook_list.append(item)

with open("./json/bc.json", "w") as writeJSON:
    json.dump(textbook_list, writeJSON, ensure_ascii=False)
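The `item not in textbook_list` check compares each new item against every item already kept, so the work grows quadratically with the number of records. That is fine for a couple of pages of results. For larger scrapes, one possible variation is to track a set of already-seen keys instead. The sketch below assumes that the `link` field uniquely identifies a textbook, which the sample output suggests but does not guarantee. The helper name `dedupe_by_key` and the sample records are made up for illustration.

```python
import json


def dedupe_by_key(records, key="link"):
    """Keep the first record for each distinct value of `key`, preserving order."""
    seen = set()
    unique = []
    for record in records:
        marker = record[key]
        if marker not in seen:  # set lookup instead of scanning a list
            seen.add(marker)
            unique.append(record)
    return unique


# Hypothetical records shaped like the scraped items; the link values are placeholders.
data = [
    {"type": "Textbook", "title": "Project Management", "link": "?uuid=a"},
    {"type": "Textbook", "title": "Project Management", "link": "?uuid=a"},
    {"type": "Textbook", "title": "Exploring Movie Construction and Production", "link": "?uuid=b"},
]

textbook_list = dedupe_by_key(data)
print(json.dumps(textbook_list, ensure_ascii=False, indent=2))
```

The resulting `textbook_list` can be passed to `json.dump` exactly as in the answer above. If no single field is a reliable identifier, the list-based check from the answer remains the simpler choice, since it compares whole dictionaries.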
Regarding python - removing duplicate entries from a JSON file - BeautifulSoup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50160675/