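python - Check if a record exists in MongoDB

I am building a MongoDB database and want to avoid duplicate entries. At the moment I do it like this (a document is inserted only after checking that the entry does not already exist):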

from pymongo import MongoClient
import pandas as pd

connection = MongoClient()
db = connection.mydb
collection = db.mycollection

data = pd.read_csv("data/myfile.csv", parse_dates=[2, 5])

for i in range(len(data)):
    # Fields that define a duplicate; columns are addressed by position
    # (the id is assumed to be in column 0)
    key = {"id":     data.iloc[i, 0],
           "date1":  data.iloc[i, 2].strftime("%Y-%m-%d"),
           "date2":  data.iloc[i, 5].strftime("%Y-%m-%d"),
           "number": int(data.iloc[i, 6]),
           "type":   data.iloc[i, 7]}
    # Insert only when no matching document exists yet
    if collection.count_documents(key) == 0:
        collection.insert_one(...)  # here goes what I'd like to insert

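This works, but it already has serious performance problems (with only about 100 MB of data), because each find() call seems to slow things down significantly.

Is there a way to speed this up? Or am I doing something fundamentally wrong? I need to avoid duplicates on a specific set of fields only, not on all fields (that is, there is also a "number2" field that may differ, but if all the other fields match I still want to treat the document as a duplicate).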

Best answer

You can build a unique index on the fields you are searching on (mongo shell syntax):

db.mycollection.createIndex({id: 1, date1: 1, date2: 1, number: 1, type: 1}, {unique: true});

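Then catch the constraint-violation exception that is raised when a duplicate is inserted (and ignore it where appropriate).

In general this improves performance, because the duplicate check is done through an index lookup.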

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/16889666/
