python - Why does PyMongo lose data when processing a large MongoDB collection, and what can I do about it?

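I am running into problems while processing a very large MongoDB collection (19 million documents).

When I simply iterate over the collection, as shown below, PyMongo seems to give up after 10,593,454 documents. The same thing happens even if I use skip(): the second half of the collection appears to be inaccessible programmatically.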

#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client['mydb']
classification_collection = db["my_classifications"]

print("Collection contains %s documents." % db.command("collstats", "my_classifications")["count"])

for ii, classification in enumerate(classification_collection.find(no_cursor_timeout=True)):
  print("%s: created at %s" % (ii, classification["created_at"]))

print("Done.")

The script initially reports:

Collection contains 19036976 documents.

Eventually the script finishes. I get no errors, and I do see the "Done." message. But the last line printed is

10593454: created at 2013-12-12 02:17:35

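All of the records I have logged over the past 2 years, that is, the most recent ones, seem to be inaccessible. Does anyone know what is going on here? What can I do about it?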

Best answer

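OK, thanks to this helpful article I found another way to page through the documents that does not seem to be affected by this "missing data"/"timeout" problem. Essentially, you use find() and limit() and rely on the collection's natural _id ordering to retrieve the documents in pages. Here is my revised code: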

#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client['mydb']
classification_collection = db["my_classifications"]

print("Collection contains %s documents." % db.command("collstats", "my_classifications")["count"])

# get the document with the smallest ID as the starting point
pageSize = 100000
first_classification = classification_collection.find_one(sort=[("_id", 1)])
completed_page_rows = 1
last_id = first_classification["_id"]

# get the next page of documents (read-ahead programming style);
# sort on _id explicitly so each page really continues after last_id
next_results = classification_collection.find({"_id": {"$gt": last_id}}, {"created_at": 1}, no_cursor_timeout=True).sort("_id", 1).limit(pageSize)

# keep getting pages until there are no more
while next_results.count() > 0:
  for classification in next_results:
    completed_page_rows += 1
    if completed_page_rows % pageSize == 0:
      print("%s (id = %s): created at %s" % (completed_page_rows, classification["_id"], classification["created_at"]))
    last_id = classification["_id"]
  next_results = classification_collection.find({"_id": {"$gt": last_id}}, {"created_at": 1}, no_cursor_timeout=True).sort("_id", 1).limit(pageSize)

print("\nDone.\n")
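
For reference, the same paging idea can be sketched as a generator that sorts on _id explicitly and does not call Cursor.count(), which was removed in PyMongo 4.0. This is only a sketch under assumptions: iter_pages_by_id is a hypothetical helper name (not part of PyMongo), the _id values are assumed to be mutually comparable, and the projection must not exclude _id.

```python
def iter_pages_by_id(collection, page_size=100000, projection=None):
    """Yield every document in ascending _id order, one page at a time.

    `collection` is assumed to behave like a pymongo Collection:
    find(filter, projection) returns a cursor offering sort() and limit().
    """
    last_id = None
    while True:
        # The first page has no lower bound; later pages resume after last_id.
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        cursor = collection.find(query, projection).sort("_id", 1).limit(page_size)
        page = list(cursor)  # holds at most page_size documents in memory
        if not page:
            break  # an empty page means the collection is exhausted
        for doc in page:
            yield doc
        last_id = page[-1]["_id"]
```

With the collection from the listing above, this would be used as `for doc in iter_pages_by_id(classification_collection, 100000, {"created_at": 1}): ...`. Each page is a separate, short query, so no single cursor has to stay open for the whole run. The cost is that one page of documents is held in memory at a time.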

I hope that writing up this solution will help others who run into this problem.

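Note: this updated listing also incorporates the suggestions made by @Takarii and @adam-comerford in the comments. I now retrieve only the fields I need (_id is included by default), and I also print the ID for reference.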
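
The progress output can also be kept separate from the paging logic. A small sketch, assuming dict-like documents that carry _id and created_at as in the question; progress_lines and PROJECTION are hypothetical names chosen for illustration.

```python
# Only created_at is requested; MongoDB still returns _id unless it is set to 0.
PROJECTION = {"created_at": 1}


def progress_lines(docs, every):
    """Yield one progress line for every `every`-th document in `docs`."""
    for n, doc in enumerate(docs, start=1):
        if n % every == 0:
            yield "%s (id = %s): created at %s" % (n, doc["_id"], doc["created_at"])
```

Any iterable of documents can be passed in, for example a cursor obtained with `classification_collection.find({}, PROJECTION)`, and each yielded line can simply be printed.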

The original question on Stack Overflow: https://stackoverflow.com/questions/35552031/
