-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[Feature] Addition of MongoDB Atlas datastore #428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Addition of MongoDB Atlas datastore #428
Conversation
This PR contains a commit from #427. Would you please merge that one, too? |
Co-authored-by: Jib <[email protected]>
Co-authored-by: Jib <[email protected]>
Co-authored-by: Jib <[email protected]>
Co-authored-by: Jib <[email protected]>
Co-authored-by: Jib <[email protected]>
Co-authored-by: Jib <[email protected]>
b35eab0
to
3012eba
Compare
3012eba
to
f6cd748
Compare
for chunk in chunk_list: | ||
inserted_ids.append(chunk.id) | ||
documents_to_upsert.append( | ||
UpdateOne({'_id': chunk.id}, {"$set": chunk.dict()}, upsert=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I am wrong, but I have something to point out:
- The upsert will return the chunk IDs instead of the document IDs. So if I insert N documents, I will get more (or equal if they are tiny) than N IDs. Thus, the docstring isn't right.
- If the chunk is a basemodel from Pydantic, use model_dump(). I think dict is deprecated?
I do like the idea; it is a better approach without a doubt. I think it is better to return the document IDs. If the document gets split in the same way, the IDs will be the same, and everything will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other thing is, what is going to happen if a document with 5 chunks is inserted, then modified, but the new document is 2 chunks shorter? I think the old 2 chunks are still existing. So, if a chapter of a document is removed, a query that matches the old part of the document can cause a false match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WaVEV RE: return id's. I changed it to be the chunk ids because this is all that the datastore knows about. Itis a simple change to make if need be. As far as I can tell, the ids returned are not used. If there are conventions, we are glad to follow them.
RE: chunk.dict()
I inferred the usage from a previous commit of yours which had done similar: "metadata": document_chunk.metadata.dict()
. This removes the helper function that you had. MongoDB can consume the data as-is.
9212ca2
to
f6cd748
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks for adding!
This pull-request adds MongoDB as a datastore with Atlas Vector Search. Follows references closely.
docs/providers/mongodb/setup.md
examples/providers/mongodb/semantic-search.ipynb
tests/datastore/providers/mongodb_atlas/test_mongodb_datastore.py
Documentation has also been updated.
@isafulf It is recommended in the PR Checklist that the author request a review from a maintainer. Would you be so kind?