Skip to content

[Feature] Addition of MongoDB Atlas datastore #428

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 55 commits into from
Apr 24, 2024

Conversation

caseyclements
Copy link
Contributor

This pull-request adds MongoDB as a datastore with Atlas Vector Search. Follows references closely.

  • Detailed setup: docs/providers/mongodb/setup.md
  • Example notebook: examples/providers/mongodb/semantic-search.ipynb
  • Integration tests: tests/datastore/providers/mongodb_atlas/test_mongodb_datastore.py

Documentation has also been updated.

@isafulf It is recommended in the PR Checklist that the author request a review from a maintainer. Would you be so kind?

@caseyclements
Copy link
Contributor Author

This PR contains a commit from #427. Would you please merge that one, too?

@caseyclements caseyclements force-pushed the feature/mongodb-datastore branch from b35eab0 to 3012eba Compare March 9, 2024 18:55
@caseyclements caseyclements force-pushed the feature/mongodb-datastore branch from 3012eba to f6cd748 Compare March 9, 2024 20:38
for chunk in chunk_list:
inserted_ids.append(chunk.id)
documents_to_upsert.append(
UpdateOne({'_id': chunk.id}, {"$set": chunk.dict()}, upsert=True)
Copy link
Contributor

@WaVEV WaVEV Mar 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I am wrong, but I have something to point out:

  1. The upsert will return the chunk IDs instead of the document IDs. So if I insert N documents, I will get more (or equal if they are tiny) than N IDs. Thus, the docstring isn't right.
  2. If the chunk is a basemodel from Pydantic, use model_dump(). I think dict is deprecated?

I do like the idea; it is a better approach without a doubt. I think it is better to return the document IDs. If the document gets split in the same way, the IDs will be the same, and everything will work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other thing is, what is going to happen if a document with 5 chunks is inserted, then modified, but the new document is 2 chunks shorter? I think the old 2 chunks are still existing. So, if a chapter of a document is removed, a query that matches the old part of the document can cause a false match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WaVEV RE: return id's. I changed it to be the chunk ids because this is all that the datastore knows about. Itis a simple change to make if need be. As far as I can tell, the ids returned are not used. If there are conventions, we are glad to follow them.

RE: chunk.dict() I inferred the usage from a previous commit of yours which had done similar: "metadata": document_chunk.metadata.dict(). This removes the helper function that you had. MongoDB can consume the data as-is.

@caseyclements caseyclements force-pushed the feature/mongodb-datastore branch from 9212ca2 to f6cd748 Compare March 13, 2024 21:03
@prakul
Copy link

prakul commented Mar 25, 2024

@ianmobbs @isafulf can we please get a review on this PR

Copy link
Collaborator

@isafulf isafulf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for adding!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants