[Feature] Addition of MongoDB Atlas datastore #428

caseyclements · 2024-03-01T18:45:19Z

This pull request adds MongoDB as a datastore, using Atlas Vector Search. It follows the references closely.

Detailed setup: docs/providers/mongodb/setup.md
Example notebook: examples/providers/mongodb/semantic-search.ipynb
Integration tests: tests/datastore/providers/mongodb_atlas/test_mongodb_datastore.py

Documentation has also been updated.

@isafulf It is recommended in the PR Checklist that the author request a review from a maintainer. Would you be so kind?

caseyclements · 2024-03-01T18:47:16Z

This PR contains a commit from #427. Would you please merge that one, too?


WaVEV · 2024-03-09T21:12:47Z

datastore/providers/mongodb_atlas_datastore.py

+            for chunk in chunk_list:
+                inserted_ids.append(chunk.id)
+                documents_to_upsert.append(
+                        UpdateOne({'_id': chunk.id}, {"$set": chunk.dict()}, upsert=True)


Correct me if I am wrong, but I have something to point out:

The upsert returns the chunk IDs instead of the document IDs. So if I insert N documents, I get back at least N IDs (exactly N only if every document is small enough to fit in a single chunk). The docstring is therefore inaccurate.

If the chunk is a Pydantic BaseModel, use model_dump(). I think dict() is deprecated?

I do like the idea; it is a better approach without a doubt. I think it is better to return the document IDs. If the document gets split in the same way, the IDs will be the same, and everything will work.
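To make the two return conventions concrete, here is a minimal sketch. It assumes chunks arrive grouped per document in a dict and that chunk IDs follow a hypothetical `<document_id>_<index>` pattern; neither detail is confirmed by the diff above, and a plain dataclass stands in for the real chunk model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for the plugin's chunk model."""
    id: str
    text: str


def ids_per_chunk(chunks: Dict[str, List[Chunk]]) -> List[str]:
    """Return one ID per stored chunk, as the loop in the diff collects them."""
    return [chunk.id for chunk_list in chunks.values() for chunk in chunk_list]


def ids_per_document(chunks: Dict[str, List[Chunk]]) -> List[str]:
    """Return one ID per input document, the alternative discussed here."""
    return list(chunks.keys())


# Two documents: one split into two chunks, one small enough for a single chunk.
chunks = {
    "doc-a": [Chunk("doc-a_0", "first part"), Chunk("doc-a_1", "second part")],
    "doc-b": [Chunk("doc-b_0", "short document")],
}
chunk_ids = ids_per_chunk(chunks)
document_ids = ids_per_document(chunks)
```

With two input documents, the per-chunk variant yields three IDs while the per-document variant yields two, which is the mismatch with the docstring described above.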

The other thing: what happens if a document with 5 chunks is inserted and then modified so that the new version is 2 chunks shorter? I think the 2 old chunks still exist. So if a chapter of a document is removed, a query that matches the removed part can still return a false match.
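The stale-chunk scenario can be reproduced without a database. The sketch below uses a plain dict as a stand-in for the collection and assumes, hypothetically, that each stored chunk has an `_id` of the form `<document_id>_<index>` and carries a `document_id` field. It compares an upsert-only write with a delete-then-insert replacement.

```python
from typing import Dict, List

# In-memory stand-in for the collection: chunk _id -> stored fields.
Collection = Dict[str, dict]


def upsert_only(coll: Collection, doc_id: str, texts: List[str]) -> None:
    """Upsert each chunk by _id; chunks from an earlier, longer version remain."""
    for i, text in enumerate(texts):
        coll[f"{doc_id}_{i}"] = {"document_id": doc_id, "text": text}


def replace_document(coll: Collection, doc_id: str, texts: List[str]) -> None:
    """Delete every chunk of the document first, then write the new chunks."""
    stale = [cid for cid, fields in coll.items() if fields["document_id"] == doc_id]
    for cid in stale:
        del coll[cid]
    upsert_only(coll, doc_id, texts)


# Upsert-only: the 5-chunk version is followed by a 3-chunk version.
naive: Collection = {}
upsert_only(naive, "doc", ["a", "b", "c", "d", "e"])
upsert_only(naive, "doc", ["a", "b", "c"])

# Delete-then-insert: same two versions, stale chunks are removed.
safe: Collection = {}
replace_document(safe, "doc", ["a", "b", "c", "d", "e"])
replace_document(safe, "doc", ["a", "b", "c"])
```

After both runs, `naive` still holds `doc_3` and `doc_4`, while `safe` holds only the three current chunks. Against a real collection, the delete step would correspond to something like a `delete_many` on a document-ID field before the bulk write; whether the stored chunks expose such a field is an assumption here.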

caseyclements · 2024-03-11

@WaVEV RE: returned IDs. I changed it to return the chunk IDs because that is all the datastore knows about. It is a simple change to make if need be. As far as I can tell, the returned IDs are not used. If there are conventions, we are glad to follow them.

RE: chunk.dict(). I inferred the usage from a previous commit of yours, which did something similar: "metadata": document_chunk.metadata.dict(). This removes the helper function that you had. MongoDB can consume the data as-is.
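For context on the dict() question: Pydantic v2 deprecates `.dict()` in favor of `.model_dump()`, while v1 only has `.dict()`. One way to stay compatible with both is a small helper like the sketch below. The helper name `to_mongo_dict` and the two stand-in classes are hypothetical and not part of this PR; the stand-ins only mimic the method each Pydantic version exposes, so the example runs without Pydantic installed.

```python
from typing import Any, Dict


def to_mongo_dict(model: Any) -> Dict[str, Any]:
    """Serialize a Pydantic-style model, preferring model_dump() when present."""
    if hasattr(model, "model_dump"):  # Pydantic v2 style
        return model.model_dump()
    return model.dict()  # Pydantic v1 style


class V1StyleChunk:
    """Hypothetical stand-in exposing only the v1-style dict()."""

    def dict(self) -> Dict[str, Any]:
        return {"id": "doc_0", "text": "hello"}


class V2StyleChunk:
    """Hypothetical stand-in exposing the v2-style model_dump()."""

    def model_dump(self) -> Dict[str, Any]:
        return {"id": "doc_0", "text": "hello"}


v1_payload = to_mongo_dict(V1StyleChunk())
v2_payload = to_mongo_dict(V2StyleChunk())
```

Both calls produce the same plain dict, which could then be passed as the `$set` payload of the upsert.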

prakul · 2024-03-25T17:55:43Z

@ianmobbs @isafulf Can we please get a review on this PR?

isafulf

Looks great, thanks for adding!

docker compose file.

cbd739a

WaVEV and others added 28 commits March 5, 2024 14:06

search example.

183eaef

mongodb atlas datastore.

674cbeb

refactor, docstring and notebook cleaning.

5f18aaa

docstring.

e8870fb

fix attributes names.

3b51cf3

Functional tests.

d8d3f63

Example adjustement.

2cb1658

setup.md

767e16a

remove some useless comments.

ba973ee

wrong docker image.

1c9d8a9

Minor documentation fixes.

6ceccd0

Update example.

f9c208d

refactor.

76f0bc1

default as a default collection.

fa31996

TODO resolved.

60ccaff

Refactor delete.

8759c78

fix readme and setup.md

ed72267

add warning when delete without criteria.

6f8e98e

rename private function.

8e77bb3

replace pymongo to motor and fix integration test.

00aa4e7

Refactor code and adjust tests

c579764

wait for assert function.

d2f7373

Update docs/providers/mongodb_atlas/setup.md

36fee26

Co-authored-by: Jib

Update datastore/providers/mongodb_atlas_datastore.py

4988ed0

Co-authored-by: Jib

Increase oversampling factor to 10.

103923c

Update tests/datastore/providers/mongodb_atlas/test_mongodb_datastore.py

8561c34

Co-authored-by: Jib

Update tests/datastore/providers/mongodb_atlas/test_mongodb_datastore.py

ed6d0fd

Co-authored-by: Jib

Update datastore/providers/mongodb_atlas_datastore.py

e973e0e

Co-authored-by: Jib

WaVEV and others added 15 commits March 5, 2024 14:09

Version added.

a96cae0

Update datastore/providers/mongodb_atlas_datastore.py

53482ee

Co-authored-by: Jib

Removed _atlas from folder name to keep it simple and self-consistent

f6a7b2b

Expanded setup.md

faf1b98

Fixed a couple typos in docstrings

b4a2058

Add optional EMBEDDING_DIMENSION to get_embedding

ec0bb2c

Fixed typo in kwarg

ee788f6

Extended setup.md

291611f

Edits to environment variable table

a5f0b53

Added authentication token descriptions

81de99c

Removed hardcoded vector size

962cb5d

Added semantic search example

74b137f

Added instructions to integration tests

abe64a4

Cleanup

cb37a7f

Removed pathname from example.

ed38349

caseyclements force-pushed the feature/mongodb-datastore branch from b35eab0 to 3012eba Compare March 9, 2024 18:55

caseyclements added 2 commits March 9, 2024 14:54

Override DataStore.upsert in MongoDBAtlasDataStore to increase performance.

609d021

upsert now returns ids of chunks, which is what each datastore document is

f6cd748

caseyclements force-pushed the feature/mongodb-datastore branch from 3012eba to f6cd748 Compare March 9, 2024 20:38

WaVEV reviewed Mar 9, 2024

caseyclements force-pushed the feature/mongodb-datastore branch from 9212ca2 to f6cd748 Compare March 13, 2024 21:03

caseyclements added 2 commits March 16, 2024 00:13

Added full integration test

dd192b3

test_integration now uses FastAPI TestClient

ecf1746

Retries query until response contains number requested

8c8bffa

isafulf approved these changes Apr 24, 2024

isafulf merged commit b28ddce into openai:main Apr 24, 2024

This was referenced Apr 25, 2024

Add optional EMBEDDING_DIMENSION to get_embedding #427

Closed

[Bugfix] Fixes broken link in README. Improves naming consistency #435

Open

caseyclements mentioned this pull request May 8, 2024

services.openai.get_embeddings does not expose the dimensions kwarg of openai.Embedding.create #426

Closed


