Where is the 'catalog'? #308

Closed
ccancellieri opened this issue Nov 7, 2024 · 8 comments

Comments

@ccancellieri

ccancellieri commented Nov 7, 2024

Describe the bug

From the spec page:

https://github.com/radiantearth/stac-api-spec/blob/v1.0.0/stac-spec/catalog-spec/catalog-spec.md

The standard says:

A STAC Catalog object represents a logical group of other Catalog, [Collection](https://github.com/radiantearth/stac-api-spec/blob/v1.0.0/stac-spec/collection-spec/collection-spec.md), and [Item](https://github.com/radiantearth/stac-api-spec/blob/v1.0.0/stac-spec/item-spec/item-spec.md) objects. These Items can be linked to directly from a Catalog, or the Catalog can link to other Catalogs (often called sub-catalogs) that contain links to Collections and Items. The division of sub-catalogs is up to the implementor, but is generally done to aid the ease of online browsing by people.

A Catalog object will typically be the entry point into a STAC catalog. Their purpose is discovery: to be browsed by people or be crawled by clients to build a searchable index.

Any JSON object that contains all the required fields is a valid STAC Catalog object.

So I'm wondering why we can't have more than one catalog on a single ES instance.
I think this is a major issue, as it may also change all the REST URL paths.

Expected behavior
I would expect to have a first-level catalog, then the collections, so that we can have multiple catalogs per ES instance.

@faoFurkanMacit

+1

@jamesfisher-geo
Collaborator

jamesfisher-geo commented Nov 7, 2024

The Catalog object is available at the API landing page.

STAC API only supports a single Catalog because it is optimized for search. Basically a single STAC API is a single Catalog.

To group Items within STAC API, you should use Collections or give the Items a common field that you can filter with the /search route.
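
As a rough illustration of the second option, the sketch below builds a JSON body for `POST /search` that restricts results to some Collections and to Items sharing a common field. It assumes the server has the Filter extension enabled and accepts CQL2-JSON. The collection id `my-collection`, the field name `programme`, and the value `land-cover` are made up for the example, and the exact property path a given server expects may differ.

```python
import json


def build_search_body(collection_ids, group_field=None, group_value=None, limit=10):
    """Build a JSON body for POST /search (sketch, not tied to a specific server)."""
    body = {"collections": list(collection_ids), "limit": limit}
    if group_field is not None:
        # CQL2-JSON equality filter; only works if the Filter extension is enabled.
        body["filter-lang"] = "cql2-json"
        body["filter"] = {
            "op": "=",
            "args": [{"property": group_field}, group_value],
        }
    return body


# Hypothetical collection id, field name, and value.
body = build_search_body(["my-collection"], "programme", "land-cover")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the request body with any HTTP client.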

@jamesfisher-geo
Collaborator

jamesfisher-geo commented Nov 7, 2024

The core STAC API spec is available here: https://github.com/radiantearth/stac-api-spec/tree/release/v1.0.0/core#core

@ccancellieri
Author

Thanks, but I don't see how this can be used in a real environment where you pay for resources (in production, I mean).

> The Catalog object is available at the API landing page.
>
> STAC API only supports a single Catalog because it is optimized for search. Basically a single STAC API is a single Catalog.

This optimization is provided out of the box by ES; we should concentrate on functionality here, don't you think?
We have thousands of collections already organized in catalogs, and having them in a single bucket is not an option, especially because the first element of a STAC catalog is the catalog itself.
Is this application a proposal or something that could be used in production? I don't see how it could be used as it is, so at this point we (UN FAO) will have to fork.

> To group Items within STAC API, you should use Collections or give the Items a common field that you can filter with the /search route.

But you are also not providing a collection search?
So you are suggesting injecting a property at the item level to indicate which item belongs to which catalog. We have more than a million items; this information would be spread across all of them, overloading the ES indexes and memory, and for what?

Wouldn't it be better to have a logical level on top of this to properly implement the catalog (not only looking at the core implementation)?

@StijnCaerts
Collaborator

Most STAC API implementations (or at least the ones starting from stac-fastapi) follow this fixed-level tree structure.

API / Catalog
└── collections
    └── items

If you want to add support for nested catalogs, you'll have to make some drastic changes to the way the API is organized. This also has consequences for API extensions like transaction, collection-transaction, ...

@jonhealy1
Collaborator

I would be interested in exploring ideas on how to support multiple catalogs, or even nested catalogs, with a stac-api. Maybe if a STAC catalog could be searched in the API the same way a STAC collection is, then catalogs would just be retrieved in the same way, i.e. localhost:8000/collections/catalog1. Determining which collections belong to which catalog can be done by examining the links of a particular collection or catalog. The root / route of the STAC API would be like a catalog of all catalogs, so to speak.
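
A minimal sketch of the link-examination idea, assuming membership is expressed with the standard STAC `child` link relation. The catalog document and the `collection-a` / `collection-b` hrefs are invented for the example; only the `localhost:8000/collections/catalog1` route comes from the suggestion above, and it is not an existing endpoint.

```python
def child_hrefs(stac_object, rels=("child",)):
    """Return the hrefs of all links whose rel is in `rels`."""
    return [
        link["href"]
        for link in stac_object.get("links", [])
        if link.get("rel") in rels
    ]


# Hypothetical catalog served under the proposed route.
catalog = {
    "stac_version": "1.0.0",
    "type": "Catalog",
    "id": "catalog1",
    "description": "Example catalog grouping two collections",
    "links": [
        {"rel": "self", "href": "http://localhost:8000/collections/catalog1"},
        {"rel": "root", "href": "http://localhost:8000/"},
        {"rel": "child", "href": "http://localhost:8000/collections/collection-a"},
        {"rel": "child", "href": "http://localhost:8000/collections/collection-b"},
    ],
}

members = child_hrefs(catalog)
print(members)
```

Going the other way, from a collection to its catalog, would use the same function with `rels=("parent",)`.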

@jamesfisher-geo
Collaborator

> This optimization is provided out of the box from ES, we should concentrate here on functionalities don't you think so?

Yes. If we revisit the Catalog spec:

{
    "stac_version": "1.0.0",
    "type": "Catalog",
    "id": "20201211_223832_CS2",
    "description": "A simple catalog example",
    "links": []
}

A Collection is the same as a Catalog, except that it has additional fields for license, summaries, and extent (spatial and temporal). The extent field makes Collections better for spatio-temporal search.

{
    "stac_version": "1.0.0",
    "type": "Collection",
    "license": "ISC",
    "id": "20201211_223832_CS2",
    "description": "A simple collection example",
    "links": [],
    "extent": {},
    "summaries": {}
}
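
The difference between the two example objects can be checked mechanically. This is only a sketch over the simplified examples in this comment, not a validation against the full STAC schemas.

```python
catalog = {
    "stac_version": "1.0.0",
    "type": "Catalog",
    "id": "20201211_223832_CS2",
    "description": "A simple catalog example",
    "links": [],
}

collection = {
    "stac_version": "1.0.0",
    "type": "Collection",
    "license": "ISC",
    "id": "20201211_223832_CS2",
    "description": "A simple collection example",
    "links": [],
    "extent": {},
    "summaries": {},
}

# Fields the Collection example has on top of the Catalog example.
extra_fields = sorted(set(collection) - set(catalog))
# Fields the Catalog example has that the Collection example lacks.
missing_fields = sorted(set(catalog) - set(collection))

print(extra_fields)    # ['extent', 'license', 'summaries']
print(missing_fields)  # []
```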

I don't see how adding multi-Catalog support would increase performance. Additional Catalogs would add additional indices to ES/OS, but the documents inside the Catalog are just pointers to the Collection indices. All of the STAC data will still be stored in a single index per STAC Collection.

> We have like thousands of collections already organized in catalogs and having them in a single bucket is not an option, especially because the first element of a STAC catalog is the catalog itself.

STAC API takes advantage of horizontal scaling of ES/OS. There is a single collections index, and a separate index for each collection. You can then manage the number of nodes and shards in your cluster to make sure your ES/OS cluster can handle that volume of data.

@ccancellieri
Author

Thanks all for your thoughts

The problems I'm describing are not about performance but about how the catalog is organised and what the service costs. An ES instance in the cloud costs quite a lot, and managing a single catalog on it causes several organisational problems in our case. Mixing all the collections into one big catalog is not an option; we will need to split the service over several well-separated catalogs.

Looking at the hardcoded collections-0001, I'm thinking of creating several collections indexes and making this parametric.
I'm also considering multiple sub-modules (FastAPI sub-applications): https://fastapi.tiangolo.com/advanced/sub-applications/
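
A rough sketch of what a parametric collections index name could look like, with one index per catalog. The function name `collections_index`, the `collections-<catalog id>` naming scheme, and the catalog id `FAO-Land` are all hypothetical; this is not how the project currently names its indexes. The character check is a deliberately conservative subset of what Elasticsearch accepts in index names.

```python
import re

# Conservative pattern: lowercase letters, digits, "-" and "_",
# starting with a letter or digit.
_SAFE_ID = re.compile(r"[a-z0-9][a-z0-9_-]*")


def collections_index(catalog_id, prefix="collections"):
    """Return a per-catalog collections index name (hypothetical scheme)."""
    safe = catalog_id.strip().lower()
    if not _SAFE_ID.fullmatch(safe):
        raise ValueError(f"catalog id not usable in an index name: {catalog_id!r}")
    return f"{prefix}-{safe}"


index_name = collections_index("FAO-Land")
print(index_name)  # collections-fao-land
```

Each sub-application could then be configured with its own catalog id, so that requests under one mount path only touch that catalog's collections index.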


5 participants