strategies for adding my own data to use in a Planetary Computer analysis? #15

erthward · 2021-12-15T00:06:28Z

erthward
Dec 15, 2021

Hi, Planetary Computer folks!

I apologize for the greenhorn question! I just want to make sure I understand my options clearly before I get started possibly using the Planetary Computer for an analysis.

The Background: I do a lot of scientific computing, and I have worked on traditional campus compute clusters for years. However, I am new to the cloud world, I am a scientific programmer rather than a trained software developer, and most of my cloud geocomputation thus far as been on Google Earth Engine. GEE is great, but it is of course pretty inflexible. For an analysis I'm currently working on I want to wrap our relatively straightforward spatial analysis in a Monte Carlo-based uncertainty analysis. GEE won't do it, and I don't have access to a 'brick and mortar' supercomputer for this, so I'm looking at options for cloud compute cluster where I can develop and execute this very parallelizable MC analysis. This at first seemed like a great excuse to start using my newly approved Planetary Computer access. However, the analysis requires the use of moderate-res (~30m and ~100m) global raster datasets that are not offered in the Planetary Computer data catalog. In GEE in the past, I have worked with datasets like this by ingesting them using a JSON manifest and getting access to the resulting asset, which GEE then chunks and parallelizes under the hood as it interprets and runs my client-side code. The Planetary Computer is a totally different paradigm though, so I cannot tell if/how I can do something analogous...

The Question(s): Is there a reasonable way to add these pretty large and currently tiled datasets to Azure somewhere, then draw on them, in combination with other data currently in the Planetary Computer data catalog, to run my analysis? Would this just be a matter of creating separate Azure storage elsewhere, choosing a cloud-appropriate storage format for my external data, converting them, uploading them there, and then figuring out code to access them within the same script where I'm also accessing Planetary Computer datasets? Or is there an easier way? Are there Planetary Computer utilities to make a workflow like this relatively quick and painless (something analogous to JSON manifest ingestion for GEE assets)? Or am I totally off-target and is this not even doable, or at least not recommendable?

As you can surely tell, things are pretty foggy for me at the moment. Thanks for any light you can help shed!

Drew

Answered by TomAugspurger

Dec 15, 2021

First, which dataset is it that you're working with? If it's a public dataset that's generally useful, we could add it to our backlog of datasets to onboard, which might be easiest for everyone.

Next, this is an area that we're looking to improve. We're hoping that someday you can just "upload your data to the Planetary Computer", and then it's treated like any other dataset in our catalog, only private to you. But we're a long ways from that, so your stuck on your own for now.

The current best option would be to

Upload your dataset to a Blob Storage container in Azure's West Europe region (where our data / compute is). Ideally in some cloud-friendly format like COGs / Zarr. Depending on…

View full answer

TomAugspurger · 2021-12-15T03:04:30Z

TomAugspurger
Dec 15, 2021

First, which dataset is it that you're working with? If it's a public dataset that's generally useful, we could add it to our backlog of datasets to onboard, which might be easiest for everyone.

Next, this is an area that we're looking to improve. We're hoping that someday you can just "upload your data to the Planetary Computer", and then it's treated like any other dataset in our catalog, only private to you. But we're a long ways from that, so your stuck on your own for now.

The current best option would be to

Upload your dataset to a Blob Storage container in Azure's West Europe region (where our data / compute is). Ideally in some cloud-friendly format like COGs / Zarr. Depending on the source, azcopy might be helpful.
Access the data from your Jupyter server / Dask workers using a SAS token

https://planetarycomputer.microsoft.com/docs/quickstarts/storage/ has a short example going through this.

One missing part here is the STAC metadata / API, which enables searching on the Planetary Computer. If your analysis is just mapping some MC over each tile, then lack of search might not be a big deal.

8 replies

TomAugspurger Jul 28, 2022

Do you have an idea of the rough number of STAC items you have? As you've noticed, setting up a STAC API would be a decent amount of effort so we should verify that you actually need it first.

If

You don't have too many items
You're accessing / searching the items using a sophisticated client like Python or R

then one option would be to just store all the STAC items as something like ndjson where all the items' JSON is in a single file. Then you could load up that single file with something like geopandas and do your spatial querying using .cx or .intersects. We don't have any examples of that, but I could put something together if it'd be helpful.

erthward Jul 28, 2022
Author

Oh, that's a great idea! Thanks! My main concern was that I don't want to be stuck using the individual tile files as the unit of iteration, but instead want to be able iterate over chunks of an arbitrary size and stream into my analysis the COG contents that overlap with each chunk, as needed. This should still allow this, right? (Sorry! Still getting used to the cloud-native paradigm. Everything's new to me!)

We definitely won't have anything on the order of the big remote sensing archives. Our data are all global datasets at different resolutions. I expect to have at least: 1 dataset of ~500 3-band rasters, 1 dataset of ~650 2-band rasters, and a few other much coarser-res datasets saved as single files. There is a chance that I'll actually wind up adding up to a few other 500-ish raster datasets. Aside from that, I may also wind up using some of the public archives that are already in the MSPC Catalog, but that's unrelated. So, let's say my private STAC will be on the order of a few thousand items? Should be pretty doable using that approach I imagine, yeah? If so then I'm happy to start hacking up a prototype and let you know if I have any questions.

Thanks again!

TomAugspurger Jul 28, 2022

This should still allow this, right?

I believe so. From the point of view of a downstream consumer of STAC items (e.g. a library like [stackstac])(https://stackstac.readthedocs.io/en/latest/), it doesn't matter whether the items come from an API or are loaded from a static file system.

erthward Jul 28, 2022
Author

That makes sense to me. I'll come back at this tomorrow and see what I can figure out. (I was honestly surprised that building a STAC API was the only approach I was coming across, because I found that I was able to use eodag to host the STAC JSON file on my local server and then do spatiotemporal search on it, so I was expecting to be able to do something similar on MSPC. (Perhaps I should actually even just do that exact thing? Just felt hackier to do that on MSPC than on my laptop, for some reason.)

TomAugspurger Jul 29, 2022

I haven't used EODAG before, so I'm not sure what it's doing internally. But yes, starting your own server (either on the Hub or somewhere else in / outside of Azure) would be an option. If you're starting the eodag server on the Hub, you'll be able to access it from the Hub at localhost and externally using jupyter-server-proxy.

coupster74 · 2022-03-29T14:17:20Z

coupster74
Mar 29, 2022

Tom,
I'm new to MPC, and have started working with it for a project I am supporting. I had the exact same question and found this post quite useful - thank you!

I'd like to follow up on the STAC comment. We have a large volume of proprietary satellite data (in a few different types) and am receiving new files daily. We are considering architecture options and have prioritized MCP. We've installed the stand-alone DASK environment and am we are storing our files on blob storage, so we are making good progress.

I'm now investigating how to expose the data sets best to our data scientists. I like the idea of a STAC service for our datasets, and we are interested in still accessing MCPs. Do you know if STAC aggregation is possible - where our STAC server could aggregate results with your STAC server? I am just starting to investigate STAC so you might have recommendations on potential approaches for what I'm trying to accomplish.

1 reply

TomAugspurger Mar 30, 2022

Do you know if STAC aggregation is possible - where our STAC server could aggregate results with your STAC server?

I'm guessing it's possible (where your STAC service proxies requests for specific collections to the Planetary Computer) but I don't think it would be 100% straightforward. I'm not aware of any STAC backends that do this transparently, but https://stacindex.org/ecosystem might be work browsing.

So most likely your users will be writing queries against multiple STAC APIs, and will have to know the right API to use for each dataset.

As I mentioned earlier, we are exploring the other side: rather than you setting up a private API and proxying our public catalog, we would be able to host private collections. But that's still some way off.

erthward · 2022-08-04T01:49:00Z

erthward
Aug 4, 2022
Author

Hey again, Tom. So, I've been messing around quite a bit and I am feeling uncertain if I'm making way too complicated, particularly since I'm new to STAC and cloud geocompute. I figured maybe I can run my current workflow past you and you can point out where I'm going astray? The approach I'm currently messing with is: 1. collapse all STAC JSON files into a single ndjson file (check; wrote a simple Py script for this) 2. read that ndjson file into a geopandas GeoDataFrame (check; simple to do, and I've got code to tweak/reformat columns as needed, e.g., datetime) 3. apply a function to the rows of that dataframe that converts each row into a pystac.ItemCollection of pystac.Item objects, setting asset hrefs to their URLs in Azure storage with individually-created SAS tokens concatenated onto the ends (this all appears to work fine, but I really have no idea if I'm even close to the proper way to handle creating and assigning SAS tokens to allow the Azure blobs to be streamed into the analysis later on...) 4. use stackstac.stack to create an xarray lazy DataArray from that (this succeeds, and the DataArray mostly makes sense, although the time axis is looking strange at the moment...) 5. for the moment, I'm just doing a simple stand-in analysis using xarray operations, then calling .compute() to save the resulting small AOI to a data variable, then trying to call .plot.imshow() on the result My expectation, if this all works out as planned, is that dask will stream the data from the COGs referenced in the ItemCollection's assets into my result, as required based on the COGs' spatial overlap with my AOI, then plot the map. Instead, the dask.diagnostics.ProgressBar() is never even created, the Jupyter notebook's Python kernel is suspended for a good while, then it eventually dies without giving any result. So... First attempt: a glorious dumpster fire. :) Would appreciate any thoughts or suggestions if you have them! Thanks again, Tom! Drew On Fri, Jul 29, 2022 at 10:50 AM Drew Terasaki Hart < ***@***.***> wrote:

…

Great! I'll mess around with these couple approaches, then let you know if I have any questions. Thanks again for the quick responses and constant help, Tom! It means a lot. Drew On Fri, Jul 29, 2022 at 7:55 AM Tom Augspurger ***@***.***> wrote: > I haven't used EODAG before, so I'm not sure what it's doing internally. > But yes, starting your own server (either on the Hub or somewhere else in / > outside of Azure) would be an option. If you're starting the eodag server > on the Hub, you'll be able to access it *from* the Hub at localhost and > externally using jupyter-server-proxy > <https://planetarycomputer.microsoft.com/docs/overview/environment/#accessing-other-processes-and-services> > . > > — > Reply to this email directly, view it on GitHub > <#15 (reply in thread)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ABOSTJR5I5FDAFE22VQCILLVWPBCRANCNFSM5KCIOR7Q> > . > You are receiving this because you authored the thread.Message ID: > ***@***.*** > com> >

0 replies

erthward · 2022-08-04T22:13:06Z

erthward
Aug 4, 2022
Author

Hey again, Tom. Sorry for the back-to-back messages, but I figured I ought to keep this thread up to date to keep track as I'm working through things (but if that's annoying then let me know and I'll stop). I have made some progress both simplifying the workflow I outlined above and honing in on specific problems. I'll note them here, in case you have any input/advice on the remaining sticking points: - By making the Azure container temporarily publicly readable (it's all publicly available data that I'm using anyhow...) I was able to determine that some of the problems I was getting earlier (e.g., unable to run [*collection.get_all_items()]) appear to be related to invalid SAS tokens. I can work on figuring this out down the line, but definitely resolvable and not the critical path at the moment. - Once I resolved that then I was able to just create a pystac.Collection instance from file, call collection.make_all_asset_hrefs_absolute method, then feed [*collection.get_all_items()] into pystac.ItemCollection into stackstac.stack to get a lazy xarray.DataArray with correct-appearing metadata and dims, etc. - The weirdness of the 'time' dim is just because all my data is contemporaneous but 'time' is the default first dim output by stackstac.stack (rtfd!), so I can use the mean method with skipna=True or just stackstac.mosaic to collapse the time dim. - At this point, however, whether or not I first set up a dask cluster, and whether or not the cluster is autoscaling, I see the same behavior that I was not expecting: Even subsetting just a very small section of that DataArray (even a section that is all inside a single raw COG), as soon as I call mosaic, compute, plot.imshow, etc. then the python kernel dies (and if I monitor the cluster on the dashboard nothing ever even happens). No error is ever thrown, and identical code works fine using an ItemCollection pulled from the STAC API of some publicly-available remote sensing archive, even MODIS stored on AWS somewhere. The total size of DataArray that's causing the problem is much larger (~5 PiB) than the example MODIS DataArray I mentioned (~100s GiB), so I suppose that that must be the root of the problem, but because it's lazy my naive understanding was that the summary information held in memory still wouldn't pose a serious memory issue. If so then perhaps I just need to be actually creating and processing the DataArray itself in chunks in parallel? I'll continue chipping away at this. It's a learning process. Meanwhile, any input appreciated, of course. And thanks again! On Wed, Aug 3, 2022 at 9:48 PM Drew Terasaki Hart < ***@***.***> wrote:

…

Hey again, Tom. So, I've been messing around quite a bit and I am feeling uncertain if I'm making way too complicated, particularly since I'm new to STAC and cloud geocompute. I figured maybe I can run my current workflow past you and you can point out where I'm going astray? The approach I'm currently messing with is: 1. collapse all STAC JSON files into a single ndjson file (check; wrote a simple Py script for this) 2. read that ndjson file into a geopandas GeoDataFrame (check; simple to do, and I've got code to tweak/reformat columns as needed, e.g., datetime) 3. apply a function to the rows of that dataframe that converts each row into a pystac.ItemCollection of pystac.Item objects, setting asset hrefs to their URLs in Azure storage with individually-created SAS tokens concatenated onto the ends (this all appears to work fine, but I really have no idea if I'm even close to the proper way to handle creating and assigning SAS tokens to allow the Azure blobs to be streamed into the analysis later on...) 4. use stackstac.stack to create an xarray lazy DataArray from that (this succeeds, and the DataArray mostly makes sense, although the time axis is looking strange at the moment...) 5. for the moment, I'm just doing a simple stand-in analysis using xarray operations, then calling .compute() to save the resulting small AOI to a data variable, then trying to call .plot.imshow() on the result My expectation, if this all works out as planned, is that dask will stream the data from the COGs referenced in the ItemCollection's assets into my result, as required based on the COGs' spatial overlap with my AOI, then plot the map. Instead, the dask.diagnostics.ProgressBar() is never even created, the Jupyter notebook's Python kernel is suspended for a good while, then it eventually dies without giving any result. So... First attempt: a glorious dumpster fire. :) Would appreciate any thoughts or suggestions if you have them! Thanks again, Tom! Drew On Fri, Jul 29, 2022 at 10:50 AM Drew Terasaki Hart < ***@***.***> wrote: > Great! I'll mess around with these couple approaches, then let you know > if I have any questions. > > Thanks again for the quick responses and constant help, Tom! It means a > lot. > > Drew > > > > On Fri, Jul 29, 2022 at 7:55 AM Tom Augspurger ***@***.***> > wrote: > >> I haven't used EODAG before, so I'm not sure what it's doing internally. >> But yes, starting your own server (either on the Hub or somewhere else in / >> outside of Azure) would be an option. If you're starting the eodag server >> on the Hub, you'll be able to access it *from* the Hub at localhost and >> externally using jupyter-server-proxy >> <https://planetarycomputer.microsoft.com/docs/overview/environment/#accessing-other-processes-and-services> >> . >> >> — >> Reply to this email directly, view it on GitHub >> <#15 (reply in thread)>, >> or unsubscribe >> <https://github.com/notifications/unsubscribe-auth/ABOSTJR5I5FDAFE22VQCILLVWPBCRANCNFSM5KCIOR7Q> >> . >> You are receiving this because you authored the thread.Message ID: >> ***@***.*** >> .com> >> >

1 reply

TomAugspurger Aug 5, 2022

That all makes sense. A few comments / debugging tips:

invalid SAS tokens.

Good to hear that setting it to public is an option for now. The simplest way to do things with a private storage account is probably to make a read / list SAS token for the entire container, rather than per-blob SAS tokens.

collection.make_all_asset_hrefs_absolute

When you write the STAC items, are the asset hrefs relative or absolute? I'd recommend writing them as absolute (or calling make_all_asset_hrefs_absolute) before writing, so that you only have to do it once.

then the python kernel dies

I'd recommend a few things:

Try loading a single asset with GDAL (maybe even just gdalinfo /vsicurl/https://.... And if that works then rioxarray.open_rasterion(...)).
If loading the asset works with GDAL / rasterio, verify that the STAC metadata matches what you're seeing from GDAL.
Try loading a single item with stackstac, see if that dies.
Try running the stackstac / filter / load locally, Somewhere where you can view the stderr printed by GDAL.

If possible, you could share a STAC item and I could take a look at what's going on with the kernel dying when I have a chance.

erthward · 2022-10-11T07:55:38Z

erthward
Oct 11, 2022
Author

Great! I'll mess around with these couple approaches, then let you know if I have any questions. Thanks again for the quick responses and constant help, Tom! It means a lot. Drew

…

On Fri, Jul 29, 2022 at 7:55 AM Tom Augspurger ***@***.***> wrote: I haven't used EODAG before, so I'm not sure what it's doing internally. But yes, starting your own server (either on the Hub or somewhere else in / outside of Azure) would be an option. If you're starting the eodag server on the Hub, you'll be able to access it *from* the Hub at localhost and externally using jupyter-server-proxy <https://planetarycomputer.microsoft.com/docs/overview/environment/#accessing-other-processes-and-services> . — Reply to this email directly, view it on GitHub <#15 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABOSTJR5I5FDAFE22VQCILLVWPBCRANCNFSM5KCIOR7Q> . You are receiving this because you authored the thread.Message ID: ***@***.*** com>

0 replies

erthward · 2022-10-11T07:56:42Z

erthward
Oct 11, 2022
Author

Hey, Tom. Thanks! These are great suggestions. I somehow missed the fact that I should call make_all_asset_hrefs_absolute before writing out, so I'll go back and build that into my take-2 workflow. The debugging steps were very helpful! I had been trying to run locally too, but hadn't broken it down to reading a length-1 ItemCollection into stackstac.stack. That helped me confirm that everything was working fine at smaller scale and that GDAL and STAC metadata were indeed aligned, and eventually I traced the problem back to a poor-performance default dask chunksize being assigned to the DataArray; I reread the dask docs and wound up realizing that I needed a larger chunk size and chunk size that's a perfect factor of my file size, to reduce overhead and unnecessary computation. I trialed 10K chunks (for my global set of 30m data saved in 40k x 40k files) and it worked like a charm! So, I should be all good to go back, start cleaning this up, and then start working on analysis! Thanks again! Will be sure to let you know if any other roadblocks come up! Drew

…

On Fri, Aug 5, 2022 at 8:03 AM Tom Augspurger ***@***.***> wrote: That all makes sense. A few comments / debugging tips: 1. invalid SAS tokens. Good to hear that setting it to public is an option for now. The simplest way to do things with a private storage account is probably to make a read / list SAS token for the entire container, rather than per-blob SAS tokens. 1. collection.make_all_asset_hrefs_absolute When you write the STAC items, are the asset hrefs relative or absolute? I'd recommend writing them as absolute (or calling make_all_asset_hrefs_absolute) before writing, so that you only have to do it once. 1. then the python kernel dies I'd recommend a few things: - Try loading a single asset with GDAL (maybe even just gdalinfo /vsicurl/https://.... And if that works then rioxarray.open_rasterion(...)). - If loading the asset works with GDAL / rasterio, verify that the STAC metadata matches what you're seeing from GDAL. - Try loading a single item with stackstac, see if that dies. - Try running the stackstac / filter / load locally, Somewhere where you can view the stderr printed by GDAL. — Reply to this email directly, view it on GitHub <#15 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABOSTJWLXCYE5CCBXOLRA33VXT7IVANCNFSM5KCIOR7Q> . You are receiving this because you authored the thread.Message ID: ***@***.*** com>

0 replies

Shashank0510 · 2023-02-09T09:24:39Z

Shashank0510
Feb 9, 2023

Hi Tom,
Not sure if this right thread to ask this question, but just wanted to check is there a way I can upload a directory to Microsoft Planetary Computer Hub ? Thanks

1 reply

TomAugspurger Feb 9, 2023

I think https://jupyterlab.readthedocs.io/en/stable/user/files.html#uploading-and-downloading covers that.

Sbrowneo · 2023-09-13T19:46:49Z

Sbrowneo
Sep 13, 2023

Hi Tom - I'm in a similar situation of OP in terms of needing to use external datasets in MCP. I wasn't able to find any other links or resources aside from this on the subject. This seems really difficult. You mentioned that there was a long-term goal of making this process easier - I was just wondering if we're any closer to making that a reality since October 2022, and if the advice you gave OP still stands as the best method.

Thank you!

3 replies

erthward Sep 13, 2023
Author

Hi there!

I'm glad you brought this up, because it never occurred to me to circle back!

I hadn't seen any explicit tutorials on this anywhere, and it took a while to figure out all the steps, so I wound up writing it all up as a resource. It's embedded as a part of a larger set of notes on cloud computing options for spatial analysis (though the MSPC page is the only one with any thorough content; the others are placeholders). The page specifically focused on getting set up to use your own datasets is this one. (Note that it assumes very little prior research computing experience, so it might feel like it 'overexplains' at times.)

Hope that's helpful, and let me know if you have any questions, or if you wind up finding changes you want me to make, or wanting to contribute.

Drew

Sbrowneo Sep 13, 2023

That's fantastically helpful, thank you!! So glad you took the time to write this up!

TomAugspurger Sep 14, 2023

I can't seem to link directly to the header, but under "What's next for Azure Space" at https://azure.microsoft.com/en-us/blog/accelerating-the-pace-of-innovation-with-azure-space-and-our-partners/, we mention that we're working on a service to help out with this kind of thing. Not ready yet, but stay tuned!

erthward · 2023-09-14T01:10:13Z

erthward
Sep 14, 2023
Author

So glad to be of help! Please let me know if there's anything else I can help with. Happy to pay it forward.

…

On Thu, Sep 14, 2023 at 6:54 AM Sbrowneo ***@***.***> wrote: That's fantastically helpful, thank you!! So glad you took the time to write this up! — Reply to this email directly, view it on GitHub <#15 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABOSTJXQMQBJNUTTIELINRDX2ITSZANCNFSM5KCIOR7Q> . You are receiving this because you authored the thread.Message ID: ***@***.*** com>

0 replies

strategies for adding my own data to use in a Planetary Computer analysis? #15

Uh oh!

Replies: 9 comments · 14 replies

Uh oh!

Uh oh!

Uh oh!

erthward Jul 28, 2022 Author

Uh oh!

Uh oh!

erthward Jul 28, 2022 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

erthward Aug 4, 2022 Author

Uh oh!

erthward Aug 4, 2022 Author

Uh oh!

Uh oh!

Uh oh!

erthward Oct 11, 2022 Author

Uh oh!

erthward Oct 11, 2022 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

erthward Sep 13, 2023 Author

Uh oh!

Uh oh!

Uh oh!

erthward Sep 14, 2023 Author

Replies: 9 comments 14 replies

erthward Jul 28, 2022
Author

erthward Jul 28, 2022
Author

erthward
Aug 4, 2022
Author

erthward
Aug 4, 2022
Author

erthward
Oct 11, 2022
Author

erthward
Oct 11, 2022
Author

erthward Sep 13, 2023
Author

erthward
Sep 14, 2023
Author