strategies for adding my own data to use in a Planetary Computer analysis? #15
-
Hi, Planetary Computer folks! I apologize for the greenhorn question! I just want to make sure I understand my options clearly before I get started possibly using the Planetary Computer for an analysis.

The Background: I do a lot of scientific computing, and I have worked on traditional campus compute clusters for years. However, I am new to the cloud world, I am a scientific programmer rather than a trained software developer, and most of my cloud geocomputation thus far has been on Google Earth Engine. GEE is great, but it is of course pretty inflexible. For an analysis I'm currently working on, I want to wrap our relatively straightforward spatial analysis in a Monte Carlo-based uncertainty analysis. GEE won't do it, and I don't have access to a 'brick and mortar' supercomputer for this, so I'm looking at options for a cloud compute cluster where I can develop and execute this highly parallelizable MC analysis. This at first seemed like a great excuse to start using my newly approved Planetary Computer access. However, the analysis requires moderate-resolution (~30 m and ~100 m) global raster datasets that are not offered in the Planetary Computer data catalog. In GEE, I have worked with datasets like this by ingesting them with a JSON manifest and getting access to the resulting asset, which GEE then chunks and parallelizes under the hood as it interprets and runs my client-side code. The Planetary Computer is a totally different paradigm, though, so I cannot tell if/how I can do something analogous.

The Question(s): Is there a reasonable way to add these fairly large, currently tiled datasets to Azure somewhere, then draw on them, in combination with other data already in the Planetary Computer data catalog, to run my analysis? Would this just be a matter of creating separate Azure storage elsewhere, choosing a cloud-appropriate storage format for my external data, converting and uploading them, and then figuring out code to access them within the same script where I'm also accessing Planetary Computer datasets? Or is there an easier way? Are there Planetary Computer utilities to make a workflow like this relatively quick and painless (something analogous to JSON manifest ingestion for GEE assets)? Or am I totally off-target, and is this not even doable, or at least not recommendable?

As you can surely tell, things are pretty foggy for me at the moment. Thanks for any light you can help shed!

Drew
-
First, which dataset is it that you're working with? If it's a public dataset that's generally useful, we could add it to our backlog of datasets to onboard, which might be easiest for everyone.

Next, this is an area that we're looking to improve. We're hoping that someday you can just "upload your data to the Planetary Computer" and have it treated like any other dataset in our catalog, only private to you. But we're a long way from that, so you're stuck on your own for now.

The current best option would be to put your data in your own Azure Blob Storage account and read it from the Hub alongside the Planetary Computer datasets; https://planetarycomputer.microsoft.com/docs/quickstarts/storage/ has a short example going through this. One missing piece here is the STAC metadata / API, which is what enables searching on the Planetary Computer. If your analysis is just mapping some MC routine over each tile, the lack of search might not be a big deal.
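For a concrete picture of that pattern, here's a minimal sketch of reading one of your own COGs from Blob Storage next to a signed Planetary Computer asset. The storage account, container, blob name, collection, and asset key below are placeholders, and the SAS token is assumed to already exist:

```python
import planetary_computer
import pystac_client
import rioxarray

# Your own data: a COG in your storage account, read over HTTPS with a SAS token.
# (Placeholder account/container/blob names; "my_sas" is assumed to be a valid
# read SAS token for the container.)
my_sas = "sv=..."
my_cog = rioxarray.open_rasterio(
    f"https://myaccount.blob.core.windows.net/mydata/tile_001.tif?{my_sas}",
    chunks=4096,  # lazy, dask-backed
)

# Planetary Computer data: search the STAC API and sign the items for blob access.
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
item = next(catalog.search(collections=["sentinel-2-l2a"], max_items=1).items())
pc_band = rioxarray.open_rasterio(item.assets["B04"].href, chunks=4096)
```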
-
Tom, I'd like to follow up on the STAC comment. We have a large volume of proprietary satellite data (of a few different types) and are receiving new files daily. We are considering architecture options and have prioritized the Planetary Computer. We've installed the stand-alone Dask environment and are storing our files on blob storage, so we are making good progress. I'm now investigating how best to expose the datasets to our data scientists. I like the idea of a STAC service for our datasets, and we are still interested in accessing the Planetary Computer's catalog. Do you know if STAC aggregation is possible, where our STAC server could aggregate results with your STAC server? I am just starting to investigate STAC, so you might have recommendations on potential approaches for what I'm trying to accomplish.
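In case it helps while you investigate: true server-side federation would need some aggregating service in front of both APIs, but a simple client-side version is just searching both endpoints and combining the results. A rough sketch, with a hypothetical private endpoint and collection IDs:

```python
import planetary_computer
import pystac_client

aoi = {"type": "Point", "coordinates": [-105.0, 40.0]}  # placeholder AOI

# The public Planetary Computer STAC API (items signed for blob access).
pc = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

# A hypothetical private STAC API fronting your own blob storage.
ours = pystac_client.Client.open("https://stac.internal.example.com/api/v1")

pc_items = list(pc.search(collections=["sentinel-2-l2a"], intersects=aoi, max_items=10).items())
our_items = list(ours.search(collections=["proprietary-sat"], intersects=aoi, max_items=10).items())

# Combine and hand off to stackstac / odc-stac / your own tooling.
items = pc_items + our_items
```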
-
Hey again, Tom.
So, I've been messing around quite a bit and I'm starting to feel like I'm making this way too complicated, particularly since I'm new to STAC and cloud geocompute. I figured maybe I could run my current workflow past you and you could point out where I'm going astray? The approach I'm currently messing with is:
1. Collapse all STAC JSON files into a single ndjson file (check; wrote a simple Python script for this).
2. Read that ndjson file into a geopandas GeoDataFrame (check; simple to do, and I've got code to tweak/reformat columns as needed, e.g., datetime).
3. Apply a function to the rows of that dataframe that converts each row into a pystac.Item (collected into a pystac.ItemCollection), setting each asset href to its URL in Azure storage with an individually created SAS token concatenated onto the end (this all appears to work fine, but I really have no idea whether I'm even close to the proper way to create and assign SAS tokens so that the Azure blobs can be streamed into the analysis later on...).
4. Use stackstac.stack to create a lazy xarray DataArray from that (this succeeds, and the DataArray mostly makes sense, although the time axis is looking strange at the moment...).
5. For the moment, just run a simple stand-in analysis using xarray operations, then call .compute() to save the resulting small AOI to a variable, then try to call .plot.imshow() on the result.
My expectation, if this all works out as planned, is that dask will stream the data from the COGs referenced in the ItemCollection's assets into my result, as required based on the COGs' spatial overlap with my AOI, then plot the map. Instead, the dask.diagnostics.ProgressBar() is never even created, the Jupyter notebook's Python kernel is suspended for a good while, and then it eventually dies without giving any result.
So... first attempt: a glorious dumpster fire. :) Would appreciate any thoughts or suggestions if you have them!
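In case it helps to see it concretely, steps 1-4 boil down to something like this, skipping the GeoDataFrame intermediate for brevity; sign_href() is just a stand-in for however the SAS tokens get attached, and the file name is illustrative:

```python
import json

import pystac
import stackstac

def sign_href(href: str, sas_token: str) -> str:
    # Stand-in for however the SAS token gets attached to each blob URL.
    return f"{href}?{sas_token}"

sas_token = "sv=..."  # placeholder

# Step 1's output: one STAC Item JSON object per line.
items = []
with open("items.ndjson") as f:
    for line in f:
        item = pystac.Item.from_dict(json.loads(line))
        for asset in item.assets.values():
            asset.href = sign_href(asset.href, sas_token)
        items.append(item)

item_collection = pystac.ItemCollection(items)

# Step 4: lazy, dask-backed DataArray; relies on proj:* metadata in the items
# (otherwise epsg/resolution/bounds have to be passed explicitly).
da = stackstac.stack(item_collection)
```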
Thanks again, Tom!
Drew
-
Hey again, Tom.
Sorry for the back-to-back messages, but I figured I ought to keep this thread up to date as I work through things (if that's annoying, let me know and I'll stop). I have made some progress, both simplifying the workflow I outlined above and homing in on specific problems. I'll note them here, in case you have any input/advice on the remaining sticking points:
- By making the Azure container temporarily publicly readable (it's all publicly available data that I'm using anyhow...), I was able to determine that some of the problems I was getting earlier (e.g., being unable to run [*collection.get_all_items()]) appear to be related to invalid SAS tokens. I can work on figuring this out down the line; it's definitely resolvable and not on the critical path at the moment.
- Once I resolved that, I was able to just create a pystac.Collection instance from file, call its make_all_asset_hrefs_absolute method, then feed [*collection.get_all_items()] into a pystac.ItemCollection and on into stackstac.stack to get a lazy xarray.DataArray with correct-looking metadata, dims, etc.
- The weirdness of the 'time' dim is just because all my data is contemporaneous but 'time' is the default first dim output by stackstac.stack (rtfd!), so I can use the mean method with skipna=True, or just stackstac.mosaic, to collapse the time dim.
- At this point, however, whether or not I first set up a dask cluster, and whether or not the cluster is autoscaling, I see the same unexpected behavior: even when subsetting just a very small section of that DataArray (even a section that sits entirely inside a single raw COG), as soon as I call mosaic, compute, plot.imshow, etc., the Python kernel dies (and if I monitor the cluster on the dashboard, nothing ever happens). No error is ever thrown, and identical code works fine using an ItemCollection pulled from the STAC API of a publicly available remote sensing archive, even MODIS stored on AWS somewhere. The total size of the DataArray causing the problem (~5 PiB) is much larger than the example MODIS DataArray I mentioned (hundreds of GiB), so I suppose that must be the root of the problem; but because it's lazy, my naive understanding was that the summary information held in memory still shouldn't pose a serious memory issue. If so, perhaps I just need to be creating and processing the DataArray itself in chunks, in parallel?
I'll continue chipping away at this. It's a learning process. Meanwhile, any input is appreciated, of course. And thanks again!
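For reference, the simplified flow is roughly the following (placeholder path and AOI bounds; no SAS signing while the container is public):

```python
import pystac
import stackstac

# Collection written out alongside the item JSONs (placeholder path).
collection = pystac.Collection.from_file("collection.json")
collection.make_all_asset_hrefs_absolute()

items = list(collection.get_all_items())
da = stackstac.stack(items)  # lazy DataArray with dims (time, band, y, x)

# Everything is contemporaneous, so collapse the default "time" dim
# (mosaics along the first axis by default).
mosaic = stackstac.mosaic(da)

# Stand-in analysis on a small AOI (placeholder bounds), then compute and plot.
aoi = mosaic.isel(band=0).sel(x=slice(10.0, 10.5), y=slice(45.5, 45.0))
aoi.compute().plot.imshow()
```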
-
Great! I'll mess around with these couple of approaches, then let you know if I have any questions.
Thanks again for the quick responses and constant help, Tom! It means a lot.
Drew
…On Fri, Jul 29, 2022 at 7:55 AM Tom Augspurger wrote:
I haven't used EODAG before, so I'm not sure what it's doing internally. But yes, starting your own server (either on the Hub or somewhere else in / outside of Azure) would be an option. If you're starting the eodag server on the Hub, you'll be able to access it *from* the Hub at localhost, and externally using jupyter-server-proxy (https://planetarycomputer.microsoft.com/docs/overview/environment/#accessing-other-processes-and-services).
-
Hey, Tom.
Thanks! These are great suggestions. I somehow missed the fact that I should call make_all_asset_hrefs_absolute before writing out, so I'll go back and build that into my take-2 workflow.
The debugging steps were very helpful! I had been trying to run locally too, but hadn't broken it down to reading a length-1 ItemCollection into stackstac.stack. That helped me confirm that everything was working fine at smaller scale and that the GDAL and STAC metadata were indeed aligned, and eventually I traced the problem back to a poor-performing default dask chunk size being assigned to the DataArray. I reread the dask docs and realized I needed a larger chunk size, ideally one that divides my file size evenly, to reduce overhead and unnecessary computation. I trialed 10,000-pixel chunks (for my global set of 30 m data saved in 40,000 x 40,000 pixel files) and it worked like a charm!
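For the record, the fix was a one-argument change; the numbers below match my 40,000 x 40,000 pixel files and would need tuning for other tile sizes:

```python
import stackstac

# 10,000-pixel chunks divide each 40,000 x 40,000 file into a 4 x 4 grid of
# dask chunks, instead of stackstac's much smaller default, which was drowning
# the scheduler in tiny tasks.
da = stackstac.stack(items, chunksize=10000)
```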
So, I should be all good to go back, start cleaning this up, and then start working on the analysis!
Thanks again! Will be sure to let you know if any other roadblocks come up!
Drew
…On Fri, Aug 5, 2022 at 8:03 AM Tom Augspurger wrote:
That all makes sense. A few comments / debugging tips:
1. Invalid SAS tokens: good to hear that setting the container to public is an option for now. The simplest way to do things with a private storage account is probably to make a read / list SAS token for the entire container, rather than per-blob SAS tokens.
2. collection.make_all_asset_hrefs_absolute: when you write the STAC items, are the asset hrefs relative or absolute? I'd recommend writing them as absolute (or calling make_all_asset_hrefs_absolute before writing), so that you only have to do it once.
3. "Then the python kernel dies": I'd recommend a few things:
   - Try loading a single asset with GDAL (maybe even just gdalinfo /vsicurl/https://..., and if that works then rioxarray.open_rasterio(...)).
   - If loading the asset works with GDAL / rasterio, verify that the STAC metadata matches what you're seeing from GDAL.
   - Try loading a single item with stackstac, and see if that dies.
   - Try running the stackstac / filter / load locally, somewhere you can view the stderr printed by GDAL.
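(For anyone else who hits this: here's roughly what the container-level SAS and the single-asset debugging steps above look like in code. The account, container, key, and blob URL are placeholders, and `items` is the list of STAC Items being debugged.)

```python
from datetime import datetime, timedelta, timezone

import rioxarray
import stackstac
from azure.storage.blob import ContainerSasPermissions, generate_container_sas

# Container-level read/list SAS token (one token for everything, instead of
# per-blob tokens); assumes you have the storage account key.
sas = generate_container_sas(
    account_name="myaccount",
    container_name="mydata",
    account_key="<account-key>",
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=7),
)

# 1. Can GDAL/rasterio read a single asset at all?  (Same idea as running
#    `gdalinfo /vsicurl/https://...` on the command line.)
href = f"https://myaccount.blob.core.windows.net/mydata/tile_001.tif?{sas}"
single = rioxarray.open_rasterio(href)
print(single.rio.crs, single.rio.transform(), single.shape)

# 2. Does the STAC metadata (proj:epsg, proj:shape, proj:transform) agree with
#    what GDAL reports above?

# 3. Does a single item stack cleanly before trying the whole collection?
da = stackstac.stack([items[0]])
print(da)
```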
-
Hi Tom,
-
Hi Tom - I'm in a similar situation to OP, in that I need to use external datasets with the Planetary Computer. I wasn't able to find any other links or resources on the subject aside from this thread. This seems really difficult. You mentioned a long-term goal of making this process easier; I was just wondering whether we're any closer to that being a reality since October 2022, and whether the advice you gave OP still stands as the best method. Thank you!
-
So glad to be of help! Please let me know if there's anything else I can help with. Happy to pay it forward.
…On Thu, Sep 14, 2023 at 6:54 AM Sbrowneo wrote:
That's fantastically helpful, thank you!! So glad you took the time to write this up!