Commit f4c7e51

Remote data access tutorial for CMIP6 Zarr data (#132)

Authored by weiji14
Co-authored-by: Julius Busecke <[email protected]>
Co-authored-by: dcherian <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <[email protected]>
1 parent cc61978 commit f4c7e51

File tree: 2 files changed, +333 −0 lines changed


_toc.yml (1 addition, 0 deletions)

@@ -37,6 +37,7 @@ parts:
   - file: intermediate/xarray_and_dask
   - file: intermediate/xarray_ecosystem
   - file: intermediate/hvplot
+  - file: intermediate/cmip6-cloud
   - file: data_cleaning/ice_velocity.ipynb

 - caption: Advanced

intermediate/cmip6-cloud.ipynb (332 additions, 0 deletions)
# Accessing remote data stored on the cloud

In this tutorial, we'll cover the following:
- Finding a cloud-hosted Zarr archive of CMIP6 dataset(s)
- Accessing a single CMIP6 dataset (sea surface height) remotely
- Calculating the projected sea level change in 2100 compared to 2015
```python
import gcsfs
import pandas as pd
import xarray as xr
```
## Finding cloud-native data

Cloud-native data means data that is structured for efficient querying across the network.
Typically, this means the metadata describing the entire file sits in the file's header,
or in a separate pointer file, so that there is no need to download everything first.

Quite commonly, you'll see cloud-native datasets stored on these
three object storage providers, though there are many others too:

- [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3)
- [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs)
- [Google Cloud Storage](https://cloud.google.com/storage)
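The "metadata first" idea can be sketched locally. Assuming `fsspec` is installed (it is the library underlying `gcsfs`), its in-memory filesystem mimics a flat object store; `demo-bucket` here is a made-up name standing in for a real `gs://` or `s3://` bucket:

```python
import fsspec

# An object store is essentially a flat mapping of keys to bytes. The "memory"
# filesystem mimics one locally, with no network involved.
fs = fsspec.filesystem("memory")
with fs.open("memory://demo-bucket/dataset.zarr/.zmetadata", "wb") as f:
    f.write(b'{"zarr_consolidated_format": 1}')

# Fetching one small metadata key does not touch any other object -- the same
# property that lets a reader inspect a huge Zarr store without downloading it.
print(fs.cat("memory://demo-bucket/dataset.zarr/.zmetadata"))
```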
### Getting cloud-hosted CMIP6 data

The [Coupled Model Intercomparison Project Phase 6 (CMIP6)](https://en.wikipedia.org/wiki/CMIP6#CMIP_Phase_6)
dataset is a rich archive of modelling experiments carried out to project climate change impacts.
The datasets are stored in the [Zarr](https://zarr.dev) format, and we'll go over how to access them.

Sources:
- https://esgf-node.llnl.gov/search/cmip6/
- CMIP6 data hosted on Google Cloud - https://console.cloud.google.com/marketplace/details/noaa-public/cmip6
- Pangeo/ESGF Cloud Data Access tutorial - https://pangeo-data.github.io/pangeo-cmip6-cloud/accessing_data.html
First, let's open a CSV containing the list of CMIP6 datasets available.

```python
df = pd.read_csv("https://cmip6.storage.googleapis.com/pangeo-cmip6.csv")
print(f"Number of rows: {len(df)}")
df.head()
```
Over 5 million rows! Let's filter it down to the variable and experiment
we're interested in, e.g. sea surface height.

For the `variable_id`, you can look it up by keyword at
https://docs.google.com/spreadsheets/d/1UUtoz6Ofyjlpx5LdqhKcwHFz2SGoTQV2_yekHyMfL9Y

For the `experiment_id`, download the spreadsheet from
https://github.com/ES-DOC/esdoc-docs/blob/master/cmip6/experiments/spreadsheet/experiments.xlsx,
go to the 'experiment' tab, and find the one you're interested in.

Another good place to find the right model runs is https://esgf-node.llnl.gov/search/cmip6
(once you get your head around the acronyms and short names).
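As a quick sketch of this kind of catalog filtering, here is the same `pandas` query-string pattern applied to a tiny, made-up table with the same columns as `pangeo-cmip6.csv` (the rows and `zstore` URLs below are invented for illustration):

```python
import pandas as pd

# A made-up three-row stand-in for the real multi-million-row catalog
catalog = pd.DataFrame(
    {
        "variable_id": ["zos", "tas", "zos"],
        "experiment_id": ["ssp585", "ssp585", "historical"],
        "zstore": ["gs://bucket/a", "gs://bucket/b", "gs://bucket/c"],
    }
)

# Boolean query strings let you combine column conditions with & and |
subset = catalog.query("variable_id == 'zos' & experiment_id == 'ssp585'")
print(subset.zstore.tolist())  # -> ['gs://bucket/a']
```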
Below, we'll filter to CMIP6 experiments matching:
- Sea Surface Height Above Geoid [m] (variable_id: `zos`)
- Shared Socioeconomic Pathway 5 (experiment_id: `ssp585`)

```python
df_zos = df.query("variable_id == 'zos' & experiment_id == 'ssp585'")
df_zos
```
There are 272 modelled scenarios for SSP5.
Let's just get the URL to the first one in the list for now.

```python
print(df_zos.zstore.iloc[0])
```
## Reading from the remote Zarr storage

In many cases, you'll need to first connect to the cloud provider.
The CMIP6 dataset allows anonymous access, but in some cases
you may need to authenticate.

```python
fs = gcsfs.GCSFileSystem(token="anon")
```
Next, we'll need a mapping to the Google Cloud Storage object.
This can be done using `fs.get_mapper`.

A more generic way (for other cloud providers) is to use
[`fsspec.get_mapper`](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.get_mapper) instead.

```python
store = fs.get_mapper(
    "gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/"
)
```
With that, we can open the Zarr store like so.

```python
ds = xr.open_zarr(store=store, consolidated=True)
ds
```
### Selecting time slices

Let's say we want to calculate sea level change between
2015 and 2100. We can access just the specific time points
needed using [`xr.Dataset.sel`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.sel.html).

```python
zos_2015jan = ds.zos.sel(time="2015-01-16").squeeze()
zos_2100dec = ds.zos.sel(time="2100-12-16").squeeze()
```
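`.sel` picks values by coordinate label rather than by integer position. A minimal sketch on a synthetic monthly series (the real `ds.zos` uses the model's own mid-month timestamps, which is why exact dates like `2015-01-16` appear above):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a monthly time series with three timestamps
times = pd.date_range("2015-01-01", periods=3, freq="MS")
da = xr.DataArray(np.array([0.1, 0.2, 0.3]), coords={"time": times}, dims="time")

# Select by coordinate label; method="nearest" tolerates an inexact timestamp
print(float(da.sel(time="2015-02-01")))                    # -> 0.2
print(float(da.sel(time="2015-02-10", method="nearest")))  # -> 0.2
```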
Sea level change would just be 2100 minus 2015.

```python
sealevelchange = zos_2100dec - zos_2015jan
```
Note that up to this point, we have not actually downloaded any
(big) data from the cloud yet. This is all working based on
metadata only.

To bring the data from the cloud to your local computer, call `.compute`.
This will take a while depending on your connection speed.

```python
sealevelchange = sealevelchange.compute()
```
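The lazy-then-`.compute` pattern can be reproduced on synthetic data. This sketch assumes `dask` is installed (as it is wherever `open_zarr` returns chunked arrays):

```python
import numpy as np
import xarray as xr

# .chunk() backs the DataArray with a lazy dask array, mimicking what
# xr.open_zarr returns for cloud-hosted Zarr data
lazy = xr.DataArray(np.arange(6.0), dims="x").chunk({"x": 3})

diff = lazy - 1.0        # builds a task graph; no arithmetic has run yet
result = diff.compute()  # now the work (and, for cloud data, the download) happens
print(result.values)     # -> [-1.  0.  1.  2.  3.  4.]
```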
We can make a quick plot to show how sea level is predicted to change
between 2015 and 2100 (from one modelled experiment).

```python
sealevelchange.plot.imshow()
```
Notice the blue parts between 40°S and 60°S where sea level has dropped?
That's due to the Antarctic ice sheet losing mass, which lowers its
gravitational pull and produces a relative decrease in nearby sea level. Over most
of the Northern Hemisphere though, sea level rises between 2015 and 2100.

That's all! Hopefully this will get you started on accessing more cloud-native datasets!
