Commit f4c7e51

Remote data access tutorial for CMIP6 Zarr data (#132)

Authored by weiji14
Co-authored-by: Julius Busecke <[email protected]>
Co-authored-by: dcherian <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <[email protected]>
1 parent cc61978 commit f4c7e51

File tree: 2 files changed, +333 −0 lines changed


_toc.yml (1 addition, 0 deletions)

@@ -37,6 +37,7 @@ parts:
   - file: intermediate/xarray_and_dask
   - file: intermediate/xarray_ecosystem
   - file: intermediate/hvplot
+  - file: intermediate/cmip6-cloud
   - file: data_cleaning/ice_velocity.ipynb

 - caption: Advanced

intermediate/cmip6-cloud.ipynb (332 additions, 0 deletions)
# Accessing remote data stored on the cloud

In this tutorial, we'll cover the following:
- Finding a cloud-hosted Zarr archive of CMIP6 dataset(s)
- Accessing a single CMIP6 dataset (sea surface height) remotely
- Calculating the projected sea level change in 2100 compared to 2015
```python
import gcsfs
import pandas as pd
import xarray as xr
```
## Finding cloud-native data

Cloud-native data means data that is structured for efficient querying across the network.
Typically, this means the metadata describing the entire file sits in the file's header,
or in a separate pointer file, so that there is no need to download everything first.

Quite commonly, you'll see cloud-native datasets stored on these
three object storage providers, though there are many others too:

- [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3)
- [Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs)
- [Google Cloud Storage](https://cloud.google.com/storage)
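The "metadata first" idea can be sketched locally. Assuming `fsspec` is installed (it is the library underlying `gcsfs`), its in-memory filesystem mimics a flat object store; `demo-bucket` here is a made-up name standing in for a real `gs://` or `s3://` bucket:

```python
import fsspec

# An object store is essentially a flat mapping of keys to bytes. The "memory"
# filesystem mimics one locally, with no network involved.
fs = fsspec.filesystem("memory")
with fs.open("memory://demo-bucket/dataset.zarr/.zmetadata", "wb") as f:
    f.write(b'{"zarr_consolidated_format": 1}')

# Fetching one small metadata key does not touch any other object -- the same
# property that lets a reader inspect a huge Zarr store without downloading it.
print(fs.cat("memory://demo-bucket/dataset.zarr/.zmetadata"))
```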
### Getting cloud-hosted CMIP6 data

The [Coupled Model Intercomparison Project Phase 6 (CMIP6)](https://en.wikipedia.org/wiki/CMIP6#CMIP_Phase_6)
dataset is a rich archive of modelling experiments carried out to project climate change impacts.
The datasets are stored in the [Zarr](https://zarr.dev) format, and we'll go over how to access them.

Sources:
- https://esgf-node.llnl.gov/search/cmip6/
- CMIP6 data hosted on Google Cloud - https://console.cloud.google.com/marketplace/details/noaa-public/cmip6
- Pangeo/ESGF Cloud Data Access tutorial - https://pangeo-data.github.io/pangeo-cmip6-cloud/accessing_data.html
First, let's open a CSV containing the list of CMIP6 datasets available.

```python
df = pd.read_csv("https://cmip6.storage.googleapis.com/pangeo-cmip6.csv")
print(f"Number of rows: {len(df)}")
df.head()
```
Over 5 million rows! Let's filter it down to the variable and experiment
we're interested in, e.g. sea surface height.

For the `variable_id`, you can look it up by keyword at
https://docs.google.com/spreadsheets/d/1UUtoz6Ofyjlpx5LdqhKcwHFz2SGoTQV2_yekHyMfL9Y

For the `experiment_id`, download the spreadsheet from
https://github.com/ES-DOC/esdoc-docs/blob/master/cmip6/experiments/spreadsheet/experiments.xlsx,
go to the 'experiment' tab, and find the one you're interested in.

Another good place to find the right model runs is https://esgf-node.llnl.gov/search/cmip6
(once you get your head around the acronyms and short names).
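As a quick sketch of this kind of catalog filtering, here is the same `pandas` query-string pattern applied to a tiny, made-up table with the same columns as `pangeo-cmip6.csv` (the rows and `zstore` URLs below are invented for illustration):

```python
import pandas as pd

# A made-up three-row stand-in for the real multi-million-row catalog
catalog = pd.DataFrame(
    {
        "variable_id": ["zos", "tas", "zos"],
        "experiment_id": ["ssp585", "ssp585", "historical"],
        "zstore": ["gs://bucket/a", "gs://bucket/b", "gs://bucket/c"],
    }
)

# Boolean query strings let you combine column conditions with & and |
subset = catalog.query("variable_id == 'zos' & experiment_id == 'ssp585'")
print(subset.zstore.tolist())  # -> ['gs://bucket/a']
```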
Below, we'll filter to CMIP6 experiments matching:
- Sea Surface Height Above Geoid [m] (variable_id: `zos`)
- Shared Socioeconomic Pathway 5 (experiment_id: `ssp585`)

```python
df_zos = df.query("variable_id == 'zos' & experiment_id == 'ssp585'")
df_zos
```
There are 272 modelled scenarios for SSP5.
Let's just get the URL to the first one in the list for now.

```python
print(df_zos.zstore.iloc[0])
```
## Reading from the remote Zarr storage

In many cases, you'll need to first connect to the cloud provider.
The CMIP6 dataset allows anonymous access, but in some cases
you may need to authenticate.

```python
fs = gcsfs.GCSFileSystem(token="anon")
```
Next, we'll need a mapping to the Google Cloud Storage object.
This can be done using `fs.get_mapper`.

A more generic way (for other cloud providers) is to use
[`fsspec.get_mapper`](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.get_mapper) instead.

```python
store = fs.get_mapper(
    "gs://cmip6/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp585/r1i1p1f1/Omon/zos/gn/v20180701/"
)
```
With that, we can open the Zarr store like so.

```python
ds = xr.open_zarr(store=store, consolidated=True)
ds
```
### Selecting time slices

Let's say we want to calculate sea level change between
2015 and 2100. We can access just the specific time points
needed using [`xr.Dataset.sel`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.sel.html).

```python
zos_2015jan = ds.zos.sel(time="2015-01-16").squeeze()
zos_2100dec = ds.zos.sel(time="2100-12-16").squeeze()
```
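`.sel` picks values by coordinate label rather than by integer position. A minimal sketch on a synthetic monthly series (the real `ds.zos` uses the model's own mid-month timestamps, which is why exact dates like `2015-01-16` appear above):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a monthly time series with three timestamps
times = pd.date_range("2015-01-01", periods=3, freq="MS")
da = xr.DataArray(np.array([0.1, 0.2, 0.3]), coords={"time": times}, dims="time")

# Select by coordinate label; method="nearest" tolerates an inexact timestamp
print(float(da.sel(time="2015-02-01")))                    # -> 0.2
print(float(da.sel(time="2015-02-10", method="nearest")))  # -> 0.2
```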
Sea level change would just be 2100 minus 2015.

```python
sealevelchange = zos_2100dec - zos_2015jan
```
Note that up to this point, we have not actually downloaded any
(big) data from the cloud yet. This is all working based on
metadata only.

To bring the data from the cloud to your local computer, call `.compute`.
This will take a while depending on your connection speed.

```python
sealevelchange = sealevelchange.compute()
```
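The lazy-then-`.compute` pattern can be reproduced on synthetic data. This sketch assumes `dask` is installed (as it is wherever `open_zarr` returns chunked arrays):

```python
import numpy as np
import xarray as xr

# .chunk() backs the DataArray with a lazy dask array, mimicking what
# xr.open_zarr returns for cloud-hosted Zarr data
lazy = xr.DataArray(np.arange(6.0), dims="x").chunk({"x": 3})

diff = lazy - 1.0        # builds a task graph; no arithmetic has run yet
result = diff.compute()  # now the work (and, for cloud data, the download) happens
print(result.values)     # -> [-1.  0.  1.  2.  3.  4.]
```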
We can make a quick plot to show how sea level is predicted to change
between 2015 and 2100 (from one modelled experiment).

```python
sealevelchange.plot.imshow()
```
Notice the blue parts between 40°S and 60°S where sea level has dropped?
That's due to the Antarctic ice sheet losing mass, which lowers its
gravitational pull and produces a relative decrease in nearby sea level. Over most
of the Northern Hemisphere though, sea level rises between 2015 and 2100.

That's all! Hopefully this will get you started on accessing more cloud-native datasets!
