NcML File Loader for Iris
This page describes an experimental NcML file format handler for the Python-based Iris scientific data analysis package.
The NetCDF Markup Language (NcML) defines a mechanism for combining data distributed across multiple input files (or URL sources). Iris currently supports a number of NcML-like features; for example, the way in which Iris can aggregate data from several input files into a single cubelist is comparable to NcML's union feature.
To try out the NcML file loader, create a new git remote pointing at the URL git://github.com/rockdoc/iris.git, then check out the branch named ncml-two from this new remote repository. Here's how to do that, assuming you already have a local clone of the Iris repository:
$ git remote add rockdoc git://github.com/rockdoc/iris.git
$ git fetch rockdoc
$ git branch -r
$ git checkout rockdoc/ncml-two
Caveat: Although the NcML loader supports many of the core data aggregation capabilities specified in the NcML standard, it's definitely got some gaps and rough edges. Hopefully these can be addressed in future releases.
On the plus side, the NcML loader only creates an in-memory representation of the data (i.e. as collections of Iris cubes) so it shouldn't be possible to inadvertently modify your existing data files!
The NcML file loader is not enabled by default. To activate it for use, issue the following Python statements:
>>> from iris.experimental.fileformats import register_format_handler
>>> register_format_handler('ncml')
Registered handler for file format: ncml
The NcML loader should be able to deal with many of the data aggregation and metadata manipulation tasks described in the NcML tutorial. Below you'll find a selection of examples of some handy things you can do using the NcML loader.
You might also want to check out the NcML cookbook.
Define one or more data cubes directly within an NcML file
In this example, all the data is defined within a single NcML file. There are no references to external data files. This provides a fairly easy means of serializing 'toy' Iris cubes for, say, testing purposes, or for exploring the Iris environment interactively.
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<!-- define dimensions -->
<dimension name="lat" length="3"/>
<dimension name="lon" length="6"/>
<dimension name="time" length="2"/>
<!-- define coordinate variables for each of the above dimensions -->
<variable name="lat" type="float" shape="lat">
<attribute name="standard_name" type="string" value="latitude"/>
<attribute name="units" type="string" value="degrees_north"/>
<values start="0.0" increment="10.0" npts="3"/>
</variable>
<variable name="lon" type="float" shape="lon">
<attribute name="standard_name" type="string" value="longitude"/>
<attribute name="units" type="string" value="degrees_east"/>
<values start="0.0" increment="10.0" npts="6"/>
</variable>
<variable name="time" type="int" shape="time">
<attribute name="standard_name" type="string" value="time"/>
<attribute name="units" type="string" value="days since 2000-01-01 0:0:0"/>
<attribute name="calendar" type="string" value="360_day"/>
<values>15 45</values>
</variable>
<!-- define a temperature variable dimensioned (time, lat, lon) -->
<variable name="tas" type="float" shape="time lat lon">
<attribute name="standard_name" type="string" value="air_temperature"/>
<attribute name="units" type="string" value="celsius"/>
<values start="1.0" increment="1.0" npts="36"/>
</variable>
<!-- define a precipitation variable dimensioned (lat, lon) -->
<variable name="precip" type="float" shape="lat lon">
<attribute name="standard_name" type="string" value="rainfall_amount"/>
<attribute name="long_name" type="string" value="Precipitation"/>
<attribute name="units" type="string" value="kg m-2"/>
<values start="0.0" increment="0.1" npts="18"/>
</variable>
<attribute name="title" type="string" value="Self-contained NcML file"/>
</netcdf>
If this NcML snippet were saved in a file called pure_data.ncml, then the following Iris statements would return a cubelist containing 2 cubes:
>>> import iris
>>> cubes = iris.load('pure_data.ncml')
>>> print(cubes)
0: air_temperature / (K) (time: 2; latitude: 3; longitude: 6)
1: rainfall_amount / (kg m-2) (latitude: 3; longitude: 6)
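When writing a self-contained NcML file by hand, it is easy to declare a number of values that doesn't match a variable's shape. The following is a minimal sketch, using only the Python standard library, of a pre-flight check that compares the two. It is not part of the NcML loader; the function name check_value_counts and the trimmed-down SAMPLE document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from math import prod

# Namespace used by the NcML examples on this page.
NS = {"nc": "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"}

# A cut-down version of the self-contained example (illustrative only).
SAMPLE = """
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <dimension name="lat" length="3"/>
  <dimension name="lon" length="6"/>
  <dimension name="time" length="2"/>
  <variable name="time" type="int" shape="time">
    <values>15 45</values>
  </variable>
  <variable name="tas" type="float" shape="time lat lon">
    <values start="1.0" increment="1.0" npts="36"/>
  </variable>
</netcdf>
"""


def check_value_counts(ncml_text):
    """Return {variable name: (size implied by shape, number of values declared)}."""
    root = ET.fromstring(ncml_text)
    # Map each dimension name to its declared length.
    dims = {d.get("name"): int(d.get("length"))
            for d in root.findall("nc:dimension", NS)}
    report = {}
    for var in root.findall("nc:variable", NS):
        values = var.find("nc:values", NS)
        if values is None:
            continue  # variable carries no inline data
        # The shape attribute is a space-separated list of dimension names.
        expected = prod(dims[name] for name in var.get("shape", "").split())
        if values.get("npts") is not None:
            declared = int(values.get("npts"))      # start/increment/npts form
        else:
            declared = len((values.text or "").split())  # explicit value list
        report[var.get("name")] = (expected, declared)
    return report


for name, (expected, declared) in check_value_counts(SAMPLE).items():
    status = "ok" if expected == declared else "MISMATCH"
    print(name, expected, declared, status)
```

Any variable reported as a mismatch is worth fixing before handing the file to the loader.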
Simple union of data variables from multiple input sources
NcML's union feature is pretty straightforward to use. In the simplest case we just wrap a series of data source declarations (netCDF files in this case) inside an <aggregation> element.
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<aggregation type="union">
<netcdf location="/path/to/your/data/precip.nc"/>
<netcdf location="/path/to/your/data/windspeed.nc"/>
</aggregation>
<attribute name="title" type="string" value="Simple union aggregation test"/>
</netcdf>
Loading this example NcML file will result in a cubelist comprising all of the distinct data variables contained within the specified input files. It's easy enough of course to do this kind of data union operation directly in Iris by passing multiple file paths to the iris.load() function.
One potential benefit provided by the NcML approach, however, lies in the ability to persist desired metadata changes, e.g. renaming variables, adding or modifying attributes, and so on, within the NcML 'virtual dataset'. In this way it becomes a convenient 'write-once, use-many' aggregate dataset.
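If the list of input files is already known in a script, a union document of the kind shown above can be generated rather than typed by hand. The sketch below uses only the Python standard library; the function name build_union_ncml and the file paths are hypothetical, and the output is simply the XML structure from the example, not anything specific to the NcML loader.

```python
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"


def build_union_ncml(paths, title=None):
    """Return NcML text declaring a union aggregation over the given file paths."""
    # Serialise NcML elements using a default namespace (no prefix).
    ET.register_namespace("", NCML_NS)
    root = ET.Element("{%s}netcdf" % NCML_NS)
    agg = ET.SubElement(root, "{%s}aggregation" % NCML_NS, {"type": "union"})
    for path in paths:
        ET.SubElement(agg, "{%s}netcdf" % NCML_NS, {"location": path})
    if title is not None:
        # Optional global title attribute, as in the hand-written example.
        ET.SubElement(root, "{%s}attribute" % NCML_NS,
                      {"name": "title", "type": "string", "value": title})
    return ET.tostring(root, encoding="unicode")


# Hypothetical input paths, mirroring the example above.
print(build_union_ncml(
    ["/path/to/your/data/precip.nc", "/path/to/your/data/windspeed.nc"],
    title="Simple union aggregation test"))
```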
Simple union of datasets scanned from a directory (or directory hierarchy)
This example is similar to the previous one, but instead of specifying individual data files here we define a directory to scan for filenames ending with a .nc extension. The attributes subdirs and olderThan, both of which are optional, are used here to a) recurse into subdirectories, and b) select only those files that were last modified more than 10 minutes before the current system time. Other options are available; refer to the NcML documentation.
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<aggregation type="union">
<scan location="/path/to/your/data/dir" suffix=".nc" subdirs="true" olderThan="10 min"/>
</aggregation>
<attribute name="title" type="string" value="Union-with-scan aggregation test"/>
</netcdf>
As before, this functionality is now also provided by Iris's assorted data loading functions. Nonetheless, being able to capture the definition of an aggregate dataset in a fairly clean, concise way has its merits and applications.
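To make the meaning of the scan attributes concrete, here is a rough Python approximation of the file selection they describe. It is a sketch of the semantics only, not the NcML loader's implementation: the function name scan_files is hypothetical, and the olderThan duration is assumed to have already been converted to seconds (600 for "10 min").

```python
import os
import time


def scan_files(location, suffix=".nc", subdirs=True,
               older_than_seconds=None, now=None):
    """Return sorted paths under location that satisfy the scan criteria."""
    now = time.time() if now is None else now
    matches = []
    for dirpath, dirnames, filenames in os.walk(location):
        if not subdirs:
            # Prune the walk so that subdirectories are not visited.
            dirnames[:] = []
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            if older_than_seconds is not None:
                # Skip files modified too recently (e.g. still being written).
                if now - os.path.getmtime(path) < older_than_seconds:
                    continue
            matches.append(path)
    return sorted(matches)
```

A call such as scan_files("/path/to/your/data/dir", older_than_seconds=600) would then correspond roughly to the <scan> element in the example.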
Aggregation of data variables along an existing coordinate
It's not uncommon to have large datasets partitioned across multiple data files, the data being split along one of the dataset's common dimensions. Often this is the time dimension since many problem domains involve long time-series of data.
NcML's joinExisting feature may be used to aggregate such datasets, as illustrated in the fragment of XML shown below. It's assumed that the specified files contain a contiguous sequence of data across the aggregation dimension (time, in this example). The NcML loader makes use of Iris's concatenation functions to aggregate the data into a minimum number of cubes; usually just a single cube if the files each contain the same data variable.
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<!-- aggregate existing files over the time dimension -->
<aggregation type="joinExisting" dimName="time">
<netcdf location="/path/to/your/data/file1.nc"/>
<netcdf location="/path/to/your/data/file2.nc"/>
<netcdf location="/path/to/your/data/file3.nc"/>
</aggregation>
</netcdf>
Note: If the data files contain inconsistent metadata (e.g. mis-matched history or timestamp attributes) then this can impede the concatenation operation.
In addition to the basic example above, it's also possible to redefine the coordinates assigned to the aggregation dimension. In the NcML fragment below, the time coordinates are redefined so as to reference the same time origin (i.e. midnight on Jan 1, 2000). This mechanism can also be used to fix things like incorrect calendar settings.
<!-- declare the time dimension -->
<dimension name="time" length="3"/>
<!-- redefine the coordinate variable for the time dimension -->
<variable name="time" type="float" shape="time">
<attribute name="standard_name" type="string" value="time"/>
<attribute name="units" type="string" value="days since 2000-01-01 0:0:0"/>
<attribute name="calendar" type="string" value="360_day"/>
<values start="15" increment="30" npts="3"/>
</variable>
<aggregation type="joinExisting" dimName="time">
...
</aggregation>
In this last example, the new coordinate values could alternatively be specified within the body of the <values> element, e.g.
...
<variable name="time" type="float" shape="time">
...
<values>15 45 75</values>
</variable>
...
In either case, the number of coordinate values defined in the NcML file must match the total length of the aggregated dataset as serialized across the specified set of input data files.
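As a worked example of the arithmetic involved, the sketch below expands the start/increment/npts form into explicit values, checks the count against the total number of time steps across the input files, and converts each value to a date in the 360_day calendar. The helper names and the per-file lengths are assumptions for illustration; the conversion handles whole days only.

```python
def coordinate_values(start, increment, npts):
    """Expand <values start=... increment=... npts=.../> into a list."""
    return [start + i * increment for i in range(npts)]


def to_360_day_date(days, origin=(2000, 1, 1)):
    """Convert whole days since origin to (year, month, day), 360-day calendar."""
    year, month, day = origin
    # Every year has 12 months of exactly 30 days in this calendar.
    total = year * 360 + (month - 1) * 30 + (day - 1) + days
    year, remainder = divmod(total, 360)
    month, day = divmod(remainder, 30)
    return (year, month + 1, day + 1)


values = coordinate_values(15, 30, 3)   # same as <values>15 45 75</values>

# Hypothetical number of time steps held in file1.nc, file2.nc and file3.nc.
file_lengths = [1, 1, 1]
assert len(values) == sum(file_lengths), "coordinate count must match the data"

for v in values:
    print(v, to_360_day_date(v))
```

With the origin used in the example (days since 2000-01-01), the values 15, 45 and 75 fall on the 16th day of January, February and March 2000 respectively.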
Note: the Iris package includes functionality for extracting time information from the names of input files. Refer to the Iris documentation for examples of how to do this.
Aggregation of data variables along a newly-defined coordinate
On occasions, it's desirable to aggregate a multi-file dataset across a common dimension, but the coordinates for that dimension either are not explicitly encoded within the input files, or the coordinate information is present but not easily extracted using available tools.
In these situations, NcML's joinNew aggregation feature can be used to specify the coordinates (and supporting metadata) for the aggregation dimension. The syntax of a joinNew aggregation is very similar to that of the joinExisting aggregation seen earlier.
In the example below we aggregate chunks of a data variable called rainfall along the time dimension.
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<!-- define a new time dimension -->
<dimension name="time" length="3"/>
<!-- define a new coordinate variable for the time dimension -->
<variable name="time" type="int" shape="time">
<attribute name="standard_name" type="string" value="time"/>
<attribute name="units" type="string" value="days since 2006-01-01 0:0:0"/>
<attribute name="calendar" type="string" value="360_day"/>
<values>15 45 75</values>
</variable>
<aggregation type="joinNew" dimName="time">
<variableAgg name="rainfall"/>
<netcdf location="file1.nc"/>
<netcdf location="file2.nc"/>
<netcdf location="file3.nc"/>
</aggregation>
</netcdf>
Multiple aggregation variables can be requested by inserting additional <variableAgg> elements, one for each variable.
The NcML loader uses Iris's merge_cube() function to merge the full data for a given variable into a single cube. As with the joinExisting operation, however, conflicting metadata attached to the input data can cause this operation to fail. Thus, it's worth checking that the metadata is consistent before attempting this kind of aggregation.
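One simple way to carry out such a check is to compare the attribute dictionaries of the inputs and list any attributes whose values differ. The following is a minimal sketch operating on plain Python dictionaries; the function name and the sample attributes are hypothetical. With Iris, the dictionaries could be taken from each cube's attributes property.

```python
def find_attribute_conflicts(attribute_sets):
    """Return {attribute name: distinct values seen} for attributes that differ.

    A missing attribute is recorded as None, so an attribute present in some
    inputs but absent from others is also reported.
    """
    names = set()
    for attrs in attribute_sets:
        names.update(attrs)
    conflicts = {}
    for name in sorted(names):
        seen = []
        for attrs in attribute_sets:
            value = attrs.get(name)
            if value not in seen:
                seen.append(value)
        if len(seen) > 1:
            conflicts[name] = seen
    return conflicts


# Hypothetical global attributes read from three input files.
file_attrs = [
    {"source": "model A", "history": "created 2006-01-16"},
    {"source": "model A", "history": "created 2006-02-16"},
    {"source": "model A"},
]
print(find_attribute_conflicts(file_attrs))
```

Attributes flagged in this way (history, in this made-up case) are candidates for removal or harmonisation before the aggregation is attempted.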
Add, modify and remove metadata
One of the nice features of NcML is the ability to easily add new metadata attributes to data variables, or else modify or remove existing attributes. In this example, let's assume we have a netCDF file called tempdata.nc which contains a temperature variable called tas. We want to add a long_name attribute, fix the units attribute so as to be CF-compliant, and remove a redundant attribute called platform. While we're at it we'll add a global source attribute which will get attached to all cubes.
Here's how we might define such attribute updates for a temperature variable within an NcML file.
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" location="tempdata.nc">
...
<!-- modify metadata attributes for variable(s) named 'tas' -->
<variable name="tas" type="float">
<attribute name="long_name" type="string" value="10m Air Temperature"/>
<attribute name="units" type="string" value="degC"/>
<remove name="platform" type="attribute"/>
</variable>
<attribute name="source" type="string" value="Data produced by GTi earth system model."/>
...
</netcdf>
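The effect of these declarations can be pictured as a series of updates to a variable's attribute dictionary. The sketch below, which uses only the Python standard library, applies the <attribute> and <remove> children of a named <variable> element to a dictionary. It illustrates the intended semantics and is not the loader's code; the function name, the trimmed-down NcML text and the starting attributes are all assumptions.

```python
import xml.etree.ElementTree as ET

NS = {"nc": "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"}

# Cut-down NcML fragment for a variable named 'tas' (illustrative only).
EDITS = """
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <variable name="tas" type="float">
    <attribute name="long_name" type="string" value="10m Air Temperature"/>
    <attribute name="units" type="string" value="degC"/>
    <remove name="platform" type="attribute"/>
  </variable>
</netcdf>
"""


def apply_variable_edits(ncml_text, var_name, attributes):
    """Return a copy of attributes with the NcML edits for var_name applied."""
    root = ET.fromstring(ncml_text)
    result = dict(attributes)
    for var in root.findall("nc:variable", NS):
        if var.get("name") != var_name:
            continue
        for child in var:
            tag = child.tag.split("}")[-1]  # strip the namespace prefix
            if tag == "attribute":
                # Adds a new attribute or overwrites an existing one.
                result[child.get("name")] = child.get("value")
            elif tag == "remove" and child.get("type") == "attribute":
                result.pop(child.get("name"), None)
    return result


# Hypothetical attributes as they might appear in the original file.
original = {"units": "deg C", "platform": "buoy"}
print(apply_variable_edits(EDITS, "tas", original))
```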
Renaming a data variable
Renaming a data variable is straightforward. In this example, the variable precip is renamed to rainfall.
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" location="precip.nc">
<variable name="rainfall" orgName="precip"/>
</netcdf>
In Iris terms this results in a cube's var_name attribute being set to rainfall. If you need to set, or update, the CF standard name or long name properties, use <attribute> elements to do that, e.g.
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" location="precip.nc">
<variable name="rainfall" orgName="precip">
<attribute name="standard_name" type="string" value="rainfall_amount"/>
<attribute name="long_name" type="string" value="Rainfall amount in imperial buckets/day"/>
...
</variable>
</netcdf>