ENH: sample chunk type optimized for vectorized reading #19

Current state of chunk 3

The current sample chunk type (tag 3) requires at least one branch per sample (two for string samples), so both the Python and the Matlab implementations require compiled extensions.

Also, the current code to "compress" timestamps (by omitting those that exactly match the deduced value) isn't helping much due to the usual floating-point equality problems, even under perfect conditions:

In [1]: import numpy as np

In [2]: ts = np.arange(151500, 151600, 1/1000)

In [3]: np.sum(np.diff(ts)==1/1000)
Out[3]: 0

In [4]: np.diff(ts)
Out[4]: array([0.001, 0.001, 0.001, ..., 0.001, 0.001, 0.001])

In [5]: np.diff(ts)-1/1000
Out[5]: 
array([-1.07102096e-11, -1.07102096e-11, -1.07102096e-11, ...,
       -1.07102096e-11, -1.07102096e-11, -1.07102096e-11])

In [6]: np.mean(np.diff(np.arange(0, 10, 1/100))==1/100)
Out[6]: 0.002 # i.e. only 0.2% of the timestamps can actually be left out

Proposal: new chunk type

My proposal for a new chunk type therefore looks as follows:

[Tag 7, uint16] [streamid, uint32] [n_samples, uint32] [n_channels, uint32] [typeid, uint8]   # chunk header
[timestamps, n_samples * double64]                    # timestamps; 0 = left out
[samples, n_channels * n_samples * channel_type]      # sample data

Each chunk would require at most 3 reads:

# chunk header: stream id, sample count, channel count, type id
streamid, n_samples, n_channels, typeid = struct.unpack('<IIIB', f.read(13))
stream = stream_info[streamid]
# one float64 timestamp per sample (0 = omitted)
timestamps = np.frombuffer(f.read(n_samples * 8), np.float64)
# all sample values in one contiguous block
data = np.frombuffer(f.read(n_samples * stream.samplebytes), stream.dtype).reshape(n_samples, n_channels)

The chunk header includes the sample count, channel count and data type for increased robustness (errors can be detected earlier) and so that the data can be read without an XML parser.
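
The typeid numbering is not fixed by this proposal; as a minimal sketch, assuming it reuses the order of LSL's channel_format_t enum, the reader-side lookup could look like this (TYPEID_DTYPE and chunk_dtype are hypothetical names):

import numpy as np

# Hypothetical typeid -> dtype mapping, assuming LSL's channel_format_t
# numbering (1 = float32, 2 = double64, 4 = int32, 5 = int16, 6 = int8,
# 7 = int64); the actual values are still an open design detail.
TYPEID_DTYPE = {
    1: np.dtype('<f4'),  # float32
    2: np.dtype('<f8'),  # double64
    4: np.dtype('<i4'),  # int32
    5: np.dtype('<i2'),  # int16
    6: np.dtype('<i1'),  # int8
    7: np.dtype('<i8'),  # int64
}

def chunk_dtype(typeid, n_channels):
    # Returns the per-value dtype and the number of bytes per sample
    # (all channels), i.e. what stream.samplebytes stands for above.
    dtype = TYPEID_DTYPE[typeid]
    return dtype, dtype.itemsize * n_channels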

For the timestamp array, a value of 0 (0x0000000000000000 in IEEE 754 representation) indicates an omitted timestamp that is recomputed as usual. Since the timestamps and samples are stored as contiguous arrays instead of many small interleaved records, the chunk is also more compression friendly, so a gzip-compressed file requires less space as well.
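
A minimal sketch of the reader-side reconstruction, assuming the usual XDF deduction rule (an omitted timestamp equals the previous one plus the nominal sampling interval) and that the first timestamp of a chunk is always written; fill_timestamps is a hypothetical helper:

import numpy as np

def fill_timestamps(timestamps, nominal_srate):
    # Assumes the first timestamp in the array is non-zero and that an
    # omitted (zero) timestamp is the previous one plus the nominal
    # sampling interval, as in the existing deduction rule.
    ts = timestamps.copy()
    interval = 1.0 / nominal_srate
    for i in np.flatnonzero(ts == 0):
        ts[i] = ts[i - 1] + interval
    return ts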

Implementation

The writer is implemented in https://github.com/labstreaminglayer/App-LabRecorder/tree/xdfwriter_chunk7. The main serialization code looks as follows:

// chunk header: tag (samples_v2, i.e. 7), chunk length, stream id
_write_chunk_header(chunk_tag_t::samples_v2, len, &streamid);
write_little_endian(file_, n_channels);
write_little_endian(file_, n_samples);
write_little_endian(file_, sampletype);
// one timestamp per sample, then all sample values in one block
write_sample_values(file_, timestamps);
write_sample_values(file_, chunk, n_samples * n_channels);
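
For comparison, a hypothetical Python counterpart (not part of the branch) that packs the chunk body in the field order documented above, e.g. for round-trip tests against the reader snippet; pack_chunk7_body is an invented helper, and the outer tag/length header emitted by _write_chunk_header is left out:

import struct
import numpy as np

def pack_chunk7_body(streamid, timestamps, data, typeid):
    # data: (n_samples, n_channels) array already in the stream's channel
    # type; timestamps: one float64 per sample, 0 for omitted stamps.
    n_samples, n_channels = data.shape
    header = struct.pack('<IIIB', streamid, n_samples, n_channels, typeid)
    return (header
            + np.asarray(timestamps, '<f8').tobytes()
            + np.ascontiguousarray(data).tobytes())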

The reader is implemented in https://github.com/tstenner/xdf/commit/412998f3ed66cc7e85a93d77a1d7c96f81073ff0.

Benchmark (more below)

A single XDF file with one 64-channel EEG (double) stream was created with 30 chunks, each containing 10K samples.

          Reading (xdfz)   Reading (xdf)
Chunk 3:  3.094 s          1.704 s
Chunk 7:  0.961 s          0.399 s
In both cases the new chunk type also produces smaller files: 156.3 vs. 156.0 MiB uncompressed and 36.0 vs. 34.9 MiB compressed with gzip -6.

Open questions

[ ] layout for string chunks
[ ] optional padding so data starts at page boundaries (e.g. for mmap; see the sketch after this list)?
[ ] should the redundant type index be included?
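
To illustrate the mmap point: if the sample block started at a page-aligned offset, a reader could map it directly instead of copying it. A sketch with placeholder values (offset and shape would come from the chunk header):

import numpy as np

# Placeholder values; in practice these come from the chunk header.
sample_block_offset = 4096   # hypothetical page-aligned file offset
n_samples, n_channels = 10000, 64

# Maps the sample block of a double64 stream without copying it.
data = np.memmap('recording.xdf', dtype='<f8', mode='r',
                 offset=sample_block_offset,
                 shape=(n_samples, n_channels))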
