ENH: sample chunk type optimized for vectorized reading #19

Current state of chunk 3

The current sample chunk type (tag 3) requires at least one branch per sample (two for string samples), so both the Python and the Matlab implementations require compiled extensions.

Also, the current code to "compress" timestamps (by omitting those that exactly match the deduced value) isn't helping much due to the usual floating-point equality problems, even under perfect conditions:

In [1]: import numpy as np

In [2]: ts = np.arange(151500, 151600, 1/1000)

In [3]: np.sum(np.diff(ts)==1/1000)
Out[3]: 0

In [4]: np.diff(ts)
Out[4]: array([0.001, 0.001, 0.001, ..., 0.001, 0.001, 0.001])

In [5]: np.diff(ts)-1/1000
Out[5]: 
array([-1.07102096e-11, -1.07102096e-11, -1.07102096e-11, ...,
       -1.07102096e-11, -1.07102096e-11, -1.07102096e-11])

In [6]: np.mean(np.diff(np.arange(0, 10, 1/100))==1/100)
Out[6]: 0.002 # i.e. only 0.2% of the timestamps can actually be left out

Proposal: new chunk type

My proposal for a new chunk type therefore looks as follows:

[Tag 7, uint16] [streamid, uint32] [n_samples, uint32] [n_channels, uint32] [typeid, uint8]   # chunk header
[timestamps, n_samples * double64]                    # timestamps; 0 = left out
[samples, n_channels * n_samples * channel_type]      # sample data

Each chunk would require at most 3 reads:

# chunk header: stream id, sample count, channel count, type id
streamid, n_samples, n_channels, typeid = struct.unpack('<IIIB', f.read(13))
stream = stream_info[streamid]
# one float64 timestamp per sample (0 = omitted)
timestamps = np.frombuffer(f.read(n_samples * 8), np.float64)
# all sample values in one contiguous block
data = np.frombuffer(f.read(n_samples * stream.samplebytes), stream.dtype).reshape(n_samples, n_channels)

The chunk header includes the sample count, channel count and data type for increased robustness (errors can be detected earlier) and so that the data can be read without an XML parser.
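
The typeid numbering is not fixed by this proposal; as a minimal sketch, assuming it reuses the order of LSL's channel_format_t enum, the reader-side lookup could look like this (TYPEID_DTYPE and chunk_dtype are hypothetical names):

import numpy as np

# Hypothetical typeid -> dtype mapping, assuming LSL's channel_format_t
# numbering (1 = float32, 2 = double64, 4 = int32, 5 = int16, 6 = int8,
# 7 = int64); the actual values are still an open design detail.
TYPEID_DTYPE = {
    1: np.dtype('<f4'),  # float32
    2: np.dtype('<f8'),  # double64
    4: np.dtype('<i4'),  # int32
    5: np.dtype('<i2'),  # int16
    6: np.dtype('<i1'),  # int8
    7: np.dtype('<i8'),  # int64
}

def chunk_dtype(typeid, n_channels):
    # Returns the per-value dtype and the number of bytes per sample
    # (all channels), i.e. what stream.samplebytes stands for above.
    dtype = TYPEID_DTYPE[typeid]
    return dtype, dtype.itemsize * n_channels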

For the timestamp array, a value of 0 (0x0000000000000000 in IEEE 754 representation) indicates an omitted timestamp that is recomputed as usual. Since the timestamps and samples are stored as contiguous arrays instead of many small interleaved records, the chunk is also more compression friendly, so a gzip-compressed file requires less space as well.
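
A minimal sketch of the reader-side reconstruction, assuming the usual XDF deduction rule (an omitted timestamp equals the previous one plus the nominal sampling interval) and that the first timestamp of a chunk is always written; fill_timestamps is a hypothetical helper:

import numpy as np

def fill_timestamps(timestamps, nominal_srate):
    # Assumes the first timestamp in the array is non-zero and that an
    # omitted (zero) timestamp is the previous one plus the nominal
    # sampling interval, as in the existing deduction rule.
    ts = timestamps.copy()
    interval = 1.0 / nominal_srate
    for i in np.flatnonzero(ts == 0):
        ts[i] = ts[i - 1] + interval
    return ts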

Implementation

The writer is implemented in https://github.com/labstreaminglayer/App-LabRecorder/tree/xdfwriter_chunk7. The main serialization code looks as follows:

// chunk header: tag (samples_v2, i.e. 7), chunk length, stream id
_write_chunk_header(chunk_tag_t::samples_v2, len, &streamid);
write_little_endian(file_, n_channels);
write_little_endian(file_, n_samples);
write_little_endian(file_, sampletype);
// one timestamp per sample, then all sample values in one block
write_sample_values(file_, timestamps);
write_sample_values(file_, chunk, n_samples * n_channels);
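
For comparison, a hypothetical Python counterpart (not part of the branch) that packs the chunk body in the field order documented above, e.g. for round-trip tests against the reader snippet; pack_chunk7_body is an invented helper, and the outer tag/length header emitted by _write_chunk_header is left out:

import struct
import numpy as np

def pack_chunk7_body(streamid, timestamps, data, typeid):
    # data: (n_samples, n_channels) array already in the stream's channel
    # type; timestamps: one float64 per sample, 0 for omitted stamps.
    n_samples, n_channels = data.shape
    header = struct.pack('<IIIB', streamid, n_samples, n_channels, typeid)
    return (header
            + np.asarray(timestamps, '<f8').tobytes()
            + np.ascontiguousarray(data).tobytes())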

The reader is implemented in https://github.com/tstenner/xdf/commit/412998f3ed66cc7e85a93d77a1d7c96f81073ff0.

Benchmark (more below)

A single XDF file with one 64-channel EEG (double) stream was created with 30 chunks, each containing 10K samples.

          Reading (xdfz)   Reading (xdf)
Chunk 3:  3.094 s          1.704 s
Chunk 7:  0.961 s          0.399 s
In both cases the new chunk type also produces smaller files: 156.3 vs. 156.0 MiB uncompressed and 36.0 vs. 34.9 MiB compressed with gzip -6.

Open questions

[ ] layout for string chunks
[ ] optional padding so data starts at page boundaries (e.g. for mmap; see the sketch after this list)?
[ ] should the redundant type index be included?
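
To illustrate the mmap point: if the sample block started at a page-aligned offset, a reader could map it directly instead of copying it. A sketch with placeholder values (offset and shape would come from the chunk header):

import numpy as np

# Placeholder values; in practice these come from the chunk header.
sample_block_offset = 4096   # hypothetical page-aligned file offset
n_samples, n_channels = 10000, 64

# Maps the sample block of a double64 stream without copying it.
data = np.memmap('recording.xdf', dtype='<f8', mode='r',
                 offset=sample_block_offset,
                 shape=(n_samples, n_channels))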
