Current state of chunk 3
The current sample chunk type (tag 3) requires at least one branch per sample (two for string samples), so both the Python and Matlab implementations require compiled extensions to read it at acceptable speed.
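For reference, a simplified sketch of what a tag-3 reader has to do per sample (names and the value layout here are illustrative, not the actual implementation; string channels would need yet another branch per value to read the length prefix first):

import struct

def read_chunk3_samples(f, n_samples, n_channels, value_size):
    samples = []
    for _ in range(n_samples):
        # branch per sample: is an 8-byte timestamp present?
        timestamp = None
        if f.read(1)[0] == 8:
            timestamp = struct.unpack('<d', f.read(8))[0]
        # fixed-width numeric values only; string values would require
        # an additional branch each for their length prefix
        values = f.read(n_channels * value_size)
        samples.append((timestamp, values))
    return samples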
Also, the current code to "compress" timestamps isn't helping much due to the usual floating point equality problems, even under perfect conditions:
In [1]: import numpy as np
In [2]: ts = np.arange(151500, 151600, 1/1000)
In [3]: np.sum(np.diff(ts)==1/1000)
Out[3]: 0
In [4]: np.diff(ts)
Out[4]: array([0.001, 0.001, 0.001, ..., 0.001, 0.001, 0.001])
In [5]: np.diff(ts)-1/1000
Out[5]:
array([-1.07102096e-11, -1.07102096e-11, -1.07102096e-11, ...,
-1.07102096e-11, -1.07102096e-11, -1.07102096e-11])
In [6]: np.mean(np.diff(np.arange(0, 10, 1/100))==1/100)
Out[6]: 0.002 # i.e. only 0.2% of the timestamps actually get left out
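Continuing the session above, a tolerance-based comparison (numpy.isclose; not part of the current code) recovers all deltas, confirming that the losses are purely an artifact of exact floating-point equality:
In [7]: np.mean(np.isclose(np.diff(ts), 1/1000))
Out[7]: 1.0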
Proposal: new chunk type
My proposal for a new chunk type therefore looks as follows:
[Tag 7, uint16] [streamid, uint32] [n_samples uint32] [n_channels uint32] [typeid uint8] # chunk header
[timestamps, n_samples * float64] # timestamps, 0: left out
[samples, n_channels * n_samples * channel_type] # sample data
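For illustration, a minimal Python sketch of serializing one such chunk, following the field order of the layout above (write_chunk7 is a hypothetical helper; the XDF chunk length prefix that precedes the tag is omitted, and typeid is assumed to come from a format-defined type table):

import struct
import numpy as np

def write_chunk7(f, streamid, timestamps, samples, typeid):
    # samples: (n_samples, n_channels) array in the stream's channel type
    n_samples, n_channels = samples.shape
    f.write(struct.pack('<HIIIB', 7, streamid, n_samples, n_channels, typeid))
    f.write(np.ascontiguousarray(timestamps, np.float64).tobytes())
    f.write(np.ascontiguousarray(samples).tobytes())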
Each chunk would require at most 3 reads:
streamid, n_samples, n_channels, typeid = struct.unpack('<IIIB', f.read(13))
stream = stream_info[streamid]
timestamps = np.frombuffer(f.read(n_samples * 8), np.float64)
data = np.frombuffer(f.read(n_samples * stream.samplebytes), stream.dtype).reshape(n_samples, n_channels)
The chunk includes the sample count, channel count, and data type for increased robustness (errors can be detected earlier) and so that the data can be read without an XML parser.
For the timestamp array, a value of 0 (0x0000000000000000 in IEEE 754 representation) indicates a left-out timestamp that would be recomputed as usual. Since this chunk format stores timestamps and sample values as contiguous, homogeneous arrays and reduces the number of reads / writes, it is also more compression-friendly, so a gzip-compressed array requires less space.
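A hedged sketch of that recomputation on the reader side, assuming the usual rule of carrying the previous timestamp forward by the nominal sampling interval (fill_timestamps is a hypothetical helper):

import numpy as np

def fill_timestamps(ts, srate):
    # 0.0 marks a left-out timestamp; replace it with the previous
    # timestamp plus the nominal sampling interval (1/srate)
    ts = np.array(ts, dtype=np.float64)  # copy; np.frombuffer arrays are read-only
    for i in range(1, len(ts)):
        if ts[i] == 0.0:
            ts[i] = ts[i - 1] + 1.0 / srate
    return ts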
Implementation
The writer is implemented in https://github.com/labstreaminglayer/App-LabRecorder/tree/xdfwriter_chunk7. The main serialization code looks as follows:
_write_chunk_header(chunk_tag_t::samples_v2, len, &streamid); // tag, chunk length, stream id
write_little_endian(file_, n_channels); // remaining header fields
write_little_endian(file_, n_samples);
write_little_endian(file_, sampletype);
write_sample_values(file_, timestamps); // n_samples timestamps
write_sample_values(file_, chunk, n_samples * n_channels); // sample data
The reader is implemented in https://github.com/tstenner/xdf/commit/412998f3ed66cc7e85a93d77a1d7c96f81073ff0
Benchmark (more below)
A single XDF file with one 64-channel EEG (double) stream was created with 30 chunks, each containing 10K samples.
         Reading (xdfz)   Reading (xdf)
Chunk 3  3.094 s          1.704 s
Chunk 7  0.961 s          0.399 s
In both cases, the files with the new chunks are smaller (156.3 vs. 156.0 MiB uncompressed; 36.0 vs. 34.9 MiB compressed with gzip -6).
Open questions
[ ] layout for string chunks
[ ] optional padding so data starts at page boundaries (e.g. for mmap)?
[ ] should the redundant type index be included?