Re: [buildcheapeeg] EDF and DDF file formats

From: Jim Peters (jim_at_uazu.net)
Date: 2002-03-12 11:17:39


Sar Saloth wrote:
> EDF and DDF look perfect for putting the records into storage.
>
> What about an XML based two-way stream (ignoring SOAP etc.. since it
> is already a totally dedicated channel) and putting the actual
> channel data in a kind of escape. I would leave it binary even if
> that isn't purely XML compatible. EDF or DDF should be OK for the
> binary data format.

Requirements to make this real:

- We need a suitable XML library that will build on both MinGW and
Linux. Probably there is one, but someone needs to look at this.

- Someone will have to write a class or library to read/write/seek
around this kind of file.

I don't personally have experience with XML processors, but if someone
really knows that stuff, then they have a job!

If we're working from EDF, then I'd like to see this additional option
(this could perhaps be added later if necessary):

- EDF allows only 16-bit signed integers. I would also like the
option of storing 32-bit floats, just for future expandability
(e.g. the 18-bit or 24-bit systems that have been discussed).

The chunked binary format looks usable. Even if the channels have
different sampling rates, they should all be derived from a single
clock, so each channel would always contribute the same number of
samples to every chunk interval. (True?)
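To make the single-clock assumption concrete, here is a small Python sketch that checks whether every channel yields a whole number of samples per chunk. `samples_per_chunk` is a hypothetical helper, and the rates used below are made-up values rather than anything agreed on the list.

```python
from fractions import Fraction

def samples_per_chunk(rates_hz, chunk_secs):
    """Return the sample count per chunk for each rate.

    Raises ValueError if any rate does not give a whole number of
    samples in one chunk, which is the case the format cannot store.
    """
    counts = []
    for rate in rates_hz:
        n = Fraction(rate) * Fraction(chunk_secs)  # exact arithmetic
        if n.denominator != 1:
            raise ValueError("%s Hz does not fit a %s s chunk" % (rate, chunk_secs))
        counts.append(int(n))
    return counts
```

For example, channels at 256, 128 and 32 Hz with a quarter-second chunk give 64, 32 and 8 samples per chunk, whereas 100 Hz with a 1/8 s chunk fails the check.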

Thinking of the files I've been working with from Jim-M, there will
still be data loss when converting from the serial format to this
chunked binary format, because there is no way to represent sync loss.
In some of Jim-M's files there is sync loss, but that isn't a reason
to discard the whole file (especially if it is an important session).
Perhaps having an error channel which stores just one bit per sample
would do the job -- each bit is 0: no errors, 1: sync or other error.
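A minimal sketch of such a one-bit-per-sample error channel, in Python. It assumes the flags are packed eight to a byte with the first sample in bit 0; the function names are hypothetical.

```python
def pack_error_bits(flags):
    """Pack booleans into bytes, 8 per byte, first flag in bit 0.

    0 means no error, 1 means sync loss or some other error.
    Unused bits in the final byte are left as 0.
    """
    flags = list(flags)
    out = bytearray((len(flags) + 7) // 8)
    for i, bad in enumerate(flags):
        if bad:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def unpack_error_bits(data, count):
    """Recover the first `count` flags from packed bytes."""
    return [bool((data[i // 8] >> (i % 8)) & 1) for i in range(count)]
```

A reader that knows how many samples the other channels hold per chunk can pass that number as `count`, so the unused trailing bits are never looked at.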

Another question on the subject of XML/etc is whether it is possible
to edit the descriptive session data after the recording is made (e.g.
immediately after the session ends). In EDF, everything is in a fixed
format, so it is possible to rewrite the initial part of the file
after it has been recorded. However, using XML, you would have to
rewrite the whole file (probably streaming from the original to a
temporary, and then renaming). These files may be huge, remember.
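As an illustration of why a fixed-layout header is cheap to edit, the following Python sketch overwrites a single fixed-width field in place without touching the rest of the file. `rewrite_fixed_field` is a hypothetical helper; the offset and width in the usage note (an 80-byte patient field starting at byte 8) are assumed from the published EDF header layout.

```python
def rewrite_fixed_field(path, offset, width, text):
    """Overwrite one fixed-width ASCII header field, space-padded.

    The file length never changes, so the sample data that follows
    the header is left exactly where it was.
    """
    raw = text.encode("ascii")
    if len(raw) > width:
        raise ValueError("text does not fit in the field")
    with open(path, "r+b") as f:  # read/write, no truncation
        f.seek(offset)
        f.write(raw.ljust(width, b" "))
```

Something like `rewrite_fixed_field("session.edf", 8, 80, "Subject X")` would then update the description however large the recording is. An XML header has no such fixed offsets, which is why it would need the copy-and-rename approach.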

This problem could be avoided by putting all the XML in a separate
file, so we have files in pairs -- the session description, and the
session binary data.

This would also allow a simple program (e.g. BWView) to deal with the
raw session data without having any knowledge of XML. The XML files
could also be collected together to put in Andreas's planned database
system, giving both portability of session files (just copy the pair
of files), and database capability (if that is interesting for someone
to develop further down the line).

I don't know about the XML side of the whole thing (or even whether
XML is good for this), but for the binary bit, we can either use the
EDF format, with most of the header fields blank, or define a
shortened header format for our own purposes. I can define the binary
format if this approach looks useful.

Anything I've missed?

Jim

P.S. I might as well define a possible format for the binary file
while we're discussing it. The following format assumes that we put
most of the descriptive stuff into a separate file (XML based,
perhaps).

[ Giving credit where it is due, this format is based on the EDF thing
and recent discussions, the stuff John Morrison turned up originally
and the comments from Dave about the BioGraph format. ]

All multi-byte values are stored in little-endian order. Note that
there is no guarantee that values are aligned on 2-byte or 4-byte
boundaries, especially in the channel data chunks.

Length  Value
-------------------------------------------------------------------
16      "OpenEEG-1.0", padded with NULs to 16 bytes

4       Length of global header data following (integer, == 12 currently)
4       Number of channels (integer)
4       Bytes per data chunk (integer)
4       Duration of data chunk in seconds (32-bit float)
??      Additional global header data (0 bytes, currently)

4       Length of header data per channel (integer, == 12 currently)
{
  4     Byte-count for this channel per data chunk (integer)
  4     Format (integer):
          's'  2-byte little-endian signed integers
          'f'  4-byte little-endian IEEE-*** floats
          'e'  1-bit error values, packed 8 to a byte, b0->b7
  4     Scaling factor for data (if appropriate, else 1.0) (32-bit float)
  ??    Additional per-channel header data (0 bytes, currently)
} x number of channels

{
  ??    data for first channel
  ??    data for second channel
  ::    (etc)
  ??    data for last channel
  ??    padding if for some reason the data chunk length is
        greater than sum of channel byte-counts
} x (data chunks repeated to end of file)
-------------------------------------------------------------------
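The layout above can be sketched in Python with the standard `struct` module. This is only an illustration under some assumptions the table does not settle: "integer" is taken to mean a 32-bit signed value, the format letter is stored as its ASCII code in that integer, and the names `write_header` and `read_header` are hypothetical.

```python
import struct

MAGIC = b"OpenEEG-1.0".ljust(16, b"\0")

def write_header(f, chunk_bytes, chunk_secs, channels):
    """channels is a list of (byte_count, format_char, scale) tuples."""
    f.write(MAGIC)
    # Global header: its own length (12), then the three fields.
    f.write(struct.pack("<iiif", 12, len(channels), chunk_bytes, chunk_secs))
    f.write(struct.pack("<i", 12))  # length of each per-channel header
    for byte_count, fmt, scale in channels:
        f.write(struct.pack("<iif", byte_count, ord(fmt), scale))

def read_header(f):
    """Return (chunk_bytes, chunk_secs, channels), skipping unknown extras."""
    magic = f.read(16)
    if not magic.startswith(b"OpenEEG-1."):  # check major version only
        raise ValueError("not an OpenEEG major-version-1 file")
    (glen,) = struct.unpack("<i", f.read(4))
    gdata = f.read(glen)  # any bytes past the first 12 are ignored
    nchan, chunk_bytes, chunk_secs = struct.unpack("<iif", gdata[:12])
    (clen,) = struct.unpack("<i", f.read(4))
    channels = []
    for _ in range(nchan):
        cdata = f.read(clen)  # likewise, extra per-channel bytes are skipped
        byte_count, fmt, scale = struct.unpack("<iif", cdata[:12])
        channels.append((byte_count, chr(fmt), scale))
    return chunk_bytes, chunk_secs, channels
```

Because the reader honours the two length fields rather than assuming 12, it should keep working if a later minor version appends more header data.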

Features of this:

- The file type is recognisable from the "OpenEEG..." header, and we
have a version number. I suggest that programs check the major
version number. Files with a different minor version number should
still be readable by any code that understands the major version
format.

- The aim of the format is simply to allow the stored data to be
unpacked into streams of scaled floating-point values. The meanings
and ranges of those data values are not defined here.

- If we add new "format" types, then older programs can just choose to
skip over them without understanding them, as we are storing a
byte-count rather than a sample-count for the per-channel data.
This also means that programs that don't want to bother with reading
an error channel can just skip over that without having to
understand it.

- The scaling factor is to allow us to store, say, 12-bit data in the
16-bit signed data type ('s'), but to scale it correctly to a range
such as -1 to +1 when it is displayed or processed.

- There is room for expanding the format to put more data in the
header at a later date (e.g. for later minor versions of the format)
without breaking older programs, so long as those programs read the
"Length of global header data following" and "Length of header data
per channel" fields, and skip over any bytes where indicated.

- The number of samples per chunk can be calculated from the format
type and bytes-per-chunk value. This works for all but 'e', where
there might be 1-7 unused bits left over at the end of the last byte.
However, an 'e' channel describes the other data streams, so its
sample count can be taken from theirs. (If really necessary we could
put a "Samples per chunk" value in the per-channel header, but I think
it would be redundant).

- I'm not sure if IEEE 32-bit floats are really endian-dependent. But
in any case, we can store them as they are stored on the i486 series
of chips.
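Several of the points above (skipping by byte-count, the scaling factor, and deriving the sample count from the byte-count) can be seen together in a short Python sketch of a chunk decoder. `decode_chunk` and `SAMPLE_FORMATS` are hypothetical names, and the channel descriptions are assumed to be (byte_count, format_char, scale) tuples already read from the header.

```python
import struct

# Format letter -> struct code for the types this reader understands.
SAMPLE_FORMATS = {"s": "h", "f": "f"}

def decode_chunk(chunk, channels):
    """Unpack one data chunk into a list of scaled float lists.

    Channels whose format is not understood (including 'e' here)
    yield None; they are stepped over using their byte-count alone.
    """
    out = []
    pos = 0
    for byte_count, fmt, scale in channels:
        raw = chunk[pos:pos + byte_count]
        pos += byte_count
        code = SAMPLE_FORMATS.get(fmt)
        if code is None:
            out.append(None)
            continue
        size = struct.calcsize("<" + code)
        n = byte_count // size  # samples per chunk for this channel
        values = struct.unpack("<%d%s" % (n, code), raw[:n * size])
        out.append([v * scale for v in values])
    # Any bytes after the last channel are padding and are ignored.
    return out
```

With a scale of 1/2048, 12-bit samples stored as 's' come out in the range -1 to +1. On the float question: `struct.pack("<f", 1.0)` gives the bytes 00 00 80 3f while `struct.pack(">f", 1.0)` gives 3f 80 00 00, so the byte order of stored floats does need to be fixed, and the explicit "<" above is what fixes it as little-endian.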

Anything I've missed?

I can adapt the BWView "file" input code to handle this format, with a
few modifications to the interface.

-- 
Jim Peters (_)/=\~/_(_) jim_at_uazu.net
(_) /=\ ~/_ (_)
Uazú (_) /=\ ~/_ (_) http://
B'ham, UK (_) ____ /=\ ____ ~/_ ____ (_) uazu.net

