Re: [buildcheapeeg] EDF and DDF file formats

From: Dave (dfisher_at_pophost.com)
Date: 2002-03-12 20:37:37


On Tue, 12 Mar 2002 11:17:39 +0000, Jim Peters wrote:

Just a quick preamble to this discussion--I see two levels to our file storage
of data:

1. Raw output from the biofeedback device, and
2. Customized storage (EDF, XML, or otherwise) of processed data.

But is there any reason to use the first kind of file (raw storage) for
anything other than simulation purposes? Jim-M, I recall you mentioning at one
point that the raw output from the device was important, but I do not recall
how you used it. Was it only for after-the-fact analysis?

I can also see it being useful for other developers who do not yet have
biosensory equipment and want to work on code, but beyond that, does it have
any further purpose?

I only mention this because our discussion right now is focused on the second
of the two points above. If we are going to keep the raw data stream coming
from the biodevice encoder, then information identifying the encoder (ProComp,
BrainMaster, Jeorg's, Andreas', etc.) will also need to be stored in our data
files for the session, so that the raw file can be read by the correct Device
class.

>Requirements to make this real:
>
>- We need a suitable XML library that will build on both MinGW and
> Linux. Probably there is one, but someone needs to look at this.
>
>- Someone will have to write a class or library to read/write/seek
> around this kind of file.

I am attracted to keeping session data (client name, electrode placement,
annotation marks, etc.) separate from biosensory data for a variety of reasons.
Storing immense amounts of time-series related data in the ASCII format of XML
seems an odd thing to me, especially when you start thinking in terms of
navigating that file for playback purposes. I also think that session
information might be more fluid, at least in terms of evolution and what we
decide to add, whereas (hopefully), the storage of the time-series related data
will not be. In that sense, XML can be used for session data, and a binary
format for the time-series data. I, too, have never worked with XML, but it
sounds like a good option for our session data, and I'm sure that, given its
popularity, there are portable libraries available.

Now, a quick question about playback of previously recorded data. From what I
can tell of BioGraph, they grab the entire session's data and hold it in RAM.
I like your method, Jim-P, where it looks like BWView grabs a buffered "chunk"
from the file and then only grabs another chunk if it needs one as you move
forwards and backwards through the data. However, if we felt that it was
"affordable" in terms of RAM, it would be easier, as well as faster, to index a
stored array (vector, or whatever) than to seek and reposition in a data file.
If the amount of data exceeds the machine's available RAM, let the swapper take
care of the rest. I dunno... I may just be being lazy, too. :)
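To make the trade-off concrete, here is a minimal sketch of the chunk-at-a-time
approach, assuming fixed-size chunks stored back to back after some header. The
class name ChunkReader, its parameters, and the small least-recently-used cache
are all hypothetical; this is not how BWView or BioGraph actually do it.

```python
from collections import OrderedDict


class ChunkReader:
    """Read fixed-size chunks on demand, keeping only the most recent few in RAM."""

    def __init__(self, stream, data_offset, chunk_bytes, max_cached=8):
        self.stream = stream            # any seekable binary stream
        self.data_offset = data_offset  # where chunk 0 starts in the stream
        self.chunk_bytes = chunk_bytes
        self.max_cached = max_cached
        self.cache = OrderedDict()      # chunk index -> bytes, oldest first
        self.reads = 0                  # number of real reads, for illustration

    def chunk(self, index):
        if index in self.cache:
            # Cache hit: mark as most recently used, no seek needed.
            self.cache.move_to_end(index)
            return self.cache[index]
        self.stream.seek(self.data_offset + index * self.chunk_bytes)
        data = self.stream.read(self.chunk_bytes)
        if len(data) != self.chunk_bytes:
            raise IndexError(f"chunk {index} is past the end of the data")
        self.reads += 1
        self.cache[index] = data
        if len(self.cache) > self.max_cached:
            self.cache.popitem(last=False)  # drop the least recently used chunk
        return data
```

Setting max_cached large enough to hold the whole session gives the
"everything in RAM" behaviour through the same interface, so the choice could
be deferred.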

>I don't personally have experience with XML processors, but if someone
>really knows that stuff, then they have a job!
>
>If we're working from EDF, then I'd like to see this additional option
>(this could perhaps be added later if necessary):
>
>- EDF allows only 16-bit signed integers. I would also like the
> option of storing 32-bit floats, just for future expandability
> (e.g. the 18-bit or 24-bit systems that have been discussed).
>
>The chunked binary format looks usable, as hopefully even if we have
>different sampling rates, they should all be based on a single clock,
>so there would always be the same number of samples in a chunked
>interval. (True?)

About the only snag I can think of is if we ever support more than one device
simultaneously. This was something that was always at the back of my mind, but
as I think about it more, it might simply add a level of complexity that is not
worth it. My thought was that it might be hard to find all the modalities you
want to work with in a single device. I chose the ProComp+ simply because it
supported several types of modalities--from EEG to HR to respiration, etc. But
the thing is pricey, and if there were a less expensive solution using one or
more devices that achieved the same purpose, then I was going to explore
that.

But wouldn't the samples coming from multiple devices still fall into the same
"chunked interval" based on the computer's clock? In that sense, I was thinking
that the data would be recorded and stored as multi-channel data, regardless of
whether it came from one device or ten.
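As a rough illustration of that idea, the sketch below groups samples into
chunk intervals by a timestamp taken from the computer's clock. The function
name group_by_chunk and the (timestamp, channel, value) tuple layout are
assumptions made for this example only.

```python
from collections import defaultdict


def group_by_chunk(samples, chunk_seconds):
    """Group (timestamp_seconds, channel, value) samples by chunk interval.

    Returns {chunk_index: {channel: [values...]}}, where chunk_index is the
    number of whole chunk durations elapsed since time zero.
    """
    chunks = defaultdict(lambda: defaultdict(list))
    for t, channel, value in samples:
        chunks[int(t // chunk_seconds)][channel].append(value)
    return chunks
```

One thing this makes visible: when samples are stamped by the computer's clock
rather than counted off a single device clock, nothing guarantees that each
channel contributes the same number of samples to every chunk.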

>Thinking of the files I've been working with from Jim-M, there will
>still be data loss when converting from the serial format to this
>chunked binary format, because there is no way to represent sync loss.
>In some of Jim-M's files there is sync loss, but that isn't a reason
>to discard the whole file (especially if it is an important session).
>Perhaps having an error channel which stores just one bit per sample
>would do the job -- each bit is 0: no errors, 1: sync or other error.

I have a question that comes from the work I have been doing with the ProComp.
There is no way to know just how much data has been lost when I lose the sync
byte. I might be able to make certain inferences if I assume that only one
packet set was lost (they transmit a total of 144 bytes in each set, which
represents 24 samples of EEG data from channels A&B and 3 samples of other
biosensory data such as GSR, HR, etc.). But if more than one set is lost, then
I could not even estimate how much data might be missing since the last sync.
Thus, all I could do in the above scenario is set an error bit the moment I
realize that I am no longer reading valid data, wait for the sync, and resume.
Is that what you had in mind for situations such as this?
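A minimal sketch of that "flag, wait for sync, resume" behaviour follows. The
framing here is entirely hypothetical and is NOT the ProComp's real protocol:
it assumes each packet is one sync byte followed by a fixed-length payload, and
that payload bytes never equal the sync value.

```python
SYNC = 0xFF        # hypothetical sync marker, not the ProComp's real value
PAYLOAD_LEN = 2    # hypothetical payload size per packet


def decode_with_error_flags(stream: bytes):
    """Return a list of (payload, error_flag) pairs.

    error_flag is 1 on the first packet read after sync was lost, because an
    unknown amount of data is missing just before it; otherwise it is 0.
    """
    out = []
    pos = 0
    lost = False
    while pos + 1 + PAYLOAD_LEN <= len(stream):
        if stream[pos] != SYNC:
            # Not where a packet should start: remember the loss and
            # skip forward one byte at a time until sync is seen again.
            lost = True
            pos += 1
            continue
        payload = stream[pos + 1 : pos + 1 + PAYLOAD_LEN]
        out.append((payload, 1 if lost else 0))
        lost = False
        pos += 1 + PAYLOAD_LEN
    return out
```

The flag only records that a gap occurred at that point; it says nothing about
how long the gap was, which matches the situation described above.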

>All multi-byte values are stored in little-endian order. Note that
>there is no guarantee that values are aligned on 2-byte or 4-byte
>boundaries, especially in the channel data chunks.
>
>Length Value
>-------------------------------------------------------------------
>16 "OpenEEG-1.0", padded with NULs to 16 bytes
>
>4 Length of global header data following (integer, == 12 currently)
>4 Number of channels (integer)
>4 Bytes per data chunk (integer)
>4 Duration of data chunk in seconds (32-bit float)

What would our minimum resolution be (if any)? Would we ever have a duration
of, say, < 1 second, and thus several "chunks" per second? Is that why you have
chosen to represent this as a floating point value? And what determines this
duration when the file is being stored? Is that something we need to configure
via software options?
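For reference, the global header fields quoted above could be written and read
back roughly like this. It is only a sketch covering the fields shown in the
quoted table; the function names are made up for illustration, and treating the
integers as unsigned is an assumption (the proposal just says "integer").

```python
import struct

MAGIC = b"OpenEEG-1.0"
# "<" = little-endian with no alignment padding; 16-byte string,
# three 32-bit integers (assumed unsigned), one 32-bit float.
HEADER = struct.Struct("<16sIIIf")


def pack_header(n_channels, chunk_bytes, chunk_seconds):
    # struct pads the magic string with NULs up to 16 bytes.
    # 12 is the length of the three 4-byte fields that follow it.
    return HEADER.pack(MAGIC, 12, n_channels, chunk_bytes, chunk_seconds)


def unpack_header(blob):
    magic, hdr_len, n_channels, chunk_bytes, chunk_seconds = HEADER.unpack(
        blob[: HEADER.size]
    )
    if magic.rstrip(b"\0") != MAGIC:
        raise ValueError("not an OpenEEG-1.0 file")
    if hdr_len < 12:
        raise ValueError("global header shorter than expected")
    return n_channels, chunk_bytes, chunk_seconds
```

With the duration stored as a float, a sub-second chunk such as
pack_header(2, 1024, 0.25) fits without any change to the layout.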

Also, could you speak to the issues surrounding the A/D clock and the role it
plays in all of this? I am fuzzy on this issue. I came across several
references to it as I was reading the EDF material, and you have a very good
grasp of this aspect.

Otherwise, I love the format being fleshed out. The only thing I would add is
to situate it in a directory structure similar to the one used by BioGraph. It
is a nicely organized way to group data. Perhaps something like this:

<root data directory>
    <clientID>
        <yyyymmdd.nnn>
            session data files for one recording

where 'yyyy' is the year, 'mm' the two-digit month, 'dd' the two-digit day,
and 'nnn' is a sequence number based on the number of sessions recorded that
day. This arrangement sorts nicely by date in directory displays.
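A small sketch of building such a path, assuming a three-digit, zero-padded
sequence number. The function name session_dir is hypothetical, and whether
numbering starts at 000 or 001 is left open here.

```python
from datetime import date
from pathlib import PurePosixPath


def session_dir(root, client_id, day, sequence):
    """Build <root>/<clientID>/<yyyymmdd.nnn> for one recording session."""
    if not 0 <= sequence <= 999:
        raise ValueError("sequence number must fit in three digits")
    # PurePosixPath keeps the example output identical on every platform;
    # real code would more likely use pathlib.Path.
    return PurePosixPath(root) / client_id / f"{day:%Y%m%d}.{sequence:03d}"
```

For example, session_dir("data", "client42", date(2002, 3, 12), 1) gives
data/client42/20020312.001, and because the name is zero-padded, a plain
alphabetical sort of the directory is also a chronological one.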

Dave.


