Re: [buildcheapeeg] EDF and DDF file formats

From: Sar Saloth (sarsaloth_at_yahoo.com)
Date: 2002-03-12 22:17:39


Please take my comments with the following caveat: my application is not
identical to that of this group; I was trying to see if I could piggy-back
on and/or contribute to the work. It is possible that anything I say may
actually throw the work off, so just disregard whatever doesn't seem
appropriate, although there may be more in common than one would suspect.

I have confused things a bit because I started rambling about a
communications and control channel, when the header clearly referred to a
disk format.

For my useless two cents' worth: what deficiencies do we find with
EDF? What are its disadvantages?

At 03:37 PM 2002-03-12 -0500, you wrote:
>On Tue, 12 Mar 2002 11:17:39 +0000, Jim Peters wrote:
>
>Just a quick preamble to this discussion--I see two levels to our file storage
>of data:
>
>1. Raw output from the biofeedback device, and
>2. Customized storage (EDF, XML, or otherwise) of processed data.
>
>But is there any reason to use the first kind of file (raw storage) for
>anything else other than simulation purposes?

I will be writing for embedded systems with very few hardware
resources. EDF looks OK to me as a file format, but for real-time
communication and control I want to combine data and control. Putting the
actual data (which would be most of the data stream) into true XML would
be inefficient. However, control and status information should be
relatively low bandwidth, so I was thinking that XML would be OK for
those. The best way I could think of to combine them would be to use XML
for the highest level (the outside, or wrapper) and escape to EDF-style
binary records for the actual data.
As an important note, I was not thinking of implementing a full XML parser
with a complex nested format. It is possible to maintain XML syntax with
an essentially "flat" structure. I don't know the proper software term for
flat, but I mean something that wouldn't have nested records (nodes?). If
the descriptors are kept small and the order is strictly specified, the
complexity of an embedded in-line parser would be small.
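For illustration only, here is a minimal sketch in Python (the embedded version would presumably be C) of the kind of flat, strictly ordered parsing I mean. The tag names, the fixed tag order, and the regex approach are my own invention, not any agreed-upon format:

```python
import re

# Hypothetical flat control message: a small, fixed, strictly ordered set
# of non-nested tags, so a simple linear scan suffices; no tree, no
# recursion, no full XML parser.
CONTROL_TAGS = ("rate", "gain", "status")
TAG_RE = re.compile(r"<(\w+)>([^<]*)</\1>")

def parse_flat_control(message):
    """Parse a flat (non-nested) XML-style control message into a dict.

    Raises ValueError on unknown tags or tags out of the specified
    order, which is what keeps the embedded-side parser trivial.
    """
    fields = {}
    order = []
    for match in TAG_RE.finditer(message):
        name, value = match.group(1), match.group(2)
        if name not in CONTROL_TAGS:
            raise ValueError("unknown control tag: %s" % name)
        fields[name] = value
        order.append(name)
    # Strictly specified order: tags must appear in the declared sequence.
    expected = [t for t in CONTROL_TAGS if t in fields]
    if order != expected:
        raise ValueError("control tags out of order")
    return fields

# Example: a low-bandwidth control message from host to device.
settings = parse_flat_control("<rate>256</rate><gain>8</gain><status>ok</status>")
```

Because the structure is flat and the order fixed, the parser needs no stack and no lookahead, which is the whole point for a minimal device.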
Most of the data flow from the device would end up being binary.
Reason ---> I want to handle multiple devices and types of devices without
significantly impacting the architecture of the data-collection
program. The routine that talks to the physical channel could simply
separate the control from the data and pass each on to the correct
module. The only reason for the XML in the data and control (in the RAW
stream) is the complexity of programming control and status states for
devices. One device I made had a control stream of about 200 bits. The
programmers made (and continue to make) many errors trying to control the
machine. I vowed that in the future I would leave the low-level
bit-twiddling to low-level people and let the programmers have simple,
human-readable text commands.
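As a sketch of that separation, here is one way the channel routine could demultiplex control from data. The framing (a type byte plus a 2-byte little-endian length prefix) is entirely my invention for illustration, not anything a real device uses:

```python
import struct

# Assumed framing (my invention, not from any spec): each record starts
# with a type byte, b'C' for XML-style control text or b'D' for binary
# data, followed by a 2-byte little-endian length, then the payload.
def demux(stream):
    """Split a raw byte stream into ('control', text) and ('data', bytes)
    records, so downstream modules never see each other's traffic."""
    records = []
    pos = 0
    while pos + 3 <= len(stream):
        kind = stream[pos:pos + 1]
        (length,) = struct.unpack_from("<H", stream, pos + 1)
        payload = stream[pos + 3:pos + 3 + length]
        pos += 3 + length
        if kind == b"C":
            records.append(("control", payload.decode("ascii")))
        elif kind == b"D":
            records.append(("data", payload))
        else:
            raise ValueError("unknown record type")
    return records

# Example stream: one control record followed by one binary data record.
raw = (b"C" + struct.pack("<H", 19) + b"<status>ok</status>"
       + b"D" + struct.pack("<H", 4) + b"\x01\x02\x03\x04")
records = demux(raw)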

> Jim-M, I recall you mentioning
>at one point how the raw output from the device was important, but I do not
>recall in what way you used this? Was it only for after-the-fact analysis?
>
>I can also see it useful in that it would be helpful for other developers who
>do not have biosensory equipment yet to work on code, but beyond that, does it
>have any further purpose?
>
>I only mention this because right now our discussion is focused on the second
>of the two points above, and that if we are going to keep the raw data stream
>coming from the biodevice encoder, then that information (as to what the
>encoder is, ProComp, BrainMaster, Jeorg's, Andreas', etc.) will need to be
>stored in our data files for the session also so that the raw file can be read
>by the correct Device class.

Isn't that kind of yucky from an architectural point of view? Wouldn't it
be nicer if there were a bit of code that talked to the physical device and
then put things into a common format? The software that I currently deal
with supports more than five different devices, each with its own control
and communications requirements. If you push all of that into the disk
file, you will have a software mess IMHO (but hey, I am not a programmer).

> >Requirements to make this real:
> >
> >- We need a suitable XML library that will build on both MinGW and
> > Linux. Probably there is one, but someone needs to look at this.

There are many libraries for such things. However, I wanted something that
would work on a very minimal device. I have already discussed with some
XML consultants (at http://www.web2xml.com ) how to write a very small
parser for a simplified XML.

> >
> >- Someone will have to write a class or library to read/write/seek
> > around this kind of file.

?Standard? methods for working with XML do that; it is essentially
traversing a tree.
As far as the control format goes, there wouldn't be anything to
traverse. Using the term XML really just describes the lowest-level syntax
for mixing control and data. A disk format is a totally different thing - I
apologize for confusing things.

>I am attracted to keeping session data (client name, electrode placement,
>annotation marks, etc.) separate from biosensory data for a variety of
>reasons.
> Storing immense amounts of time-series related data in the ASCII format
> of XML
>seems an odd thing to me, especially when you start thinking in terms of
>navigating that file for playback purposes.

If you look at the EDF documents, I think you will see that storing those
annotations and that information does not add a terrible amount to file
size unless the files are very short. They have covered the things you
mentioned nicely.

> I also think that session
>information might be more fluid, at least in terms of evolution and what we
>decide to add, whereas (hopefully), the storage of the time-series related
>data
>will not be. In that sense, XML can be used for session data, and a binary
>format for the time-series data.

Besides annotations, what sorts of session data are important? I think most
typical requirements are mentioned in the EDF specification.

> I, too, have never worked with XML, but it
>sounds like a good option for our session data and I'm sure that due to its
>popularity that there are portable libraries available.
>
>Now; quick question about playback of previously recorded data. From what I
>can tell of BioGraph, they grab the entire session's data and hold it in RAM.
>I like your method, Jim-P, where it looks like you grab a buffered "chunk"
>from
>the file in BWView and then only grab another chunk if you need it as you move
>forward and backwards through the data. However, if we felt that it was
>"affordable" in terms of RAM resources, it would be easier to index a stored
>array (vector, or whatever) as well as faster than trying to seek and
>reposition into a data file, and, if the amount of data exceeds the
>current RAM
>resources of the machine, let the swapper take care of the rest. I dunno.. I
>may be just being lazy, too. :)
>
> >I don't personally have experience with XML processors, but if someone
> >really knows that stuff, then they have a job!
> >
> >If we're working from EDF, then I'd like to see this additional option
> >(this could perhaps be added later if necessary):
> >
> >- EDF allows only 16-bit signed integers. I would also like the
> > option of storing 32-bit floats, just for future expandability
> > (e.g. the 18-bit or 24-bit systems that have been discussed).

One of the advantages of encapsulating the data and control stream in XML
would be the simplicity of identifying a different binary data format. As
soon as you go outside of EDF, the current software wouldn't work anyway,
so anything should be OK as long as everyone agrees. Does that make
sense? However, the name would need to be changed FROM EDF to something else.

> >
> >The chunked binary format looks usable, as hopefully even if we have
> >different sampling rates, they should all be based on a single clock,
> >so there would always be the same number of samples in a chunked
> >interval. (True?)
>
>About the only snag I can think of is if we ever support more than one device
>simultaneously. This was something that was always at the back of my
>mind, but
>as I think about it more, it might simply add a level of complexity that
>is not
>worth it. My thoughts were that it might be hard to get all the modalities in
>one place that you want to work with. I chose the ProComp+ simply because it
>supported several types of modalities--from EEG to HR to respiration,
>etc. But
>the thing is pricey, and if there were a less expensive solution to using one
>or more devices that achieved the same purpose, then I was going to explore
>that.
>
>But wouldn't these samples coming from multiple devices still be based on the
>same "chunked interval" based on the computer's clock? In that sense, I was
>thinking that the data would be stored and recorded as multiple channel data,
>regardless if they came from one device or ten.

Correct, this is a huge complexity to add to the software, especially for
CHEAP systems. If you use different pieces of hardware, even if they have
the same nominal clock frequency, in the absence of a specific method of
synchronizing samples there will be a difference in the sample rates. One
simplification for this problem is that usually only one piece of hardware
collects the high-rate information and the other pieces collect the
low-rate information. In that case, resampling won't cause much error in
the slow signals as long as the resynchronizing adds only the jitter
equivalent to the fast sample rate and NOT the slow one.
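The low-rate resampling idea can be sketched as a zero-order hold onto the fast clock, where the alignment error is bounded by one fast sample period rather than one slow period. The rates and timestamps below are illustrative only:

```python
def resample_to_fast_clock(slow_times, slow_values, fast_times):
    """Map each fast-clock sample time to the most recent slow-channel
    value (zero-order hold).  The alignment jitter is then bounded by
    one fast sample period, not one slow period.  Assumes both time
    lists are sorted ascending and slow_times[0] <= fast_times[0]."""
    out = []
    j = 0
    for t in fast_times:
        # Advance to the latest slow sample at or before time t.
        while j + 1 < len(slow_times) and slow_times[j + 1] <= t:
            j += 1
        out.append(slow_values[j])
    return out

# Illustrative: a 1 Hz slow channel held onto a 2 Hz fast clock.
slow_t = [0.0, 1.0, 2.0]
slow_v = [10, 20, 30]
fast_t = [0.0, 0.5, 1.0, 1.5, 2.0]
aligned = resample_to_fast_clock(slow_t, slow_v, fast_t)
```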

That would be one of those cases of 90% of the effort for 1% of the
applications (my guess).

> >Thinking of the files I've been working with from Jim-M, there will
> >still be data loss when converting from the serial format to this
> >chunked binary format, because there is no way to represent sync loss.
> >In some of Jim-M's files there is sync loss, but that isn't a reason
> >to discard the whole file (especially if it is an important session).
> >Perhaps having an error channel which stores just one bit per sample
> >would do the job -- each bit is 0: no errors, 1: sync or other error.

It is necessary to be able to mark data as bad, whether it is due to the
communications channel or something else (such as a lead falling off). I
thought there was provision in the EDF specification for that, but I
couldn't find it on this re-read of the specification. I can think of a
few ways to handle this.
1. If the amount of data lost is undefinable, then the points at which
loss of sync occurs really demarcate separate files. In that case, would
DDF not do the job?
2. If a sample is bad, either the suggested method, or how about this
suggestion: choose a number outside of the maximum or minimum binary level
to signify a bad sample. Of course that means that with a 16-bit converter
you would have to leave one or two of the 65536 codes unused. I was
thinking of leaving two unused codes, one for a bad sample and one for
lead-off. Or would lead-off be better handled by the annotation stream?
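Suggestion 2 might look like this. The specific reserved codes below are arbitrary choices of mine for illustration, not anything defined in EDF:

```python
# Hypothetical convention (my suggestion, not part of EDF): reserve two
# of the 65536 codes of a 16-bit converter as out-of-band markers.
BAD_SAMPLE = 32767   # 0x7FFF: sample corrupted (e.g. sync loss)
LEAD_OFF   = 32766   # 0x7FFE: electrode lead detached
MAX_VALID  = 32765   # valid samples must clip at this ceiling

def classify_sample(code):
    """Return ('ok', value) for a valid sample, or a marker name."""
    if code == BAD_SAMPLE:
        return ("bad", None)
    if code == LEAD_OFF:
        return ("lead_off", None)
    return ("ok", code)
```

The cost is that the converter's top one or two codes can never carry real data, so the acquisition side has to clip at MAX_VALID.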

>I have a question which comes from the work I have been doing with the
>ProComp.
> There is no way to know just how much data has been lost when I lose the
> sync
>byte. I might be able to make certain inferences if I assume that only one
>packet set was lost (they transmit a total of 144 bytes in each set, which
>presents 24 samples of EEG data from channels A&B and 3 samples of other
>biosensory data (such as GSR, HR, etc.). But if more than one set is lost,
>then I could not even compute an estimate of how much data might be missing
>from the time of the last sync. Thus, all I could do in the above scenario is
>throw an error bit the moment I realize that I am no longer reading valid
>data,
>wait for the sync and resume. Is that what you had in mind for situations
>such
>as this?

Is anything else reasonable? Does loss of sync mean that the serial port
couldn't keep up? I have been given the impression that modern PCs should
be able to handle 115 kbaud. If the loss is a very rare event, then the
data could reasonably be considered corrupt. If the loss of data is
frequent, don't we have a reliability problem?
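A sketch of the throw-an-error-bit-and-resync behaviour described above, using the 144-byte set size quoted. The sync-byte value here is an assumption of mine, not the ProComp's actual value:

```python
SYNC = 0xAA      # assumed sync-byte value; the real device's differs
SET_LEN = 144    # bytes per packet set, per the figures quoted above

def read_sets(stream):
    """Split a byte stream into packet sets.  On loss of sync, skip
    forward to the next sync byte and flag the gap, since there is no
    way to know how many sets were lost in between."""
    sets, pos, gap = [], 0, False
    while pos + SET_LEN <= len(stream):
        if stream[pos] != SYNC:
            pos += 1          # scan byte-by-byte for the next sync
            gap = True        # raise the error bit for this region
            continue
        sets.append((gap, stream[pos:pos + SET_LEN]))
        gap = False
        pos += SET_LEN
    return sets

# Example: two good sets separated by two junk bytes (simulated sync loss).
good = bytes([SYNC]) + bytes(SET_LEN - 1)
sets = read_sets(good + b"\x00\x00" + good)
```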

> >All multi-byte values are stored in little-endian order. Note that
> >there is no guarantee that values are aligned on 2-byte or 4-byte
> >boundaries, especially in the channel data chunks.
> >
> >Length Value
> >-------------------------------------------------------------------
> >16 "OpenEEG-1.0", padded with NULs to 16 bytes
> >
> >4 Length of global header data following (integer, == 12 currently)
> >4 Number of channels (integer)
> >4 Bytes per data chunk (integer)
> >4 Duration of data chunk in seconds (32-bit float)
>
>What would be our minimum resolution be (if any)? Would we ever have,
>say, < 1
>second, and thus have several "chunks" per second? Is that why you have
>chosen
>to represent this as a floating point value? And what determines this
>duration
>when the file is being stored? Is that something we need to configure via
>software options?

I am aware of applications (although not for EEG) where the data record
would be too large for the EDF limit. Also, for my application a 1-second
"chunk" is very bad, as the desired latency from device to screen should
be much less than that. The specification does state that, as an option,
fractions of a second may be used. I will definitely use data records
smaller than 1 second. I will probably use 0.1 seconds for everything that
I do.
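The arithmetic constraint on short records can be sketched like so. As I understand EDF, each channel must contribute a whole number of samples per data record, so not every rate/duration pair works; the rates below are illustrative:

```python
def samples_per_record(rate_hz, record_s):
    """Samples one channel contributes to each data record.

    EDF-style records store a whole number of samples per channel, so
    the record duration must divide evenly into every channel's rate;
    e.g. 256 Hz with 0.1 s records gives 25.6 samples, which is invalid.
    """
    n = rate_hz * record_s
    if abs(n - round(n)) > 1e-9:
        raise ValueError("record duration must give a whole sample count")
    return int(round(n))
```

So a 0.1 s record works cleanly with 250 Hz (25 samples per record) but not with 256 Hz, which is worth keeping in mind when choosing base rates.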

>Also, could you speak to the issues surrounding the A/D clock and the role it
>plays in all of this? I am fuzzy on this issue, and came across several
>references as I was reading the EDF material, and you have a very good
>grasp of
>this aspect.

For many types of signals, especially any where the frequency content of
the signal is to be analyzed, it is important that the sampling be
precisely periodic. Therefore, the A/D converter needs to be clocked by a
repetitive clock and not by, for example, a command from a desktop PC. For
simplicity of the data stream, for the sanity of anyone working with it,
and also because many systems use a single A/D converter multiplexed
across many channels (or multiple A/D converters on the same clock), every
channel's data will occur either at a certain base frequency or at an
integer sub-multiple of it. (What is the correct term for "sub-multiple"?)
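In other words, with one base clock every channel's rate is the base rate divided by an integer ("submultiple" or "integer divisor" both seem to be used). A trivial sketch, with illustrative rates, keeping each derived rate integral for simplicity:

```python
def channel_rates(base_hz, divisors):
    """Derive per-channel sample rates from one base clock.

    Each channel runs at base_hz / n for an integer n, so all channels
    stay phase-locked to the same clock.  For simplicity this sketch
    also requires each derived rate to be a whole number of Hz.
    """
    for n in divisors:
        if base_hz % n != 0:
            raise ValueError("divisor must divide the base rate evenly")
    return [base_hz // n for n in divisors]

# Illustrative: a 256 Hz EEG channel plus slower auxiliary channels.
rates = channel_rates(256, [1, 2, 8])
```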

>Otherwise, I love the format being fleshed out. The only thing I would add is
>to situate it in a directory structure similar to the one used by
>BioGraph. It
>is a nicely organized way to group data. Perhaps something like this:
>
><root data directory>
> <clientID>
> <yyyymmdd.nnn>
> session data files for one recording
>
>where 'yyyy' is the year, 'mm' the two digit month, 'dd' the two digit day,
>and 'nnn' is a sequence number based on the number of recorded sessions that
>day. This arrangement makes it sort nicely in directory displays based on
>date.
>
>Dave.
>




This archive was generated by hypermail 2.1.4 : 2002-07-27 12:28:40 BST