[MINC-development] Draft proposal for MINC 2.0

Peter NEELIN minc-development@bic.mni.mcgill.ca
Sun, 5 Jan 2003 23:28:17 -0500


It is good to see a grand plan for things that I have long dreamed
of doing.

Some thoughts on John and Leila's proposal (and David Gobbi and Andrew
Janke's responses):

1) NetCDF itself imposes some serious restrictions on what extensions
   can be added. The blocked structure for large datasets should fit
   well with the NetCDF model, but any bit-compression scheme will
   not, and a more complex organization (colour-transform/
   wavelet/bit-compression combination) would make any NetCDF scheme
   horribly unintelligible, I think. One can, of course, store
   everything in a single-dimensional byte variable (a file in a file,
   effectively), but this seems like a big departure from NetCDF
   thinking.

   I spent a fair bit of time worrying over this issue and I initially
   felt that a scheme could be fitted on top of NetCDF, but now I'm
   not so sure (perhaps it was denial - any change in underlying
   format could be very time-consuming). I think that it is at least
   worth investigating HDF to see if its combination of a very general
   data API with a simpler NetCDF-like API can satisfy all needs,
   especially since it can read (but not write) NetCDF files. (Sorry
   Andrew, but I think that someone should at least look into it.)

2) David asked why good compression has to be incompatible with
   random access and I think that it is a good question. Someone has
   actually married zlib (I think) to NetCDF to get on-the-fly
   compression with nearly-random access (and files are never truly
   random access anyway). The question of course is how big the
   compression chunks should be, with a tradeoff between speed/memory
   and randomness of access. I suspect that one approaches an
   asymptote in compression with fairly small chunk sizes. The
   problems introduced by disk decompression should not be ignored
   (beyond those raised by David about speed are the more basic ones
   of available disk space, disk management, etc.). I never did really
   understand why the NetCDF folks did not want to incorporate the
   zlib changes. They did give arguments (probably in the FAQ), but I
   don't recall them being compelling.

   Of course, standard lossless compression schemes are only the tip
   of the compression iceberg, and as I mentioned above, I don't think
   that NetCDF is particularly compression-friendly.
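   To make the chunked-compression idea concrete, here is a minimal
   Python sketch (the function names and the chunk size are my own
   inventions, not anything from NetCDF or the zlib patch): data is
   compressed in fixed-size, independent chunks, so a single byte can
   be read back by decompressing only the chunk that contains it.

```python
import zlib

CHUNK = 4096  # chunk size in bytes; a hypothetical tuning knob


def compress_chunked(data, chunk=CHUNK):
    """Compress data in independent chunks so that any one chunk can
    be decompressed without touching the others."""
    return [zlib.compress(data[i:i + chunk])
            for i in range(0, len(data), chunk)]


def read_byte(chunks, offset, chunk=CHUNK):
    """Near-random access: decompress only the chunk holding offset."""
    block = zlib.decompress(chunks[offset // chunk])
    return block[offset % chunk]
```

   The speed/compression tradeoff mentioned above lives entirely in
   CHUNK: smaller chunks mean less wasted decompression per access,
   larger chunks give zlib more context to compress with.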

3) Having agreed with David on one point, I must now disagree with him
   on another. I have always felt that supporting separate scaling was
   important, since it made it possible for one to process large
   volumes with very little memory. The original MINC model of one
   scaling per slice matched the simple voxel organization, but more
   generally the notion is one scaling per chunk of data, however that
   might be defined. That said, it is not necessary to always have a
   separate scaling per slice. In fact, volume_io applications write
   out one scale per volume. It is simply the generic MINC
   applications, which try to avoid being memory hogs, that write out
   separate scales per slice. In cases where one has the memory and it
   makes more sense to have a single scale, do it. But should one
   impose the requirement of either lots of memory, or a two-pass
   approach (compute, then re-normalize) on all file-writing
   applications?
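   A toy version of per-chunk scaling, to make the memory argument
   concrete (hypothetical names; this is an illustration, not the
   actual MINC image-min/image-max mechanism): each slice is
   normalized independently in a single pass, with its own scale and
   offset stored alongside, so the writer never needs to know the
   global volume range.

```python
def pack_slice(slice_vals, vmin=0, vmax=255):
    """Scale one slice of real values into the integer voxel range
    [vmin, vmax], returning (voxels, scale, offset) such that
    real = voxel * scale + offset.  One pass over one slice; no
    global range needed."""
    lo, hi = min(slice_vals), max(slice_vals)
    scale = (hi - lo) / (vmax - vmin) if hi > lo else 1.0
    offset = lo - vmin * scale
    voxels = [round((v - offset) / scale) for v in slice_vals]
    return voxels, scale, offset


def unpack_slice(voxels, scale, offset):
    """Recover real values from scaled integer voxels."""
    return [v * scale + offset for v in voxels]
```

   A single-scale-per-volume writer would do the same arithmetic, but
   only after a first pass (or a full in-memory volume) has
   established the global lo and hi.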

4) Perhaps I'm dense, but I did not quite get the 2N-1 dimension argument
   for blocked data structure (page 6). If I have a file that has
   dimensions (zspace,yspace,xspace) of sizes (1024,1024,1024), then
   the 2N structure is (zspace2,yspace2,xspace2,zspace1,yspace1,xspace1)
   with sizes (64,64,64,16,16,16), for example. Going to 2N-1 in the
   manner proposed (as I understand it), collapsing the
   fastest-varying dimension, would give
   (zspace2,yspace2,zspace1,yspace1,xspace1) with sizes
   (64,64,16,16,1024) which still requires loading the whole volume to
   get a sagittal slice. However, if one collapses the slowest-varying
   dimension, one gets (yspace2,xspace2,zspace1,yspace1,xspace1) with sizes
   (64,64,1024,16,16), and loading any orientation would require only
   loading the neighbourhood data. Looking at this again, I suppose
   that one could also do (zspace2,yspace2,xspace2,zspace1,yspace1) with
   sizes (64,64,1024,16,16). Perhaps this is what was meant.

   But after all of this, what is the advantage of dropping a
   dimension (it is certainly less obvious to the casual reader)?
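   The index arithmetic behind the full 2N blocked structure can be
   sketched as follows (block edge of 16, as in the 1024^3 example;
   the function names are mine): a voxel index maps to outer block
   coordinates plus an offset within the block.

```python
BLOCK = 16  # inner block edge length (the 1024^3 example above)


def to_blocked(z, y, x, block=BLOCK):
    """Map a voxel index (z,y,x) to the 2N blocked index
    (z2,y2,x2,z1,y1,x1): outer block coordinates followed by the
    offset within the block."""
    return (z // block, y // block, x // block,
            z % block, y % block, x % block)


def from_blocked(z2, y2, x2, z1, y1, x1, block=BLOCK):
    """Invert to_blocked: recover the plain voxel index."""
    return (z2 * block + z1, y2 * block + y1, x2 * block + x1)
```

   A 2N-1 layout would merge one outer/inner pair back into a single
   full-length dimension, which is where the orientation question
   above comes from.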

5) The discussion in the proposal on wavelet compression raises the
   potential problem of the cost of modifying a single voxel in a
   file (i.e. read/write access). Although minc was originally designed
   to allow read/write access, we have virtually no applications that
   do this. Unless something significant changes (and perhaps large
   volumes prefer to be modified in place, although I have my doubts
   about the advisability of this), this worry seems to be about
   something that virtually never happens. The only type of volume
   that seems to be likely to be read/write is the label volume and
   I do not think that this is a particularly good candidate for
   wavelets. Furthermore, any interactive application that wants a
   working volume backed by a file would simply be well advised to not
   make the backing file have wavelet compression.

   From what I have seen, the real problem with a wavelet structure is
   that the decompression at full-resolution can be slow compared to
   an uncompressed file. This might mean that wavelet-compressed files are
   not good candidates for doing lots of automated
   computation. However, one must compare the decompression time to
   the subsequent computation that will be done - I suspect that the
   decompression would be small compared to the other calculations.

6) The arguments about data-type issues (page 9) seem to assume that
   caching will always be done. The cost of caching is not negligible,
   since a certain amount of index checking must be done on every
   lookup (remember that the calculation goes back to the volume
   whenever it needs a neighbourhood pixel, so this is not just a read
   problem, but a computation problem). If you do not do caching then
   for memory reasons one might want the internal volume
   representation to be different from the user representation and the
   file representation.
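   A toy version of the per-lookup cost being described (hypothetical
   names, not the volume_io API): every voxel access pays index
   arithmetic plus a cache-membership test, even when the block is
   already resident, which is exactly why neighbourhood-heavy
   computations feel it.

```python
def make_cached_getter(read_block, block_size, cache):
    """Return a voxel accessor backed by a block cache.  The point of
    the sketch is the fixed per-access overhead, not the cache
    policy."""
    def get(i):
        b, off = divmod(i, block_size)  # index arithmetic on every call
        if b not in cache:              # cache check on every call
            cache[b] = read_block(b)    # fetch only on a miss
        return cache[b][off]
    return get
```

   If caching is dropped, that per-access overhead disappears, but
   then the whole volume must be resident, which is the memory-driven
   argument for letting the internal representation differ from the
   user and file representations.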

7) One thought related to complex numbers (pages 7-8): The current
   implementation of complex numbers (which is admittedly poorly
   supported) stores them with vector_dimension. Volume_io does
   not handle vector data very well since it uses a pointer per row
   (the fastest-varying dimension). For short vectors and small types,
   this can mean more pointer than data. This issue should be
   addressed in some way if greater use will be made of vector data in
   the future.
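   The pointer-versus-data arithmetic can be made explicit (assuming
   8-byte pointers on a 64-bit machine; an illustration, not
   volume_io's actual layout): for a 3-component byte vector, the row
   pointer alone is 8 bytes against 3 bytes of data.

```python
PTR = 8  # bytes per pointer on a 64-bit machine (assumption)


def row_overhead(vector_len, elem_size, ptr=PTR):
    """Fraction of memory spent on the row pointer when each
    fastest-varying row gets its own pointer."""
    data = vector_len * elem_size
    return ptr / (ptr + data)
```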

8) One of the issues that was not clear in my mind was the mechanism
   for deciding on output file structure in generic tools. If I want
   to do some volume arithmetic (minccalc) on a block-structured file,
   should I get a new block-structured file? What if the file has
   wavelet compression? Or should the output always have standard
   voxel structure unless I specify a set of output structure options?
   But then every application will need to have
   file-structure-related options added whenever a new structure is
   added.

   Related to this is the question of how much an application needs to
   know about file structure. Does the application need to have
   intimate knowledge of the output structure (e.g. doing
   wavelet-related things) or should that all be in the library? If
   so, then how does the user control this? (Weird file names,
   perhaps? Ughhh.) Can one create a completely general API that
   allows the application programmer to be unaware of different file
   structures? Can one also provide an API that gives complete control
   to applications that really do want to know about the structure
   (e.g. a wavelet streaming application)? What would all of this look
   like?

9) David raised a question about label and voxel data co-existing. I
   did not read that in the proposal, so perhaps the interpretation
   depends on your pre-disposition. I do not favour putting label data
   in with continuous voxel data, even as separate variables, but
   rather treating the two types of data as equivalent and putting
   them in separate files. Normally, the user would manage the two
   files separately (and they are separate pieces of information that
   are likely to go in different directions), with the advantage that
   tools made for continuous voxel data can be used on label data (and
   label data quickly turns into continuous data with blurring,
   etc.). Alternatively, an application would do the multiple-file
   management on behalf of the user. I think that the disadvantages of
   separating the types of data would outweigh the gain.

   (Re-reading the above, I just want to make clear that I support the
   originally proposed notion of an identifier for label volumes so
   that discrete and uninterpolable data can be properly handled. I
   just do not like the idea of making them completely different. The
   user should be able to treat them in similar ways, but applications
   would sometimes have to handle them differently.)

   That said, I can see the problems that arise in the continuum from
   geometric data to continuous voxel data. One can imagine wanting to
   incorporate ROI information (polyhedral data, for example) in a
   MINC file. But label data is really just another form of ROI
   data. And then a fuzzy label volume is just another form of label
   data. So where's the line? Should MINC incorporate all forms of
   label data, geometric and voxel-based?

   I think that the answer is to have another level of format to
   handle the most general case of label data, that could support
   polyhedra, meshes, voxel grids, etc. This format would be
   able to include MINC data as part of a "scene". Ideally, one would
   marry the MINC format with this format, but the danger is ending up
   with an unimplemented monster. The simplest route is to sidestep
   the problem by deferring it to another, meta, format (that may
   never get implemented, but at least it would not derail the more
   focused MINC effort).

10) Andrew raised the question of Mac OS X support. To my knowledge,
    no Mac OS has ever been officially supported by NetCDF (perhaps
    that has changed with OS X). However, building MINC and its
    command-line friends should not be a big problem. Is the issue one
    of windowing system rather than OS? Should the official MINC
    viewing tools (whatever that means) support X11, Windows and
    whatever-OS-X-calls-its-windowing-system?

11) Andrew makes a good point about conversion. However, I'm not sure
    that the volume_io way is really the best route (all applications
    read every format). I still think that the converter route makes
    life simpler (especially when you add a new format in the world of
    static linking - and beware of dynamic linking if you live in the
    world of software quarantines).

    Still, putting data-conversion into the plan is a good thing,
    since it is usually the first step (and a very non-trivial
    one). The world has simplified itself considerably in the past few
    years, so good DICOM, Analyze (and maybe Interfile?) converters
    would go a long way to making it easier to spread the use of MINC.

12) Would it make sense to develop a rough API definition before
    having a meeting so that the discussions can be more specific? It
    is often easy to have general discussions that talk about general
    principles but that do not lead to specific design. Also, I have
    found that people are more sensitive to potential problems when
    they can translate an API into their own context. General
    principles are often too distant for the omissions to show. John,
    Leila, got the time?


            Peter
----
            Peter Neelin (neelin@bic.mni.mcgill.ca)