[MINC-development] Draft proposal for MINC 2.0
Peter NEELIN
minc-development@bic.mni.mcgill.ca
Sun, 5 Jan 2003 23:28:17 -0500
It is good to see a grand plan for things that I have long dreamed
of doing.
Some thoughts on John and Leila's proposal (and David Gobbi and Andrew
Janke's responses):
1) NetCDF itself imposes some serious restrictions on what extensions
can be added. The blocked structure for large datasets should fit
well with the NetCDF model, but any bit-compression scheme will
not, and a more complex organization (colour-transform/
wavelet/bit-compression combination) would make any NetCDF scheme
horribly unintelligible, I think. One can, of course, store
everything in a single-dimensional byte variable (a file in a file,
effectively), but this seems like a big departure from NetCDF
thinking.
I spent a fair bit of time worrying over this issue and I initially
felt that a scheme could be fitted on top of NetCDF, but now I'm
not so sure (perhaps it was denial - any change in underlying
format could be very time-consuming). I think that it is at least
worth investigating HDF to see if its combination of a very general
data API with a simpler NetCDF-like API can satisfy all needs,
especially since it can read (but not write) NetCDF files. (Sorry
Andrew, but I think that someone should at least look into it.)
2) David asked why good compression has to be incompatible with
random access and I think that it is a good question. Someone has
actually married zlib (I think) to NetCDF to get on-the-fly
compression with nearly-random access (and files are never truly
random access anyway). The question of course is how big the
compression chunks should be, with a tradeoff between speed/memory
and randomness of access. I suspect that one approaches an
asymptote in compression with fairly small chunk sizes. The
problems introduced by disk decompression should not be ignored
(beyond the speed issues raised by David, there are more basic ones
of available disk space, disk management, etc.). I never did really
understand why the NetCDF folks did not want to incorporate the
zlib changes. They did give arguments (probably in the FAQ), but I
don't recall them being compelling.
Of course, standard lossless compression schemes are only the tip
of the compression iceberg, and as I mentioned above, I don't think
that NetCDF is particularly compression-friendly.
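To make the chunking tradeoff concrete, here is a minimal sketch (plain
Python, with function names of my own invention - this is not the actual
zlib/NetCDF patch) of independently compressed chunks plus an offset
index, which is what gives nearly-random access:

```python
import zlib

def compress_chunks(data: bytes, chunk_size: int):
    """Compress fixed-size chunks independently; record (offset, length)
    for each so that any single chunk can be found later."""
    pieces, index, offset = [], [], 0
    for i in range(0, len(data), chunk_size):
        c = zlib.compress(data[i:i + chunk_size])
        index.append((offset, len(c)))
        pieces.append(c)
        offset += len(c)
    return b"".join(pieces), index

def read_chunk(blob: bytes, index, n: int) -> bytes:
    """Decompress only chunk n - the rest of the file is never touched."""
    off, length = index[n]
    return zlib.decompress(blob[off:off + length])
```

Larger chunks compress better (more context for zlib) but force more
decompression per access; my guess is that the compression ratio levels
off at fairly small chunk sizes.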
3) Having agreed with David on one point, I must now disagree with him
on another. I have always felt that supporting separate scaling was
important, since it made it possible for one to process large
volumes with very little memory. The original MINC model of one
scaling per slice matched the simple voxel organization, but more
generally the notion is one scaling per chunk of data, however that
might be defined. That said, it is not necessary to always have a
separate scaling per slice. In fact, volume_io applications write
out one scale per volume. It is simply the generic MINC
applications, which try to avoid being memory hogs, that write out
separate scales per slice. In cases where one has the memory and it
makes more sense to have a single scale, do it. But should one
impose the requirement of either lots of memory, or a two-pass
approach (compute, then re-normalize) on all file-writing
applications?
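For the sake of concreteness, here is a toy sketch of per-chunk scaling
(plain Python, hypothetical names - not the MINC API): each chunk of
real values gets its own scale and offset mapping it onto the integer
voxel range, so a writer only ever needs one chunk in memory.

```python
def quantize(values, top=65535):
    """Map one chunk of real values onto integers in [0, top] using a
    scale/offset computed from that chunk alone."""
    vmin, vmax = min(values), max(values)
    scale = (vmax - vmin) / top if vmax > vmin else 1.0
    voxels = [round((v - vmin) / scale) for v in values]
    return voxels, scale, vmin

def dequantize(voxels, scale, offset):
    """Recover real values; error is bounded by half the scale step."""
    return [v * scale + offset for v in voxels]
```

With one scale per volume the same code applies, but only after the
global min and max are known - hence the lots-of-memory or two-pass
requirement mentioned above.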
4) Perhaps I'm dense, but I did not quite get the 2N-1 dimension argument
for blocked data structure (page 6). If I have a file that has
dimensions (zspace,yspace,xspace) of sizes (1024,1024,1024), then
the 2N structure is (zspace2,yspace2,xspace2,zspace1,yspace1,xspace1)
with sizes (64,64,64,16,16,16), for example. Going to 2N-1 in the
manner proposed (as I understand it), collapsing the
fastest-varying dimension, would give
(zspace2,yspace2,zspace1,yspace1,xspace1) with sizes
(64,64,16,16,1024) which still requires loading the whole volume to
get a sagittal slice. However, if one collapses the slowest-varying
dimension, one gets (yspace2,xspace2,zspace1,yspace1,xspace1) with sizes
(64,64,1024,16,16), and loading any orientation would require only
loading the neighbourhood data. Looking at this again, I suppose
that one could also do (zspace2,yspace2,xspace2,zspace1,yspace1) with
sizes (64,64,1024,16,16). Perhaps this is what was meant.
But after all of this, what is the advantage of dropping a
dimension (it is certainly less obvious to the casual reader)?
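For the record, the 2N voxel-to-block mapping itself is trivial, which
is part of why I do not see what dropping a dimension buys. A sketch
(Python, names mine) for the (zspace2,yspace2,xspace2,zspace1,yspace1,
xspace1) layout with 16x16x16 blocks:

```python
def blocked_index(z, y, x, block=16):
    """Map a voxel coordinate to the 2N-dimensional blocked layout:
    outer indices select the block, inner indices select the voxel
    within the block."""
    return (z // block, y // block, x // block,
            z % block, y % block, x % block)
```

Fetching any slice orientation then touches only the blocks whose outer
indices intersect the slice, i.e. the neighbourhood data.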
5) The discussion in the proposal on wavelet compression raises the
potential problem of the cost of modifying a single voxel in a
file (ie. read/write access). Although minc was originally designed
to allow read/write access, we have virtually no applications that
do this. Unless something significant changes (and perhaps large
volumes prefer to be modified in place, although I have my doubts
about the advisability of this), this worry seems to be about
something that virtually never happens. The only type of volume
that seems to be likely to be read/write is the label volume and
I do not think that this is a particularly good candidate for
wavelets. Furthermore, any interactive application that wants a
working volume backed by a file would simply be well advised to not
make the backing file have wavelet compression.
From what I have seen, the real problem with a wavelet structure is
that the decompression at full-resolution can be slow compared to
an uncompressed file. This might mean that wavelet-compressed files are
not good candidates for doing lots of automated
computation. However, one must compare the decompression time to
the subsequent computation that will be done - I suspect that the
decompression would be small compared to other calculations.
6) The arguments about data-type issues (page 9) seem to assume that
caching will always be done. The cost of caching is not negligible,
since a certain amount of index checking must be done on every
lookup (remember that the calculation goes back to the volume
whenever it needs a neighbourhood pixel, so this is not just a read
problem, but a computation problem). If you do not do caching then
for memory reasons one might want the internal volume
representation to be different from the user representation and the
file representation.
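To illustrate what I mean by the per-lookup cost: every voxel access
through a cache pays at least a which-block test before it can touch the
data. A toy sketch (Python, entirely hypothetical - real caches keep
several blocks, but the per-access check is the same):

```python
class ChunkCache:
    """Toy one-block voxel cache: every lookup pays an index computation
    and a cache-hit test before the data is touched."""

    def __init__(self, fetch, block=16):
        self.fetch = fetch      # callable mapping a block key to its data
        self.block = block
        self.key = None
        self.data = None

    def get(self, z, y, x):
        key = (z // self.block, y // self.block, x // self.block)
        if key != self.key:     # the check that is not free, on every access
            self.key, self.data = key, self.fetch(key)
        b = self.block
        return self.data[z % b][y % b][x % b]
```

Neighbourhood operations hit this path for every pixel they touch, so
the overhead shows up in computation, not just in reading.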
7) One thought related to complex numbers (pages 7-8): The current
implementation of complex numbers (which is admittedly poorly
supported) stores them with vector_dimension. Volume_io does
not handle vector data very well since it uses a pointer per row
(the fastest-varying dimension). For short vectors and small types,
this can mean more memory spent on pointers than on the data
itself. This issue should be
addressed in some way if greater use will be made of vector data in
the future.
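The arithmetic is simple enough to sketch (Python, names mine; a
complex-as-two-floats row is roughly break-even, but a short byte vector
is mostly pointer):

```python
def pointer_fraction(vector_len, elem_size, ptr_size=4):
    """Fraction of memory consumed by the row pointer when each row of
    the fastest-varying (vector) dimension gets its own pointer."""
    data = vector_len * elem_size
    return ptr_size / (ptr_size + data)
```

For a 3-element byte vector with a 4-byte pointer, 4 of every 7 bytes
are pointer - more pointer than data, as claimed.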
8) One of the issues that was not clear in my mind was the mechanism
for deciding on output file structure in generic tools. If I want
to do some volume arithmetic (minccalc) on a block-structured file,
should I get a new block-structured file? What if the file has
wavelet compression? Or should the output always have standard
voxel structure unless I specify a set of output structure options?
But then every application will need to have
file-structure-related options added whenever a new structure is
added.
Related to this is the question of how much an application needs to
know about file structure. Does the application need to have
intimate knowledge of the output structure (e.g. doing
wavelet-related things) or should that all be in the library? If
so, then how does the user control this? (Weird file names,
perhaps? Ughhh.) Can one create a completely general API that
allows the application programmer to be unaware of different file
structures? Can one also provide an API that gives complete control
to applications that really do want to know about the structure
(e.g. a wavelet streaming application)? What would all of this look
like?
9) David raised a question about label and voxel data co-existing. I
did not read that in the proposal, so perhaps the interpretation
depends on your pre-disposition. I do not favour putting label data
in with continuous voxel data, even as separate variables, but
rather treating the two types of data as equivalent and putting
them in separate files. Normally, the user would manage the two
files separately (and they are separate pieces of information that
are likely to go in different directions), with the advantage that
tools made for continuous voxel data can be used on label data (and
label data quickly turns into continuous data with blurring,
etc.). Alternatively, an application would do the multiple-file
management on behalf of the user. I think that the disadvantages of
treating the two types of data as fundamentally different would
outweigh the gain.
(Re-reading the above, I just want to make clear that I support the
originally proposed notion of an identifier for label volumes so
that discrete and uninterpolable data can be properly handled. I
just do not like the idea of making them completely different. The
user should be able to treat them in similar ways, but applications
would sometimes have to handle them differently.)
That said, I can see the problems that arise in the continuum from
geometric data to continuous voxel data. One can imagine wanting to
incorporate ROI information (polyhedral data, for example) in a
MINC file. But label data is really just another form of ROI
data. And then a fuzzy label volume is just another form of label
data. So where's the line? Should MINC incorporate all forms of
label data, geometric and voxel-based?
I think that the answer is to have another level of format to
handle the most general case of label data, that could support
polyhedra, meshes, voxel grids, etc. This format would be
able to include MINC data as part of a "scene". Ideally, one would
marry the MINC format with this format, but the danger is ending up
with an unimplemented monster. The simplest route is to sidestep
the problem by deferring it to another, meta, format (that may
never get implemented, but at least it would not derail the more
focused MINC effort).
10) Andrew raised the question of Mac OS X support. To my knowledge,
no Mac OS has ever been officially supported by NetCDF (perhaps
that has changed with OS X). However, building MINC and its
command-line friends should not be a big problem. Is the issue one
of windowing system rather than OS? Should the official MINC
viewing tools (whatever that means) support X11, Windows and
whatever-OS-X-calls-its-windowing-system?
11) Andrew makes a good point about conversion. However, I'm not sure
that the volume_io way is really the best route (all applications
read every format). I still think that the converter route makes
life simpler (especially when you add a new format in the world of
static linking - and beware of dynamic linking if you live in the
world of software quarantines).
Still, putting data-conversion into the plan is a good thing,
since it is usually the first step (and a very non-trivial
one). The world has simplified itself considerably in the past few
years, so good DICOM, Analyze (and maybe Interfile?) converters
would go a long way to making it easier to spread the use of MINC.
12) Would it make sense to develop a rough API definition before
having a meeting so that the discussions can be more specific? It
is often easy to have general discussions that talk about general
principles but that do not lead to specific design. Also, I have
found that people are more sensitive to potential problems when
they can translate an API into their own context. General
principles are often too distant for the omissions to show. John,
Leila, got the time?
Peter
----
Peter Neelin (neelin@bic.mni.mcgill.ca)