Comparing Image File Formats
by Blair Rossetti, Mar 30, 2020
The State of Things #
There has been a lot of talk about image file formats these days. The OME team wrote a blog post about it, people on the image.sc forum are discussing it, and funding groups (e.g. CZI) are putting money behind it. The AIC is certainly no stranger to the difficulties of old, outdated, and obscure image file formats. We put a fair amount of coding and computing effort into re-saving images from one format to another. In addition to the frustrations that developers are discussing on the forums, the AIC faces some unique challenges because our microscopes are all pre-commercial instruments, most of which run on LabVIEW. These instruments often spit out directory hierarchies of TIFF or binary files. The exact form of these directory hierarchies depends on the acquisition mode of the instrument and the person who wrote the control software. Since TIFF images tend to be quite large, we have been re-saving most of our data in the Keller Lab Block (KLB) format (with the exception of iPALM and FIB-SEM data). While the KLB format is fast and nicely compressible, support for KLB has apparently been dropped (as hinted at in this forum post by one of its developers). That brings us to this blog post. The AIC is now conducting a survey of image file formats to see if a champion exists, at least for our needs. The notes from our survey have been split into two parts: (1) a breakdown of the contenders and (2) a description of their pros and cons. Since the breakdown seemed more broadly useful, we posted it as a resource page. The analysis of each format, as it pertains to the AIC, is found below.
What We Need #
The AIC has several competing requirements for an efficient image format. These requirements reflect the unique operation of the AIC in which transient users generate large amounts of data that must be shipped back to their home institutions. With this in mind, the AIC requires a format that:
- allows for fast/parallel read access to regions of interest for processing,
- allows for fast/parallel write to expedite re-saving and updates,
- allows for compression to reduce overall size for data transfer,
- allows for fast file manipulations (e.g. mv, rm, ls) for data transfer and removal,
- allows for work in ImageJ and Imaris,
- and is stable enough for production pipelines.
What We Want #
There are also a couple of items that we would like to have in a new image format. Specifically, we would like an image format that:
- allows for multiresolution images for enhanced visualization,
- and allows for sparse image data for storing super-resolution images.
These last two points are not strictly necessary, but they help when working with software such as Imaris and BigDataViewer.
The Contenders #
Our main contenders fall roughly into three categories. The first category, the Workhorse, includes formats that store contiguous strips or blocks of bitmap data (and possibly metadata) in a single file. These Workhorses are by far the most mainstream as they include standard graphics formats like TIFF, JPEG, and PNG. However, this category also includes the newer KLB files that we currently use in-house. I call them the Workhorse formats because most of them have been around forever, they are trustworthy and portable, and they always get the job done. Unfortunately, their lack of bells and whistles can make them a pain to use with multidimensional data.
Let’s consider TIFF for a moment. A TIFF file can ostensibly encode an N-dimensional image, but the image data must be represented as a series of 2D slices (often called pages). In order for a TIFF reader to reconstruct the image in memory, it must know exactly how these pages were interleaved when the file was written. Unfortunately, the TIFF specifications do not require a metadata field for the dimension order. This is one reason why many people prefer to use OME-TIFF and Bio-Formats for reading/writing microscope-specific TIFF files. OME-TIFF stores metadata (including the dimension order) in an OME-XML block inside the image file. The Bio-Formats reader can parse this OME-XML block to know exactly how to reconstruct the multidimensional image. The tradeoff, however, is that the OME-TIFF specifications currently¹ restrict images to five dimensions (XYZCT).
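To make the page-interleaving problem concrete, here is a minimal sketch of how a reader could turn a dimension order into a page index. It assumes the OME convention that the axes after "XY" are listed from fastest-varying to slowest-varying; the function name plane_index and the example sizes are hypothetical, not part of any library.

```python
def plane_index(dimension_order, sizes, coords):
    """Return the 0-based page index of the 2D plane at `coords`.

    dimension_order: e.g. "XYZCT"; the axes after "XY" are assumed to be
                     listed from fastest-varying to slowest-varying.
    sizes, coords:   dicts keyed by "Z", "C", and "T".
    """
    axes = dimension_order[2:]
    if not dimension_order.startswith("XY") or sorted(axes) != ["C", "T", "Z"]:
        raise ValueError(f"unsupported dimension order: {dimension_order!r}")
    index, stride = 0, 1
    for axis in axes:
        if not 0 <= coords[axis] < sizes[axis]:
            raise IndexError(f"{axis}={coords[axis]} is out of range")
        index += coords[axis] * stride  # position along this axis
        stride *= sizes[axis]           # pages spanned by one step of the next axis
    return index


sizes = {"Z": 10, "C": 2, "T": 5}
coords = {"Z": 3, "C": 1, "T": 2}
print(plane_index("XYZCT", sizes, coords))  # 3 + 10 * (1 + 2 * 2) = 53
print(plane_index("XYCZT", sizes, coords))  # 1 + 2 * (3 + 10 * 2) = 47
```

The same (Z, C, T) coordinate lands on a different page depending on the order, which is why a plain TIFF without this piece of metadata cannot be reassembled reliably.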
The second category, the Jack of All Trades, consists of single file formats that contain their own internal file systems. Here we are largely referring to the HDF5 format and variants thereof. I call them the Jack of All Trades formats because HDF5 files can store many different types of data as abstract objects within a single file. Moreover, each object can have its own attributes (i.e., metadata) and can be linked or grouped with other objects. HDF5 files act much like Unix file systems, and accessing a piece of data is nearly identical to accessing a file in a directory. We can store image data in an HDF5 file as one contiguous block or in chunks of a defined size. The chunks are organized in a B-tree, which makes reading regions of interest extremely efficient. Since HDF5 provides such an abstract data model, there have been many different implementations of this format in the microscopy realm, including Imaris (.ims), BigDataViewer (.h5), and CellH5 (.ch5).
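As a rough illustration of why chunking helps region-of-interest reads, the sketch below lists which chunks of a regular grid overlap a given ROI. It is plain index arithmetic under assumed sizes and does not reproduce HDF5's actual B-tree lookup; the function name chunks_for_roi and the volume dimensions are hypothetical.

```python
from itertools import product


def chunks_for_roi(roi_start, roi_stop, chunk_shape):
    """List the chunk-grid indices that overlap a half-open ROI.

    roi_start, roi_stop: per-axis voxel bounds (stop is exclusive).
    chunk_shape:         per-axis chunk size in voxels.
    """
    ranges = []
    for start, stop, size in zip(roi_start, roi_stop, chunk_shape):
        if not 0 <= start < stop:
            raise ValueError("ROI must be non-empty and non-negative")
        first = start // size      # chunk holding the first voxel
        last = (stop - 1) // size  # chunk holding the last voxel
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


# Hypothetical 512 x 2048 x 2048 (ZYX) volume stored in 64^3 chunks
shape = (512, 2048, 2048)
chunk = (64, 64, 64)
needed = chunks_for_roi((100, 500, 500), (130, 600, 600), chunk)

total = 1
for n, c in zip(shape, chunk):
    total *= -(-n // c)  # ceiling division: chunks along this axis

print(len(needed), "of", total, "chunks")  # 18 of 8192 chunks
```

A reader only has to locate and decompress the chunks in that list, rather than scan the whole dataset.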
The third category, the Newcomers, includes formats that leverage the native file system to house directory hierarchies of image chunks. The biggest three formats in this category are tileDB, N5, and Zarr. Rather than store image chunks in an internal file system, as with HDF5, these formats store chunks within the native file system containers or directories. As a consequence, they are the only formats that can trivially read and write image chunks in parallel. As their nickname suggests, these file formats are relatively new to the imaging scene and their specifications are still being defined.
Before we continue with the pros and cons of these formats, you might want to take a look at our Image File Formats resource page.
1. Workhorse Formats #
In general, these formats are extremely stable and have been around for a long time. This is great from a production standpoint because updates (if there are any) are unlikely to break your code. However, stability is a double-edged sword. Many of these formats have not aged well, and they cannot easily handle modern microscopy data. For instance, terabyte-scale images have revealed the need for both multiresolution image hierarchies and image chunking. Although pyramidal image data can be stored in OME-TIFF files, such data can be difficult or impossible to handle with most other Workhorse formats. Similarly, TIFF allows for chunking data into tiles for faster random read access, but this implementation is decidedly less flexible than that of formats from the other two categories.
2. Jack of All Trades Formats #
I consider the Jack of All Trades formats to be a nice trade-off between stability and flexibility. HDF5 has been around since the early 2000s, but it has a clever and efficient internal structure. In fact, B-tree organization of matrix data is, on average, faster than most memory-mapped files (see this Stack Overflow post). HDF5 formats are not without their limitations. Storing everything in a single file can be dangerous. As Tobias Pietzsch pointed out in this image.sc post:
We chose [HDF5] exactly because it provides the capabilities mentioned above: chunked datasets (blocked images) and many datasets in one file (filesystem in a file, storing multiple resolutions, timepoints, channels, …). It has some serious drawbacks that let me doubt whether we would choose it again. In particular, it doesn’t support multithreaded writing and it has no journaling, i.e., if the computer crashes while writing/modifying a huge HDF5 file, it is likely that the whole file is unrecoverably corrupted. –Tobias Pietzsch
One minor addendum: parallel HDF5 does support concurrent writes from multiple MPI processes on parallel file systems that support MPI-IO. That said, parallel file systems are not typically found outside of data centers.
3. Newcomer Formats #
These formats are largely responsible for the recent chatter about image file formats. The trio of tileDB, N5, and Zarr are certainly the formats (or format prototypes) of the future. They have been developed with the goal of making distributed computing easier. By representing images as chunks of data within a file system container, details of the access protocols are left to the operating system or cloud compute/storage platform. In other words, these image chunks can live as easily in a local directory as they can in an Amazon S3 bucket. A big advantage of storing every image chunk as an individual file is that chunks can be written or read in parallel. Again, we have a double-edged sword. Our image goes from being a single big file to being a single directory of many (often many, many) small files. The result is an image that is extremely difficult to move, delete, or index. These formats are extremely appealing, but we are still in the early days.
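The chunk-per-file idea can be sketched with nothing but the standard library. The example below writes each chunk of a small placeholder image to its own compressed file from a thread pool, then counts the files that result. The root/z/y/x directory layout, the zlib codec, and the helper names write_chunk and write_image are illustrative assumptions, not the actual on-disk layout or API of tileDB, N5, or Zarr.

```python
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def write_chunk(root, index, payload):
    """Compress one chunk and write it to its own file under `root`."""
    path = os.path.join(root, *map(str, index))  # e.g. root/0/3/1
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(payload))
    return path


def write_image(root, grid_shape, chunk_nbytes, workers=4):
    """Write one file per chunk; no two workers ever share a file."""
    payload = bytes(chunk_nbytes)  # placeholder voxels (all zeros)
    indices = list(product(*(range(n) for n in grid_shape)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda idx: write_chunk(root, idx, payload), indices))


with tempfile.TemporaryDirectory() as root:
    paths = write_image(root, grid_shape=(4, 8, 8), chunk_nbytes=4096)
    n_files = sum(len(files) for _, _, files in os.walk(root))

print(n_files)  # 4 * 8 * 8 = 256 files for one small image
```

Because every chunk is an independent file, the writers need no locking. The flip side is visible in the file count: even this toy image produces 256 files, and every mv, rm, or ls has to touch each of them.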
Caveats and Considerations #
So far we have considered file formats in three broad categories. Yet formats within a category can be quite different in their implementation and restrictions. In the section below, I have made several notes about the file formats that we are most interested in exploring. I have specifically focused on advantages and disadvantages that were not already covered above.
Keller Lab Block (.klb) #
- allows for parallel read/write; however, parallel writes are dependent on the block size
- restricted to five-dimensional data
- only supports 256 characters of user-defined metadata
- poor documentation
- lacks support
The larger the block size, the better the KLB compression ratio; however, this ratio reaches saturation already for relatively small blocksizes. Read and write times are not optimal for extreme block sizes, i.e. both for very small and for very large blocks. If blocks are too small, communication overhead in processing threads becomes an issue. If blocks are too large, computations cannot be parallelized to the maximum extent (in the most extreme scenario, a single thread has to handle the entire image) –Amat et al. 2015 supplemental
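The compression half of this trade-off can be explored with a small experiment. The sketch below compresses a synthetic byte ramp in independent blocks of different sizes and reports the overall ratio. Here bz2 is only a stand-in codec and the data is artificial, so the numbers say nothing about real KLB files; the helper name blockwise_ratio is hypothetical.

```python
import bz2


def blockwise_ratio(data, block_size):
    """Compress `data` in independent blocks; return raw size / compressed size."""
    compressed = 0
    for start in range(0, len(data), block_size):
        compressed += len(bz2.compress(data[start:start + block_size]))
    return len(data) / compressed


# Synthetic, repetitive "image": a 256-level ramp repeated 1024 times (256 KiB)
data = bytes(range(256)) * 1024

ratios = {size: blockwise_ratio(data, size) for size in (64, 1024, 16384, 262144)}
for size, ratio in ratios.items():
    print(f"{size:>7} bytes/block -> ratio {ratio:.1f}")
```

Very small blocks pay the codec's per-block overhead over and over, which is why the ratio improves as blocks grow. Where the curve flattens depends on the data, so a test like this is best run on a representative image.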
HDF5 (.h5) #
- allows for parallel read, but not parallel write
- parallel writing of multiple files requires a workaround (see forum posts on image.sc and hdfgroup.org)
- extensive documentation
- reading on NFS storage clusters can be slower than local storage since hierarchical access is not available
pHDF5 (.h5) #
Parallel HDF5 is not a separate file type. It has the same form as serial HDF5; however, there are several caveats in using this API.
- allows parallel write on a parallel file system with MPI-IO
- support for compression is limited
CellH5 (.ch5) #
- variant of HDF5
- designed for high-content screening data
- poor documentation
- lacks support
BigDataViewer (.xml, .h5) #
- HDF5 variant
- contains multiresolution data
- metadata is stored externally in an XML file
- poor documentation
- support appears to be dropping in favor of N5
Imaris (.ims) #
- HDF5 variant
- open sourced in 2015, but linked to a for-profit company
- contains multiresolution data
- limited documentation
NetCDF (.nc, .cdl) #
- consists of several different abstracted format types including HDF5
- NetCDF team has been working to include Zarr as a format (see NCZarr Overview)
- used as a dependency in Bio-Formats
- performance depends on which version of the file format is used
- has metadata size limits that have caused problems for large files (see forum posts [1] & [2])
- reads can be slow due to metadata overhead
- detailed documentation
NetCDF-4 provides parallel file access to both classic and netCDF-4/HDF5 files. The parallel I/O to netCDF-4 files is achieved through the HDF5 library while the parallel I/O to classic files is through PnetCDF. A few functions have been added to the netCDF C API to handle parallel I/O. You must build netCDF-4 properly to take advantage of parallel features (see Building with Parallel I/O Support). –NetCDF Users Guide
The use of HDF5 as a data format adds significant overhead in metadata operations, less so in data access operations. We continue to study the challenge of implementing netCDF-4/HDF5 format without compromising performance. –NetCDF Users Guide
tileDB (.tdb) #
- like N5/Zarr, tileDB is hampered when too many fragments are created (see GitHub issue)
- open source, but linked to a for-profit company
- detailed documentation
Zarr (.zarr) #
- performance is heavily dependent on chunk size (see GitHub issue)
- multiresolution data has not been completely sorted out yet (see GitHub issues [1] & [2])
- poor documentation
- written in Python
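One reason chunk size matters so much for chunk-per-file formats is simple arithmetic: it sets both how many files exist and how much data each read or write moves. The sketch below tabulates both for a hypothetical 16-bit volume; the shape, the cubic chunk edges, and the helper name chunk_stats are assumptions chosen for illustration.

```python
from math import ceil, prod


def chunk_stats(shape, chunk_shape, bytes_per_voxel=2):
    """Return (number of chunks, uncompressed bytes per full chunk)."""
    n_chunks = prod(ceil(n / c) for n, c in zip(shape, chunk_shape))
    chunk_bytes = prod(chunk_shape) * bytes_per_voxel
    return n_chunks, chunk_bytes


shape = (1000, 2048, 2048)  # hypothetical 16-bit ZYX volume
stats = {edge: chunk_stats(shape, (edge,) * 3) for edge in (32, 64, 128, 256)}
for edge, (n_chunks, chunk_bytes) in stats.items():
    print(f"{edge:>3}^3 voxels: {n_chunks:>6} chunks, {chunk_bytes // 1024:>5} KiB each")
```

Halving the chunk edge multiplies the number of files by roughly eight, which strains the file system, while doubling it means every small ROI read pulls in eight times more data than before.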
N5 (.n5) #
- supports multiresolution data
- performance is heavily dependent on chunk size
- developed in-house at Janelia
- great integration with ImageJ/Fiji
- poor documentation
Punch Line #
Without a doubt, tileDB/Zarr/N5 are the most popular formats right now. As such, they are getting a lot of support and development. Of these three, I prefer tileDB for the stability of the ecosystem, documentation, and choice of language (C++). Unfortunately, all of these formats are too new for production use at the AIC.
The only stable format that can be used in both Imaris and ImageJ is the Imaris 5.5 format. With this background research in mind, the most logical next step is to benchmark several of the top contenders. We will explore the feasibility of writing a parallel converter to move data from our lattice light sheet instruments to an Imaris 5.5 compatible format. This will also give us the opportunity to experiment with variants of the Imaris 5.5 format that balance parallel read/write capability with data chunking. For instance, there is some precedent for linking multiple HDF5 files in a tileDB/N5/Zarr hierarchy. Stephan Saalfeld has already attempted to write HDF5 files in an N5 paradigm (see n5-hdf5). Christian Tischer wrote a Java-based IMS writer that uses the linking ability of HDF5 to write several .h5 files that act as one Imaris file.
1. Efforts are being made to allow for high-dimensional image data within the OME-TIFF spec (see https://docs.openmicroscopy.org/ome-model/6.1.0/developers/6d-7d-and-8d-storage.html).
Last modified Jul 15, 2020