Notes for development using spectral-cube

spectral-cube is flexible and can be used within other packages for development beyond the core package’s capabilities. Two significant strengths are its use of memory-mapping and its integration with dask (see Integration with dask) to efficiently handle larger-than-memory data.

This page provides suggestions for software development using spectral-cube in other packages.

The following two sections cover standard usage of SpectralCube. The third discusses usage with the dask integration in DaskSpectralCube.

Handling large data cubes

spectral-cube is specifically designed for handling larger-than-memory data and minimizes creating copies of the data. SpectralCube uses memory-mapping and provides options for executing operations with only subsets of the data (for example, the how keyword in SpectralCube.moment()).
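
For instance, a minimal sketch of a slice-wise moment calculation (the filename here is a placeholder):

    from spectral_cube import SpectralCube

    cube = SpectralCube.read('cube.fits')  # placeholder filename

    # how='slice' computes the moment one spectral slice at a time,
    # limiting peak memory use; how='cube' would load the full array.
    mom0 = cube.moment(order=0, how='slice')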

Masking operations can be performed “lazily”, deferring computation until a view of the underlying boolean mask array is requested. See Masking for details on these implementations.
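
As a short illustration (the filename is a placeholder, and the cube is assumed to be in units of K):

    import astropy.units as u
    from spectral_cube import SpectralCube

    cube = SpectralCube.read('cube.fits')  # placeholder filename

    # The comparison builds a lazy mask: no boolean array is evaluated
    # until the masked data are actually accessed.
    masked_cube = cube.with_mask(cube > 0.5 * u.K)  # assumes the cube is in K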

Further strategies for handling large data are given in Handling large datasets.

Parallelizing operations

Several operations implemented in SpectralCube can be parallelized using the joblib package. Built-in SpectralCube methods that accept the parallel keyword will use joblib when it is enabled.

New methods can take advantage of these features by creating custom functions to pass to SpectralCube.apply_function_parallel_spatial() and SpectralCube.apply_function_parallel_spectral(). These methods accept functions that take a data array and a mask array as input, along with optional **kwargs, and return an output array of the same shape as the input.
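
A minimal sketch (the filename and the smoothing function are hypothetical; keyword arguments given to the method are forwarded on to the custom function):

    import numpy as np
    from astropy.convolution import Gaussian1DKernel, convolve
    from spectral_cube import SpectralCube

    def smooth_spectrum(spectrum, mask, kernel=None):
        # Receives one spectrum and its boolean mask; must return an
        # array with the same shape as the input spectrum.
        result = convolve(spectrum, kernel)
        result[~mask] = np.nan
        return result

    cube = SpectralCube.read('cube.fits')  # placeholder filename
    smoothed = cube.apply_function_parallel_spectral(
        smooth_spectrum,
        kernel=Gaussian1DKernel(2),
        parallel=True,   # enable joblib-based parallelization
        num_cores=4,
    )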

Unifying large-data handling and parallelization with dask

spectral-cube’s dask integration unifies many of the features above and adds further options that leverage the dask ecosystem. The Integration with dask page provides an overview of general usage and recommended practices, including:

  • Using different dask schedulers (synchronous, threads, and distributed)

  • Triggering dask executions and saving intermediate results to disk

  • Efficiently rechunking large data for parallel operations

  • Loading cubes in CASA image format

For an interactive demonstration of these features, see the Guide to Dask Optimization.
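
As a minimal sketch of the practices listed above (the filename is a placeholder):

    from spectral_cube import SpectralCube

    # use_dask=True returns a DaskSpectralCube
    cube = SpectralCube.read('cube.fits', use_dask=True)  # placeholder filename

    # Select the scheduler used for subsequent operations
    cube.use_dask_scheduler('threads', num_workers=4)

    # Rechunk so each block holds full spectra; save_to_tmp_dir triggers
    # the computation and stores the intermediate result on disk.
    cube = cube.rechunk(chunks=(-1, 'auto', 'auto'), save_to_tmp_dir=True)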

For further development, we highlight the ability to apply custom functions using dask. A DaskSpectralCube loads the data as a dask Array. As with the non-dask SpectralCube, custom functions can be used with DaskSpectralCube.apply_function_parallel_spectral() and DaskSpectralCube.apply_function_parallel_spatial(). Effectively, these are wrappers around dask.array.map_blocks and accept common keyword arguments.
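
For instance, a minimal element-wise sketch (the filename is a placeholder; an element-wise operation keeps the output shape equal to the input shape regardless of how the data are chunked):

    import numpy as np
    from spectral_cube import SpectralCube

    cube = SpectralCube.read('cube.fits', use_dask=True)  # placeholder filename

    def clip_negatives(data):
        # Element-wise operation, so the output has the same shape as
        # the input no matter how the data are chunked.
        return np.where(data < 0, np.nan, data)

    clipped = cube.apply_function_parallel_spectral(clip_negatives)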

Note

The dask array can be accessed with DaskSpectralCube._data, but we discourage this because the built-in methods include checks, such as applying the mask to the data.

If you have a use case that needs one of the dask array’s other operations, please raise an issue on GitHub so we can add this support!

The Integration with dask page gives a basic example of using a custom function. A more advanced example is shown in the parallel fitting with dask tutorial, which demonstrates fitting a spectral model to every spectrum in a cube, applied in parallel over chunks of the data. This fitting example is a guide for using DaskSpectralCube.apply_function_parallel_spectral() with the following features (a minimal sketch follows the list):

  • A change in array shape and dimensions in the output (drop_axis and chunks in dask.array.map_blocks)

  • Using dask’s block_info dictionary in a custom function to track the location of a chunk in the cube
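
The sketch below illustrates both points with a simple shape-changing reduction; the filename and the peak_channel function are hypothetical, accepts_chunks and the forwarded drop_axis follow the tutorial’s usage, and the remaining keyword arguments are passed through to dask.array.map_blocks:

    import numpy as np
    from spectral_cube import SpectralCube

    cube = SpectralCube.read('cube.fits', use_dask=True)  # placeholder filename

    def peak_channel(data, block_info=None):
        # block_info is supplied by dask.array.map_blocks and records where
        # this chunk sits in the full cube, e.g.
        # block_info[0]['array-location'] -> [(spec0, spec1), (y0, y1), (x0, x1)]
        # Collapse the spectral axis (axis 0) to the channel of the peak;
        # nan_to_num guards against all-NaN spectra.
        return np.argmax(np.nan_to_num(data, nan=-np.inf), axis=0)

    peaks = cube.apply_function_parallel_spectral(
        peak_channel,
        accepts_chunks=True,  # operate on whole chunks, not single spectra
        drop_axis=[0],        # forwarded to dask.array.map_blocks:
                              # the output drops the spectral axis
    )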