Iterating SWIFTGalaxies

When similar operations need to be applied to multiple SWIFTGalaxy objects it often makes sense to iterate over them (possibly in parallel, see below). The SWIFTGalaxies class aims to make this convenient and efficient, mostly by internally managing the order of iteration to avoid re-reading data from disk as much as possible - most often when two objects of interest share a common top-level cell in a SWIFT simulation.

Setting up a SWIFTGalaxies instance is deliberately designed to be as similar to creating a SWIFTGalaxy as possible. There is one difference: the halo catalogue helper class is initialized with a list of targets instead of a single target identifier.

The SWIFTGalaxies class otherwise broadly accepts the same initialization arguments as the SWIFTGalaxy class, such as auto_recentre.

It is crucial to know that the order of iteration used by a SWIFTGalaxies is internally optimized; see the order of iteration section below.

Multiple-target halo catalogues

The halo catalogue helper classes (such as SOAP, Velociraptor, Caesar, …) can be initialized with a list of targets instead of a single target identifier (numpy arrays or unyt arrays are also OK). This results in some new behaviour.

When a halo catalogue is created with multiple targets, accessing attributes of the object (such as via the SWIFTGalaxy halo_catalogue attribute) will return the relevant property of all targets in the catalogue. For example:

soap = SOAP("my_soap_catalogue.hdf5", soap_index=[11, 22, 33])
m200 = soap.spherical_overdensity_200_crit.total_mass

sets m200 to something like cosmo_array([1.2e+12, 3.4e+12, 5.6e+12], dtype=float32, units='1.98841586e+30*kg', comoving=False). Keep in mind that the format of halo catalogue values depends on the halo catalogue, keeping formats that should already be familiar to users of each catalogue type.

Once iteration over the targets contained in a SWIFTGalaxies begins, this behaviour changes: during an iteration step, the catalogue is “masked” down to a single object and behaves just like it would for a single-target SWIFTGalaxy:

sgs = SWIFTGalaxies("my_snapshot.hdf5", soap)
for sg in sgs:
   m200 = sg.halo_catalogue.spherical_overdensity_200_crit.total_mass

Then on each iteration the value of m200 will look similar to cosmo_array([1.2e+12], dtype=float32, units='1.98841586e+30*kg', comoving=False).

Order of iteration

Warning

The most important thing to remember when using SWIFTGalaxies is that it determines the best order to iterate over your chosen galaxies itself to minimize I/O operations. Be careful not to assume that the iteration is in the order of the target identifiers that you passed to the halo catalogue helper class (such as SOAP).

The main purpose of the SWIFTGalaxies class is to determine an order to iterate through the list of target objects without duplicating I/O operations more than necessary.

In brief, the class evaluates two iteration schemes at initialization and chooses the one that will be most efficient. The first scheme is the “sparse” solution. This determines which top-level cells need to be read for each target galaxy and groups targets together when they share a common region. The second scheme is the “dense” solution. This determines the largest region needed by any single target and then covers the simulation volume in regions of that size on a grid. Targets are then assigned to the region with the closest centre to collect them into groups. Any regions without targets in them are discarded. The bounding box of each region is then adjusted to the minimum extent needed to contain the targets assigned to it. This result in a final set of (probably overlapping) regions. This second scheme is most often optimal when there are many closely packed targets that often overlap the boundaries of top-level cells, such as when iterating over all galaxies in a simulation volume.

The iteration order of the targets is available from the iteration_order property. However, if obtaining ordered results is required, using the map() method is usually a more convenient approach than e.g. sorting the output of iteration in the optimal order.

Map

The swiftgalaxy.iterator.SWIFTGalaxies.map() function can be used to apply a function to the collection of SWIFTGalaxy’s represented by the SWIFTGalaxies object (one at a time), and obtain the results. As a simple example:

from swiftgalaxy import SWIFTGalaxies, SOAP

# define the function that we will apply to each SWIFTGalaxy object:
def dm_median_position(sg):
    return np.median(sg.dark_matter.coordinates, axis=0)

sgs = SWIFTGalaxies(
    "my_snapshot.hdf5",
    SOAP(
        "my_soap.hdf5",
        soap_index=[11, 22, 33],
    ),
)
my_result = sgs.map(dm_median_position)

The my_result variable will be a list containing the result of the dm_median_position function applied to the galaxies at soap_index=11, 22 and 33 in that order. If the list [22, 33, 11] was instead passed to the SOAP (or other halo catalogue class from swiftgalaxy), the output order would change correspondingly (even though internally the galaxies are most likely iterated over in the same order in both cases). Using map() is in general not equivalent to:

my_result = [dm_median_position(sg) for sg in sgs]

Relying on the iteration order for sg in sgs should be avoided if the order of iteration or output matters, but can be appropriate if order is unimportant (such as a function or operation that produces an output file for each input SWIFTGalaxy named according to its unique identifier).

The map() function can also accept additional argument values and/or keyword argument values. For example, the following code:

sg1 = SWIFTGalaxy(...)
sg2 = SWIFTGalaxy(...)
my_result = [my_func(sg1, 123, extra_data=456), my_func(sg2, 789, extra_data=None)]

Can be more succinctly written as (and will run more efficiently as):

sgs = SWIFTGalaxies(...)  # contains the same galaxies as sg1 and sg2, in that order
my_result = sgs.map(
   my_func,
   args=[(123, ), (789, )],
   kwargs=[dict(extra_data=456), dict(extra_data=None)]
)

Notice especially that the argument lists are bundled in tuple’s (of one element, in this case). The comma in (123, ) is therefore not optional. If the function accepted two arguments they could be passed as something like args=[(123, "abc"), (456, "def")]. Additional keyword arguments can similarly be added by adding additional dict entries.

Parallel iteration

There is an obvious opportunity to support iterating over SWIFTGalaxy objects in parallel through the SWIFTGalaxies interface. The initial release of the SWIFTGalaxies feature has omitted this to focus on a working serial implementation first. Tools for parallel analysis are planned for a future release.