Doing lots of things at once...

I am very busy now in the lead-up to Christmas. I've got a sort of deadline of January to get some production simulations run, so that I can write some papers for submission to journals and put together some talks. Any hints of good places for me to talk at are welcome ;-)

This means that I now have to get the code to a point where it can run reliable and efficient simulations. My success with the Molpro forcefield meant that my next focus was to allow this forcefield to run in a background thread (so that the QM calculation could be performed at the same time as the MM and, for a free energy perturbation, the QM energies of the reference and perturbed states could be calculated simultaneously). My initial plan was to write something quick and dirty to get the job done (as I had in my simple python QM/MM script). However, I am very reluctant to put quick and dirty things into Sire. I am still unhappy with the Molecule/ParameterTable split, and got very uncomfortable adding yet more dodgy code.

The result was that I sat down and *finally* worked out how to do networking in Sire. It's amazing - I lost months at the beginning of this year trying to sort out the networking, and trying out lots of different designs, before finally scrapping the whole lot and deciding to plough ahead without anything. Yet this week I managed to do the whole lot, from planning through to implementation and testing. I think that it was because I finally knew what I wanted the networking to achieve, and because the right design had popped into my head (it's amazing how good software designs can just sneak up on you - normally when daydreaming about something else entirely!).

Earlier this year I was struggling with the networking because I was trying to use unique objects, i.e. all objects were given a unique ID number, and *somehow* their state was managed across an entire cluster. No matter how I tried, this scheme always ended in failure, as it was not possible to write efficient energy routines for objects whose state could be changed by remote threads or processors. It was because of this that I junked that whole design and switched instead to the implicitly shared, self-managed object pattern which is now common in Sire.

I knew that this design was implicitly compatible with networking, as each object was self-managing and so could be copied in its entirety to another processor or thread. The remote processor or thread could then work on the copy without worrying about any part of it being changed by another processor or thread (as the object can only be changed via its public API, and doesn't expose any of its internal data). However, knowing that self-managed objects are implicitly safe with networking and actually implementing networking with self-managed objects are two different things. Indeed, there is one huge problem....

...self-managed objects can be copied, and indeed there could be hundreds of copies of the same object.

...however, there is only a single copy of a thread, or an MPI processor. Thus network resources *cannot* be held by self-managed objects.

...thus we have two rules.

(1) Self-managed, copyable objects may not contain non-copyable objects like network resources, and, by implication

(2) networking must be performed by placing self-managed, copyable objects into a non-copyable object.
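To make rule (1) concrete, here is a minimal sketch in plain Python (not Sire code). The class names ForceFieldData and BadForceField and the helper can_copy are hypothetical, and a threading.Lock stands in for a real network resource. The point is only that a value object made of plain data copies cleanly, while one that holds a live resource does not.

```python
import copy
import threading


class ForceFieldData:
    """Hypothetical self-managed value object: plain data only, so it copies safely."""

    def __init__(self, charges):
        self._charges = list(charges)

    def charges(self):
        # Hand out an immutable view so the internal list is never exposed.
        return tuple(self._charges)


class BadForceField(ForceFieldData):
    """Breaks rule (1): it holds a resource (a lock) that cannot be copied."""

    def __init__(self, charges):
        super().__init__(charges)
        self._lock = threading.Lock()  # stand-in for a thread or MPI node


def can_copy(obj):
    """Return True if a full, independent copy of obj can be made."""
    try:
        copy.deepcopy(obj)
        return True
    except TypeError:
        return False


good_ok = can_copy(ForceFieldData([0.4, -0.8, 0.4]))
bad_ok = can_copy(BadForceField([0.4, -0.8, 0.4]))
```

Rule (2) is then the only way out: the resource lives in its own non-copyable object, and the copyable data is placed inside it.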

These rules imply a pretty simple design. Each type of network resource is handled by a different class, e.g. ThreadWorker for a thread, MPIWorker for an MPI node, or PoolWorker for a resource from a pool of network resources (with the potential for the actual thread or node to change transparently to the application). I then have a user-facing class that represents the network resource, but which can be copied, passed around and edited, e.g. ThreadProcessor (corresponding to ThreadWorker) and MPIProcessor (corresponding to MPIWorker). I can then define a type of calculator which is used to perform calculations on that resource, e.g. FFCalculator, which performs calculations on a ForceField. All that is then needed is for the Processor class to be activated with a particular Calculator class. Activation returns a Worker class that performs the calculation described by the Calculator on the network resource controlled by the Worker. To give an example:


thread_proc = FFThreadProcessor() # create a processor that is a background thread

ff = InterCLJFF() # create a forcefield class to calculate CLJ energies

ff.addMolecules( molecules, parameters ) # add copies of the molecules

thread_proc.setForceField(ff) # give a copy of the forcefield to the processor

active_proc = thread_proc.activate() # activate the processor

# the active processor is a unique object which cannot be copied

# The active processor contains a copy of the forcefield, and has
# a forcefield like interface. It can be queried to return its copy
# of the forcefield.

# tell the active processor to start calculating the energy
# (so now done in a background thread)
active_proc.recalculateEnergy()

# get the resulting energy - this blocks until the processor
# has finished
nrg = active_proc.energy()
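The pattern behind the example above can be sketched in plain Python. This is not the Sire implementation: ToyForceField, ToyThreadProcessor and ToyThreadWorker are hypothetical stand-ins, and the "energy" is just a sum of made-up numbers. It only illustrates the split between a copyable processor that describes the job and a non-copyable worker that owns the thread.

```python
import copy
import threading


class ToyForceField:
    """Hypothetical stand-in for a forcefield: a copyable value object."""

    def __init__(self, pair_energies):
        self._pair_energies = list(pair_energies)

    def energy(self):
        return sum(self._pair_energies)


class ToyThreadWorker:
    """Non-copyable: owns the background thread and all thread handling."""

    def __init__(self, forcefield):
        self._ff = copy.deepcopy(forcefield)  # private copy, nobody else can edit it
        self._thread = None
        self._result = None

    def __copy__(self):
        raise TypeError("an active worker cannot be copied")

    def __deepcopy__(self, memo):
        raise TypeError("an active worker cannot be copied")

    def recalculate_energy(self):
        def run():
            self._result = self._ff.energy()

        self._thread = threading.Thread(target=run)
        self._thread.start()  # returns immediately; the work runs in the background

    def energy(self):
        if self._thread is None:
            self.recalculate_energy()
        self._thread.join()  # blocks until the background calculation has finished
        return self._result

    def forcefield(self):
        return copy.deepcopy(self._ff)  # hand back a copy, never the original


class ToyThreadProcessor:
    """Copyable description of the job; holds no thread until activated."""

    def __init__(self):
        self._ff = None

    def set_forcefield(self, forcefield):
        self._ff = copy.deepcopy(forcefield)

    def activate(self):
        return ToyThreadWorker(self._ff)


proc = ToyThreadProcessor()
proc.set_forcefield(ToyForceField([-1.5, -2.0, 0.25]))
proc_copy = copy.deepcopy(proc)  # fine: the processor is only a description

worker = proc.activate()
worker.recalculate_energy()
nrg = worker.energy()
```

Because the worker only ever touches its own private copy of the forcefield, the forcefield class itself needs no locking at all.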

While it might not look like it, this design means that *all* of the thread handling code is in ThreadWorker (which is returned when the processor is activated). This makes it trivial to parallelise the evaluation of the forcefield, as there is no need for the forcefield class itself to worry about mutexes etc. Evaluating multiple forcefields in parallel is thus now possible and straightforward. I am planning to make different calculators that work with Simulation objects, so that it will be possible to run multiple simulations in parallel. This scheme also allows for less specific network devices to be used, e.g. PoolWorkers, which use a pool of network resources and can move jobs between the members of the pool to manage failure and load balancing.
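As a rough illustration of evaluating several forcefields at once, here is a sketch that uses Python's standard ThreadPoolExecutor as a stand-in for a pool of workers. ToyForceField and evaluate_all are hypothetical names and the energies are made-up numbers; this is not how PoolWorker is implemented. Each job gets its own copy of the forcefield, so no locking is needed.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


class ToyForceField:
    """Hypothetical stand-in for a forcefield: a copyable value object."""

    def __init__(self, pair_energies):
        self._pair_energies = list(pair_energies)

    def energy(self):
        return sum(self._pair_energies)


def evaluate_all(forcefields, max_workers=2):
    """Evaluate each forcefield on a pool thread, working on private copies."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(copy.deepcopy(ff).energy) for ff in forcefields]
        # result() blocks until that particular job has finished
        return [future.result() for future in futures]


# Reference and perturbed states of a free energy perturbation (toy numbers)
reference = ToyForceField([-1.0, -2.0])
perturbed = ToyForceField([-1.0, -2.5])

e_ref, e_pert = evaluate_all([reference, perturbed])
delta = e_pert - e_ref
```

Note that in standard CPython, threads running pure Python code do not actually execute at the same time, so this shows the structure rather than a real speed-up. The benefit appears when the work is done outside the interpreter, e.g. while waiting on an external QM program.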

The only slight mess is that it requires some tweaking to allow for forcefields that want to parallelise their own calculation. These will need to use specific processors, and will not be able to evaluate their energies or forces on their own. Instead, they will need to delegate the calculation back to the specific processor that they are on (as the forcefield could be copied, and thus cannot contain information about individual network resources). This also means that saving and restoring a simulation becomes a little messier, as the network resources cannot be saved or restored - only the meta-information about those resources.
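One way to picture the save/restore point is the sketch below, which writes out only a description of the resource and asks for a fresh one on restore. Everything here is an assumption made for illustration: the class ToyActiveProcessor, its save and restore methods, the JSON layout, and the use of a threading.Lock as a stand-in for the live resource.

```python
import json
import threading


class ToyActiveProcessor:
    """Hypothetical active processor: copyable data plus one live resource."""

    def __init__(self, kind, pair_energies):
        self.kind = kind                          # meta-information, e.g. "thread"
        self.pair_energies = list(pair_energies)  # copyable forcefield data
        self._resource = threading.Lock()         # stand-in for the live resource

    def save(self):
        # Only the description and the copyable data are written out;
        # the live resource itself is deliberately left behind.
        return json.dumps({"kind": self.kind, "pair_energies": self.pair_energies})

    @classmethod
    def restore(cls, text):
        meta = json.loads(text)
        # Restoring acquires a *new* resource of the described kind.
        return cls(meta["kind"], meta["pair_energies"])


original = ToyActiveProcessor("thread", [-1.5, 0.5])
saved_text = original.save()
restored = ToyActiveProcessor.restore(saved_text)
```

The restored object describes the same job, but it is running on a different resource from the one that was in use when the simulation was saved.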