The "old" and new parallel MPI versions
NOTE: It is important to understand that the MPI versions of GARLI after 0.94 (0.96, 1.0, and 2.0) are completely different in intent and usage from the earlier GARLI MPI version, 0.94.
- Old parallel method (0.94): Used a meta-population genetic algorithm (one population per processor) to perform a single coordinated search. The intent was to make searches BETTER, not FASTER. The algorithm itself was very different from the serial (single processor) version. It was only really helpful on very large datasets (500+ sequences), and even then was not necessarily the best use of resources. It does not contain many of the improvements in terms of models and features added in versions after 0.94, and thus may not be worth using.
- New parallel method (versions 0.96, 1.0, and 2.0): Uses multiple processors to simultaneously run a number of independent search replicates or bootstrap replicates. The intent here is to get a number of searches done FASTER, not to search any differently. The results given by each of the processors are exactly what one would get by doing an equal number of runs with the non-parallel version, and all of the models implemented in the serial version are available.
The source code for MPI version 0.94 is available at the bottom of the old download website (Download page). Documentation for it appears in the manuals for versions 0.94 and 0.95.
The new MPI version is described in detail below, and the source code is available as part of the current distribution at: Download page
I do intend to release a version at some point that uses the core code of newer versions with the parallel strategy of the first MPI version.
The parallel MPI version of GARLI 2.0
The MPI version of the program is mainly for use on large computing clusters, although you can compile and use it on stand-alone machines with multiple processors running Linux or Mac OS X 10.5. You'll either need to download and compile the program yourself or have your system administrator do it (see the INSTALL file included with the source code for information on how to compile the MPI version).
The purpose of the MPI version of GARLI 2.0 is very simple: to divide a number of search replicates or bootstrap replicates across multiple processors. It will NOT make each individual search/bootstrap replicate faster (the multithreaded version will), but will rather do a number of them in parallel. It will also NOT improve the thoroughness of the search algorithm in any way. Doing 10 searches in parallel on a cluster is exactly equivalent to doing 10 searches using the serial program on your desktop machine.
When you might use the MPI version
- You have access to a computer cluster
- You want to perform MULTIPLE search replicates or bootstrap replicates
- Your cluster disallows or discourages non-parallel (serial) programs
- Your cluster is heterogeneous (made up of machines of different speeds/types)
- You have "low priority" access to some processors on a cluster, meaning that processors executing some of your searches can essentially be "paused" when high priority jobs are submitted by other users
When not to use the MPI version
- You want to do a more thorough search
- You want to finish a single search replicate quickly
How to use the MPI version of GARLI 2.0
Configuring and submitting a run
Exactly how to submit parallel jobs on a cluster is highly variable, and depends on how it is configured and the queuing system that is used. You should first familiarize yourself with the documentation for your specific cluster. In general, the command to start a parallel MPI analysis is "mpirun". You should probably NOT simply type this at the command line! It will most likely be entered in a script that is passed to the queue managing program. Your command might look something like this:
mpirun -np <# of processors to use> <MPI GARLI executable name> <# of times to execute the GARLI "garli.conf" file>
A specific example might be:
mpirun -np 4 Garli-2.0.mpi 20
The "-np 4" there is an argument to mpirun, and means that 4 processors will be used on the cluster. The way that this is specified on your cluster might vary.
The most important (and confusing) part about running this version of GARLI is the last argument, the "20" in the above example. This is information that is going to GARLI, not to mpirun. What this tells GARLI is just what it says - the number of times to execute the configuration file. This is easiest to explain by example.
Let's say that your garli.conf file was set to do one search replicate, i.e.,
searchreps = 1
The above example specified that the configuration file would be executed 20 times, giving 20 x 1 = 20 search replicates in total. However, only 4 processors were requested. Your job progresses like this:
- Each of the 4 processors executes the garli.conf file (each behaves exactly as if you had executed the same garli.conf file on your desktop machine)
- The 4 searches progress
- After a while, one of the 4 processors will finish its search. Because the search is stochastic, there is no way to predict which one will finish first.
- The newly freed processor executes the garli.conf file again, starting the 5th search.
- Other processors finish their runs, and each immediately executes the garli.conf file again (the 6th, 7th, 8th searches, etc.)
- This process continues until all 20 executions have been completed, at which point your overall MPI job is complete.
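The dispatch scheme described above can be sketched as a short simulation. This is purely illustrative Python, not GARLI code; the search durations are made-up random numbers standing in for the unpredictable runtime of each search:

```python
import heapq
import random

def simulate(num_procs, num_executions, seed=0):
    """Simulate the MPI dispatch scheme: each processor runs one execution
    of garli.conf at a time and immediately starts the next pending
    execution as soon as it finishes its current one."""
    rng = random.Random(seed)
    # (finish_time, processor_id) for the first batch of executions
    busy = [(rng.uniform(1.0, 2.0), p) for p in range(num_procs)]
    heapq.heapify(busy)
    counts = [1] * num_procs          # executions started per processor
    started = num_procs
    while started < num_executions:
        finish, proc = heapq.heappop(busy)   # first processor to free up
        counts[proc] += 1                    # it starts the next execution
        started += 1
        heapq.heappush(busy, (finish + rng.uniform(1.0, 2.0), proc))
    return counts

counts = simulate(num_procs=4, num_executions=20)
print(counts, sum(counts))
```

All 20 executions always get done; how many each processor handles depends on how long its individual searches happen to take.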
At this point you've used 4 processors to complete 20 search replicates, getting those searches completed about 4x faster than you would have on a single machine. There are many other ways that you could have accomplished the same thing by changing the "searchreps" entry in your garli.conf file and your mpirun command. For example:
searchreps = 1
mpirun -np 10 Garli-2.0.mpi 20
would execute the garli.conf file 20 times (as before), but would use 10 processors to do it, getting the 20 replicates done about 10x faster than on a single machine. Or:
searchreps = 2
mpirun -np 2 Garli-2.0.mpi 10
would execute the garli.conf file 10 times, with each execution performing 2 search replicates (10 x 2 = 20 replicates). Only 2 processors would be used in this case, so the speed gain would be only about 2x.
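The bookkeeping in these examples is simple multiplication, and can be sanity-checked in a couple of lines (Python here purely for the arithmetic):

```python
def total_replicates(searchreps, executions):
    """Total search replicates produced by an MPI GARLI run: each
    execution of garli.conf performs `searchreps` replicates."""
    return searchreps * executions

# searchreps = 1, 20 executions on 4 processors -> 20 replicates, ~4x speedup
print(total_replicates(1, 20))  # 20
# searchreps = 2, 10 executions on 2 processors -> 20 replicates, ~2x speedup
print(total_replicates(2, 10))  # 20
```

The total work is fixed by the product; the number of processors only determines how quickly it gets done.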
So, a few important things to note:
- Performing bootstrap replicates in parallel is done exactly as with search replicates, just set the number in the garli.conf file with "bootstrapreps" instead of "searchreps"
- To avoid wasting computational resources, you should always make the number of times the garli.conf file is executed a multiple of the number of processors used. For example, for 20 executions, 2, 4, 5, 10, or 20 processors are fine, but 6 or 8 are not.
- If your cluster is made up of machines of different speeds or some of the processors that you are using could be co-opted by higher priority jobs, it can be a good idea to specify few replicates per garli.conf file, and many more executions than the number of processors. For example, if you are doing 100 bootstrap replicates, specify "bootstrapreps = 1" in the garli.conf file, and use 4 processors with 100 executions. This way all 100 replicates will be completed regardless, and it is OK if one processor only manages to finish 22 replicates in the time it takes another to complete 26. Even if all of your processors are identical, this can be a good strategy because the runtime of each search can vary quite a bit because of the random nature of the search.
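The benefit of many small executions on a heterogeneous cluster can be seen with a small deterministic sketch. Again, this is illustrative Python, not GARLI code, and the relative processor speeds are hypothetical numbers:

```python
import heapq

def simulate_heterogeneous(proc_speeds, num_executions):
    """Dispatch executions greedily to whichever processor frees up first.
    proc_speeds are relative times per execution (hypothetical values)."""
    busy = [(speed, p) for p, speed in enumerate(proc_speeds)]
    heapq.heapify(busy)
    counts = [1] * len(proc_speeds)
    started = len(proc_speeds)
    while started < num_executions:
        finish, proc = heapq.heappop(busy)
        counts[proc] += 1
        started += 1
        heapq.heappush(busy, (finish + proc_speeds[proc], proc))
    return counts

# 100 single-replicate executions on 4 processors, two of them 20% slower
counts = simulate_heterogeneous([1.0, 1.0, 1.2, 1.2], 100)
print(counts, sum(counts))
```

The faster processors end up completing more executions, but all 100 replicates get done either way; nothing is wasted waiting on the slowest machine.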
- To reiterate, the algorithm and results generated by an MPI run are no different from what one would get using the same number of serial runs. This is exactly the same situation as with MrBayes, except that there only one execution of the whole algorithm may be necessary. By dividing up the work among processors the job runs that much faster, but the results are exactly identical. In the GARLI case the parallelism is even more straightforward, in that multiple search replicates will always need to be done, either in multiple program executions or in a single one. By running the individual replicates in parallel (either via the MPI wrapper or as individual serial executions), the necessary amount of work simply gets done faster.
The output and results of MPI GARLI runs
The results of an MPI run will be spread across a number of result files (just as they would in doing multiple serial executions of the program). As for how to deal with this distribution of results, take a look at these sections of the Advanced topics page:
For normal or bootstrap searches: Advanced topics: Examining/collecting results
For bootstrap searches specifically: Advanced_topics: A bootstrap analysis