Recently I've been doing some heavy research work. One part of it involves invoking a simulation script with different inputs and parameters, and then an analysis script to analyze the simulation output.
At first, this was an easy bash loop:
(
  echo 3 11
  echo 3 11
  echo 5 19
  echo 5 19
) | while read -r -a L; do
  X=${L[0]}
  Y=${L[1]}
  python2 simulation.py --x=$X --y=$Y < input.tsv > $X-$Y.simulation.log
  gawk -f analysis.awk $X-$Y.simulation.log > $X-$Y.analysis.tsv
done
The loop works fine, but it takes too long when the input gets larger, because the scripts run sequentially. Since we have a big server with 32 CPU cores, can I run the scripts in parallel?
So I wrote this nifty little script, parallelize.sh:
#!/bin/bash
# Run commands in parallel.
# Usage: JOBS=8 ./parallelize.sh < commands.lst
# JOBS: number of subprocesses.
#   If omitted, use the number of CPUs.
#   If specified as AxB, A subprocesses run in parallel,
#   and the JOBS environment variable passed to each subprocess is B.
# PARALLEL_VERBOSE: if set, echo each command to stderr before running it.
# stdin: list of commands, one per line.
# Copyright 2016 Arizona Board of Regents
# GNU Lesser General Public License version 3 or later

JOBS=${JOBS:-$(grep -c ^processor /proc/cpuinfo)}
JOBS1=$(echo "$JOBS" | cut -dx -f1)
JOBS2=$(echo "$JOBS" | cut -dx -sf2-)
JOBS2=${JOBS2:-1}

while read -r CMD; do
  # Throttle: wait until the number of running jobs drops below JOBS1.
  while [[ $(jobs -p | wc -l) -ge $JOBS1 ]]; do
    sleep 0.1
  done
  if [[ -n $PARALLEL_VERBOSE ]]; then
    echo "$CMD" >&2
  fi
  JOBS=$JOBS2 bash -c "$CMD" &
done
wait
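Before wiring it into the real pipeline, a quick sanity check helps. This is just a throwaway test I run by hand, not part of the research workflow:

(
  echo 'sleep 2; echo job 1 done'
  echo 'sleep 2; echo job 2 done'
  echo 'sleep 2; echo job 3 done'
  echo 'sleep 2; echo job 4 done'
) | JOBS=2 PARALLEL_VERBOSE=1 ./parallelize.sh
# With JOBS=2, at most two sleeps run at once, so the four jobs finish
# in about 4 seconds instead of 8; each command is echoed to stderr as it starts.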
And the loop from the beginning can be converted to:
(
  echo 3 11
  echo 3 11
  echo 5 19
  echo 5 19
) | while read -r -a L; do
  X=${L[0]}
  Y=${L[1]}
  # Since the analysis step depends on the output of the simulation, they cannot run in parallel.
  # Thus, they have to go on the same line to be executed sequentially.
  echo -n "python2 simulation.py --x=$X --y=$Y < input.tsv > $X-$Y.simulation.log"
  echo -n " ; "
  echo -n "gawk -f analysis.awk $X-$Y.simulation.log > $X-$Y.analysis.tsv"
  echo
done | ./parallelize.sh
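Each line fed to parallelize.sh is one self-contained job. For the first input pair above (3 11), the loop emits this single line:

python2 simulation.py --x=3 --y=11 < input.tsv > 3-11.simulation.log ; gawk -f analysis.awk 3-11.simulation.log > 3-11.analysis.tsv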
The parallelize.sh script can be nested as well.
#!/bin/bash
# vary-y.sh
# X is taken from the environment (set by vary-x-y.sh below).
(
  echo 11
  echo 13
  echo 17
  echo 19
) | while read -r -a L; do
  Y=${L[0]}
  echo -n "python2 simulation.py --x=$X --y=$Y < input.tsv > $X-$Y.simulation.log"
  echo -n " ; "
  echo -n "gawk -f analysis.awk $X-$Y.simulation.log > $X-$Y.analysis.tsv"
  echo
done | ./parallelize.sh
#!/bin/bash
# vary-x-y.sh
(
  echo 2
  echo 3
  echo 5
  echo 7
) | while read -r -a L; do
  X=${L[0]}
  echo "X=$X ./vary-y.sh"
done | ./parallelize.sh
# -- invocation --
JOBS=2x3 ./vary-x-y.sh
# At most 2 vary-y.sh subprocesses will be running at the same time.
# JOBS=3 will be passed to vary-y.sh and applied to its nested parallelize.sh call, so that
# at most 3 simulation+analysis scripts can be running at the same time within each vary-y.sh process.
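For what it's worth, I size the two levels so that their product stays at or below the core count. The split below is just one plausible choice for our 32-core server, not a recommendation:

NCPU=$(grep -c ^processor /proc/cpuinfo)  # 32 on our server
OUTER=4                                   # concurrent vary-y.sh processes
INNER=$(( NCPU / OUTER ))                 # jobs inside each vary-y.sh, here 8
JOBS=${OUTER}x${INNER} ./vary-x-y.sh      # same as JOBS=4x8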
Now my experiments can run up to 32 times faster, and I have fewer chances to use "my simulation is running" as an excuse.