parallelize.sh: run commands in parallel with bash

Recently I have been doing some heavy research work. One part of it involves invoking a simulation script with different inputs and parameters, then running an analysis script on the simulation output.

At first, this was a simple bash loop:

(
  echo 3 11
  echo 3 11
  echo 5 19
  echo 5 19
) | while read -r -a L; do
  X=${L[0]}
  Y=${L[1]}
  python2 simulation.py --x=$X --y=$Y < input.tsv > $X-$Y.simulation.log
  gawk -f analysis.awk $X-$Y.simulation.log > $X-$Y.analysis.tsv
done

The loop works fine, but it takes too long as the input grows, because the scripts run sequentially. Since we have a big server with 32 CPU cores, can I run the scripts in parallel?

So I wrote this nifty little script, parallelize.sh:

#!/bin/bash
# Run commands in parallel.
# Usage: JOBS=8 ./parallelize.sh < commands.lst
#   JOBS: number of subprocesses
#     If omitted, use the number of CPUs.
#     If specified as AxB, A subprocesses run in parallel,
#     and the JOBS environment variable passed to each subprocess is B.
#   PARALLEL_VERBOSE: if non-empty, print each command to stderr before running it
#   stdin: list of commands, one per line

# Copyright 2016 Arizona Board of Regents
# GNU Lesser General Public License version 3 or later

JOBS=${JOBS:-$(grep -c ^processor /proc/cpuinfo)}
JOBS1=$(echo "$JOBS" | cut -dx -f1)     # A: job limit at this level
JOBS2=$(echo "$JOBS" | cut -dx -sf2-)   # B: passed down; empty if there is no 'x'
JOBS2=${JOBS2:-1}

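while read -r CMD; do
  # Wait until fewer than $JOBS1 background jobs are still running.
  while [[ $(jobs -rp | wc -l) -ge $JOBS1 ]]; do
    sleep 0.1
  done
  if [[ -n $PARALLEL_VERBOSE ]]; then
    echo "$CMD" >&2
  fi
  # Each command line runs in its own bash, with JOBS set for nested calls.
  JOBS=$JOBS2 bash -c "$CMD" &
done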
wait
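
To see the throttling at work without touching the real simulation, here is a minimal sketch of the same idea wrapped in a function. The name run_throttled and the sleep commands are made up for illustration; the loop counts running background jobs with `jobs -rp` and polls every 0.1 seconds.

```bash
#!/bin/bash
# run_throttled is a hypothetical stand-in for parallelize.sh:
# it reads commands from stdin and keeps at most $1 of them running.
run_throttled() {
  local limit=$1 cmd
  while read -r cmd; do
    # Block until fewer than $limit background jobs are still running.
    while [[ $(jobs -rp | wc -l) -ge $limit ]]; do
      sleep 0.1
    done
    bash -c "$cmd" &
  done
  wait   # let the last batch finish
}

start=$SECONDS
run_throttled 2 <<'EOF'
sleep 1
sleep 1
sleep 1
sleep 1
EOF
elapsed=$((SECONDS - start))
echo "4 one-second jobs, 2 at a time: ${elapsed}s"
```

With a limit of 2, the four one-second jobs should take roughly two seconds rather than one or four.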

And the loop from the beginning can be converted to:

(
  echo 3 11
  echo 3 11
  echo 5 19
  echo 5 19
) | while read -r -a L; do
  X=${L[0]}
  Y=${L[1]}
  echo -n "python2 simulation.py --x=$X --y=$Y < input.tsv > $X-$Y.simulation.log"
  echo -n " ; "
  echo -n "gawk -f analysis.awk $X-$Y.simulation.log > $X-$Y.analysis.tsv"
  echo
  # Since the analysis step depends on the output of the simulation, the two cannot run in parallel.
  # Thus, they have to go on the same line so that they are executed sequentially.
done | ./parallelize.sh
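
One thing to keep in mind when joining the two steps with " ; ": the analysis runs even if the simulation fails. The sketch below shows the difference, using `false` as a stand-in for a failing simulation (the stand-in commands are illustrative only). Joining with " && " instead might be preferable when a failed simulation should skip its analysis.

```bash
# 'false' stands in for a simulation that exits with an error;
# 'echo analysis-ran' stands in for the analysis step.
with_semicolon=$(bash -c 'false ; echo analysis-ran')
with_and=$(bash -c 'false && echo analysis-ran' || true)

echo "joined with ';'  -> '${with_semicolon}'"   # the second step still ran
echo "joined with '&&' -> '${with_and}'"         # the second step was skipped
```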

The parallelize.sh script can be nested as well.

# vary-y.sh
(
  echo 11
  echo 13
  echo 17
  echo 19
) | while read -r -a L; do
  Y=${L[0]}
  echo -n "python2 simulation.py --x=$X --y=$Y < input.tsv > $X-$Y.simulation.log"
  echo -n " ; "
  echo -n "gawk -f analysis.awk $X-$Y.simulation.log > $X-$Y.analysis.tsv"
  echo
done | ./parallelize.sh

# vary-x-y.sh
(
  echo 2
  echo 3
  echo 5
  echo 7
) | while read -r -a L; do
  X=${L[0]}
  echo "X=$X ./vary-y.sh"
done | ./parallelize.sh

# -- invocation --
JOBS=2x3 ./vary-x-y.sh
# At most 2 vary-y.sh subprocesses will be running at the same time.
# JOBS=3 will be passed to vary-y.sh and applied to its nested parallelize.sh call, so that
# at most 3 simulation+analysis scripts can be running at the same time within each vary-y.sh process.
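
The AxB splitting can be checked in isolation. The sketch below reuses the two `cut` calls from parallelize.sh inside a helper whose name, split_jobs, is made up for this example; it prints the job limit for the current level followed by the value handed down to nested calls.

```bash
# split_jobs is a hypothetical helper; the cut calls mirror parallelize.sh.
split_jobs() {
  local a b
  a=$(echo "$1" | cut -dx -f1)     # text before the first 'x'
  b=$(echo "$1" | cut -dx -sf2-)   # everything after it; -s prints nothing when there is no 'x'
  echo "$a ${b:-1}"                # nested level defaults to 1
}

split_jobs 8       # prints "8 1"
split_jobs 2x3     # prints "2 3"
split_jobs 2x3x4   # prints "2 3x4"
```

Because the second `cut` keeps everything after the first x, a value such as JOBS=2x3x4 should in principle carry over to three levels of nesting, although the examples here only use two.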

Now my experiments can run up to 32 times faster on the 32-core server. And I have fewer chances to use "my simulation is running" as an excuse.

Tags: Linux bash