resumable batch jobs with gnu parallel

gnu parallel is the shit. it's essentially a map/reduce, pmap, or ThreadPoolExecutor, but on the command line.

many a soul has heard of it but was scared off by the docs, since the tool does have a lot of options, and the authors did not bother with css.

i came to parallel in a time of need and it delivered, so now i'm here to bring you the good word.

demo

(terminal recording of the command in action)

wtf did you just watch?

tldr: gnu parallel allows you to easily construct resumable batch pipelines.

background

if you've ever run code in production, chances are you've come across the need to run some command in batches. oftentimes this will involve cleanup operations on some data store.

the particular case i found myself dealing with was deleting a set of keys in redis.

on the surface, this seems quite straightforward. you can loop over the keys and delete them one by one:

$ for key in $(cat keys.txt); do redis-cli del "$key"; done

there are many requirements that this does not address, however: it issues one network round-trip per key, runs strictly serially, gives no indication of progress, and cannot be paused and picked up again later.

especially pause/resume is crucial -- and tends to be messy to implement by hand.
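to get a feel for why, here is a minimal sketch of hand-rolled resume logic using a state file. the file names are made up for illustration, and `echo` stands in for the actual `redis-cli del` call:

```shell
# minimal hand-rolled resume sketch: record each processed key in a state file
printf 'key0000\nkey0001\nkey0002\n' > keys.txt   # sample input
touch done.txt                                    # state file of processed keys
while read -r key; do
  grep -qxF "$key" done.txt && continue   # skip keys finished in a previous run
  echo "deleting $key"                    # real version: redis-cli del "$key"
  echo "$key" >> done.txt                 # record progress after each key
done < keys.txt
```

and this still covers only resume -- batching, parallelism, and a progress bar would each need more hand-rolled code on top.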

that's where gnu parallel comes in.

breakdown

looking at the command from the demo:

$ cat keys.txt | parallel --bar --joblog +keys.joblog --resume -n 1000 'redis-cli del {}'

the input file in this case is a newline-delimited list of keys.

what this command actually does is split the input into batches of 1k, and invoke the provided redis-cli command with a full batch of arguments. so it will end up executing something along the lines of:

redis-cli del key0000 key0001 key0002 key0003 ...
redis-cli del key1000 key1001 key1002 key1003 ...
redis-cli del key2000 key2001 key2002 key2003 ...

breaking it down into its components:

$ cat keys.txt | \
    parallel \
      --bar \                 # display a pretty progress bar
      --joblog +keys.joblog \ # track the commands processed in a joblog file
      --resume \              # resume from the previous position
      -n 1000 \               # batch input into batches of 1k
      'redis-cli del {}'      # command to be run, {} is replaced with the batch
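the joblog itself is just a tab-separated file, and --resume works by comparing it against the input and skipping sequence numbers that already completed. it looks roughly like this (timings made up; ":" in the Host column means the local machine):

```text
Seq  Host  Starttime       JobRuntime  Send  Receive  Exitval  Signal  Command
1    :     1600000000.000  0.042       0     5        0        0       redis-cli del key0000 key0001 ...
2    :     1600000000.100  0.039       0     5        0        0       redis-cli del key1000 key1001 ...
```

the leading + in +keys.joblog tells parallel to append to an existing joblog instead of overwriting it.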

by default, gnu parallel will run as many parallel tasks as there are logical cpu cores, but this can be overridden via -j.

in case you need to rate-limit the process, you can inject some latency via --delay 1s, which waits a second between starting jobs -- combined with -n 1000, that caps the deletion at roughly 1000 keys per second.

pipe

if the target command supports input via a pipe, you can do even better:

$ cat keys.txt | awk '{ print "del " $0 }' | parallel --joblog +keys.joblog --resume --pipe 'redis-cli --pipe'

not only does this pipeline the commands at the network level (in the case of redis-cli), it also spawns far fewer processes.

what this command does is spawn one new redis-cli process per 1MB block of input (the block size can be overridden via --block) and pipe that block into it.

something along these lines:

printf 'del key0000\ndel key0001\ndel key0002\n...' | redis-cli --pipe
printf 'del key3567\ndel key3568\ndel key3569\n...' | redis-cli --pipe

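if you want to sanity-check the awk stage on its own, it just prefixes each key with del:

```shell
# the awk stage in isolation: turn raw keys into del commands
printf 'key0000\nkey0001\n' | awk '{ print "del " $0 }'
# prints:
# del key0000
# del key0001
```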
breakdown:

$ cat keys.txt | \
    awk '{ print "del " $0 }' | \ # pre-process keys to form del commands
    parallel \
      --joblog +keys.joblog \
      --resume \
      --pipe \                    # pass input via a pipe instead of args
      'redis-cli --pipe'          # redis-cli accepts stream of commands via pipe

just like the previous example, this can also be interrupted and resumed at any point.

further reading

this post covered only a small subset of what parallel can do. it is capable of spreading jobs across a cluster of machines over ssh, retrying failed jobs, and more.

brought to you by @igorwhilefalse