xargs.html - www.codemadness.org - www.codemadness.org saait content files
 (HTM) git clone git://git.codemadness.org/www.codemadness.org
       ---
       xargs.html (9768B)
       ---
            1 <!DOCTYPE html>
            2 <html dir="ltr" lang="en">
            3 <head>
            4         <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            5         <meta http-equiv="Content-Language" content="en" />
            6         <meta name="viewport" content="width=device-width" />
            7         <meta name="keywords" content="xargs, wow hyper speed" />
            8         <meta name="description" content="xargs: an example for parallel batch jobs" />
            9         <meta name="author" content="Hiltjo" />
           10         <meta name="generator" content="Static content generated using saait: https://codemadness.org/saait.html" />
           11         <title>xargs: an example for parallel batch jobs - Codemadness</title>
           12         <link rel="stylesheet" href="style.css" type="text/css" media="screen" />
           13         <link rel="stylesheet" href="print.css" type="text/css" media="print" />
           14         <link rel="alternate" href="atom.xml" type="application/atom+xml" title="Codemadness Atom Feed" />
           15         <link rel="alternate" href="atom_content.xml" type="application/atom+xml" title="Codemadness Atom Feed with content" />
           16         <link rel="icon" href="/favicon.png" type="image/png" />
           17 </head>
           18 <body>
           19         <nav id="menuwrap">
           20                 <table id="menu" width="100%" border="0">
           21                 <tr>
           22                         <td id="links" align="left">
           23                                 <a href="index.html">Blog</a> |
           24                                 <a href="/git/" title="Git repository with some of my projects">Git</a> |
           25                                 <a href="/releases/">Releases</a> |
           26                                 <a href="gopher://codemadness.org">Gopherhole</a>
           27                         </td>
           28                         <td id="links-contact" align="right">
           29                                 <span class="hidden"> | </span>
           30                                 <a href="feeds.html">Feeds</a> |
           31                                 <a href="pgp.asc">PGP</a> |
           32                                 <a href="mailto:hiltjo@AT@codemadness.DOT.org">Mail</a>
           33                         </td>
           34                 </tr>
           35                 </table>
           36         </nav>
           37         <hr class="hidden" />
           38         <main id="mainwrap">
           39                 <div id="main">
           40                         <article>
           41 <header>
           42         <h1>xargs: an example for parallel batch jobs</h1>
           43         <p>
           44         <strong>Last modification on </strong> <time>2023-12-17</time>
           45         </p>
           46 </header>
           47 
           48 <p>This describes a simple shell script programming pattern to process a list of
           49 jobs in parallel. The whole example is contained in one file.</p>
           50 <h1>Simple but less optimal example</h1>
           51 <pre><code>#!/bin/sh
           52 maxjobs=4
           53 
           54 # fake program for example purposes.
           55 someprogram() {
           56         echo "Yep yep, I'm totally a real program!"
           57         sleep "$1"
           58 }
           59 
           60 # run(arg1, arg2)
           61 run() {
           62         echo "[$1] $2 started" &gt;&amp;2
           63         someprogram "$1" &gt;/dev/null
           64         status="$?"
           65         echo "[$1] $2 done" &gt;&amp;2
           66         return "$status"
           67 }
           68 
           69 # process the jobs.
           70 j=1
           71 for f in 1 2 3 4 5 6 7 8 9 10; do
           72         run "$f" "something" &amp;
           73 
           74         jm=$((j % maxjobs)) # shell arithmetic: modulo
           75         test "$jm" = "0" &amp;&amp; wait
           76         j=$((j+1))
           77 done
           78 wait
           79 </code></pre>
           80 <h1>Why is this less optimal</h1>
           81 <p>This is less optimal because it waits until all jobs in the same batch are finished
           82 (each batch contains $maxjobs items).</p>
           83 <p>For example with 2 items per batch and 4 total jobs it could be:</p>
           84 <ul>
           85 <li>Job 1 is started.</li>
           86 <li>Job 2 is started.</li>
           87 <li>Job 2 is done.</li>
           88 <li>Job 1 is done.</li>
           89 <li>Wait: wait on process status of all background processes.</li>
           90 <li>Job 3 in new batch is started.</li>
           91 </ul>
           92 <p>This could be optimized to:</p>
           93 <ul>
           94 <li>Job 1 is started.</li>
           95 <li>Job 2 is started.</li>
           96 <li>Job 2 is done.</li>
           97 <li>Job 3 in new batch is started (immediately).</li>
           98 <li>Job 1 is done.</li>
           99 <li>...</li>
          100 </ul>
          101 <p>It also does not handle signals such as SIGINT (^C). However, the xargs example
          102 below does:</p>
          103 <h1>Example</h1>
          104 <pre><code>#!/bin/sh
          105 maxjobs=4
          106 
          107 # fake program for example purposes.
          108 someprogram() {
          109         echo "Yep yep, I'm totally a real program!"
          110         sleep "$1"
          111 }
          112 
          113 # run(arg1, arg2)
          114 run() {
          115         echo "[$1] $2 started" &gt;&amp;2
          116         someprogram "$1" &gt;/dev/null
          117         status="$?"
          118         echo "[$1] $2 done" &gt;&amp;2
          119         return "$status"
          120 }
          121 
          122 # child process job.
          123 if test "$CHILD_MODE" = "1"; then
          124         run "$1" "$2"
          125         exit "$?"
          126 fi
          127 
          128 # generate a list of jobs for processing.
          129 list() {
          130         for f in 1 2 3 4 5 6 7 8 9 10; do
          131                 printf '%s\0%s\0' "$f" "something"
          132         done
          133 }
          134 
          135 # process jobs in parallel.
          136 list | CHILD_MODE="1" xargs -r -0 -P "${maxjobs}" -L 2 "$(readlink -f "$0")"
          137 </code></pre>
          138 <h1>Run and timings</h1>
          139 <p>Although the above example is kind of stupid, it already shows that queueing the
          140 jobs is more efficient.</p>
          141 <p>Script 1:</p>
          142 <pre><code>time ./script1.sh
          143 [...snip snip...]
          144 real    0m22.095s
          145 </code></pre>
          146 <p>Script 2:</p>
          147 <pre><code>time ./script2.sh
          148 [...snip snip...]
          149 real    0m18.120s
          150 </code></pre>
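<p>The two timings can also be checked on paper. The sketch below is not part of
the scripts above: it assumes that job i takes exactly i seconds (as the fake
program does) and ignores any process startup overhead. The function names
batch_time and queue_time are made up for this illustration.</p>
<pre><code>#!/bin/sh
# Assumption: job i takes exactly i seconds, jobs are numbered 1..njobs.
maxjobs=4
njobs=10

# Script 1: every batch lasts as long as its slowest job.
batch_time() {
	total=0
	i=1
	while [ "$i" -le "$njobs" ]; do
		last=$((i + maxjobs - 1))
		if [ "$last" -gt "$njobs" ]; then last="$njobs"; fi
		# durations increase with i, so the last job of a batch is the slowest.
		total=$((total + last))
		i=$((last + 1))
	done
	echo "$total"
}

# Script 2: the next job always goes to the worker that becomes free first.
queue_time() {
	slots="" # finish time of each worker, space separated
	k=0
	while [ "$k" -lt "$maxjobs" ]; do
		slots="$slots 0"
		k=$((k + 1))
	done
	i=1
	while [ "$i" -le "$njobs" ]; do
		# sort the finish times; the first one is the earliest free worker.
		set -- $(printf '%s\n' $slots | sort -n)
		first="$1"
		shift
		slots="$* $((first + i))"
		i=$((i + 1))
	done
	# the whole run ends when the last worker finishes.
	printf '%s\n' $slots | sort -n | tail -n 1
}

echo "script 1 (batches): $(batch_time)s"
echo "script 2 (queue):   $(queue_time)s"
</code></pre>
<p>Under these assumptions it prints 22s for the batches and 18s for the queue,
which matches the measured times apart from a small overhead.</p>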
          151 <h1>How it works</h1>
          152 <p>The parent process:</p>
          153 <ul>
          154 <li>The parent, using xargs, handles the queue of jobs and schedules the jobs to
          155 execute as a child process.</li>
          156 <li>The list function writes the parameters to stdout. These parameters are
          157 separated by the NUL byte separator. The NUL byte separator is used because
          158 this character cannot be used in filenames (which can contain spaces or even
          159 newlines) and cannot be used in text (the NUL byte terminates the buffer for
          160 a string).</li>
          161 <li>The -L option must match the number of arguments that are specified for the
          162 job. It splits the specified parameters per job.</li>
          163 <li>The expression "$(readlink -f "$0")" gets the absolute path to the
          164 shellscript itself. This is passed as the executable to run for xargs.</li>
          165 <li>xargs calls the script itself with the specified parameters it is being fed.
          166 The environment variable $CHILD_MODE is set to indicate to the script itself
          167 it is run as a child process of the script.</li>
          168 </ul>
          169 <p>The child process:</p>
          170 <ul>
          171 <li><p>The command-line arguments are passed by the parent using xargs.</p>
          172 </li>
          173 <li><p>The environment variable $CHILD_MODE is set to indicate to the script itself
          174 it is run as a child process of the script.</p>
          175 </li>
          176 <li><p>The script itself (run in child mode) only executes the task and
          177 signals its status back to xargs and the parent.</p>
          178 </li>
          179 <li><p>The exit status of the child program is signaled to xargs. This could be
          180 handled, for example to stop on the first failure (in this example it is not).
          181 For example, if the program is killed, stopped or exits with status 255, then
          182 xargs stops running as well.</p>
          183 </li>
          184 </ul>
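<p>The stop on exit status 255 can be tried in isolation. This is a sketch
separate from the scripts above: the inline sh -c child and the name stop_demo
are made up for the illustration, jobs take a single argument (hence -n 1) and
an xargs that combines -0 and -n as expected is assumed (GNU findutils here).</p>
<pre><code>#!/bin/sh
# stop_demo: three single-argument jobs; the child for job 2 exits with 255.
stop_demo() {
	printf '%s\0' 1 2 3 | xargs -0 -n 1 sh -c '
		echo "job $1"
		if [ "$1" = "2" ]; then exit 255; fi
	' sh
}

if stop_demo; then
	echo "all jobs ran"
else
	echo "xargs stopped early, exit status $?"
fi
</code></pre>
<p>Job 3 is never started. The exact non-zero exit status reported by xargs may
differ per implementation.</p>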
          185 <h1>Description of used xargs options</h1>
          186 <p>From the OpenBSD man page: <a href="https://man.openbsd.org/xargs">https://man.openbsd.org/xargs</a></p>
          187 <pre><code>xargs - construct argument list(s) and execute utility
          188 </code></pre>
          189 <p>Options explained:</p>
          190 <ul>
          191 <li>-r: Do not run the command if there are no arguments. Normally the command
          192 is executed at least once even if there are no arguments.</li>
          193 <li>-0: Change xargs to expect NUL ('\0') characters as separators, instead of
          194 spaces and newlines.</li>
          195 <li>-P maxprocs: Parallel mode: run at most maxprocs invocations of utility
          196 at once.</li>
          197 <li>-L number: Call utility for every number of non-empty lines read. A line
          198 ending in unescaped white space and the next non-empty line are considered
          199 to form one single line. If EOF is reached and fewer than number lines have
          200 been read then utility will be called with the available lines.</li>
          201 </ul>
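<p>A small way to see these options at work, outside of the job script. This is
only a sketch: the functions pairs, grouped and empty are made up for the
illustration, and it assumes an xargs on which -0 and -L combine as in the
example above (tested reasoning is based on GNU findutils).</p>
<pre><code>#!/bin/sh
# pairs: two jobs with two NUL-terminated fields each.
# The second field of job 2 contains a space and a newline on purpose.
pairs() {
	printf '%s\0%s\0' "1" "plain"
	printf '%s\0%s\0' "2" "with space
and newline"
}

# grouped: -0 splits on NUL only, -L 2 hands two fields to each invocation.
grouped() {
	pairs | xargs -r -0 -L 2 sh -c 'printf "[%s] [%s]\n" "$1" "$2"' sh
}

# empty: no input at all, so -r prevents even a single invocation.
empty() {
	printf '' | xargs -r -0 echo "never printed"
}

grouped
empty
</code></pre>
<p>Each invocation receives exactly two arguments, and the field with the space
and the newline arrives in one piece. The empty input produces no output at
all.</p>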
          202 <h1>xargs options -0 and -P, portability and historic context</h1>
          203 <p>Some of the options, like -P, are non-POSIX as of writing (2023):
          204 <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/xargs.html">https://pubs.opengroup.org/onlinepubs/9699919799/utilities/xargs.html</a>.
          205 However, many systems have supported this useful extension for many years now.</p>
          206 <p>The specification even mentions implementations which support parallel
          207 operations:</p>
          208 <p>"The version of xargs required by this volume of POSIX.1-2017 is required to
          209 wait for the completion of the invoked command before invoking another command.
          210 This was done because historical scripts using xargs assumed sequential
          211 execution. Implementations wanting to provide parallel operation of the invoked
          212 utilities are encouraged to add an option enabling parallel invocation, but
          213 should still wait for termination of all of the children before xargs
          214 terminates normally."</p>
          215 <p>Some historic context:</p>
          216 <p>The xargs -0 option was added on 1996-06-11 by Theo de Raadt, about a year
          217 after the NetBSD import (over 27 years ago at the time of writing):</p>
          218 <p><a href="http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/xargs/xargs.c?rev=1.2&amp;content-type=text/x-cvsweb-markup">CVS log</a></p>
          219 <p>On OpenBSD the xargs -P option was added on 2003-12-06 by syncing the FreeBSD
          220 code:</p>
          221 <p><a href="http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/xargs/xargs.c?rev=1.14&amp;content-type=text/x-cvsweb-markup">CVS log</a></p>
          222 <p>Looking at the imported git history log of GNU findutils (which has xargs), the
          223 very first commit already had the -0 and -P options:</p>
          224 <p><a href="https://savannah.gnu.org/git/?group=findutils">git log</a></p>
          225 <pre><code>commit c030b5ee33bbec3c93cddc3ca9ebec14c24dbe07
          226 Author: Kevin Dalley &lt;kevin@seti.org&gt;
          227 Date:   Sun Feb 4 20:35:16 1996 +0000
          228 
          229     Initial revision
          230 </code></pre>
          231 <h1>xargs: some incompatibilities found</h1>
          232 <ul>
          233 <li>With the -0 option, empty fields are handled differently in different
          234 implementations.</li>
          235 <li>The -n and -L options don't work correctly in many of the BSD implementations.
          236 Some count empty fields, some don't.  Early implementations in FreeBSD and
          237 OpenBSD only processed the first line.  In OpenBSD this has been improved
          238 around 2017.</li>
          239 </ul>
          240 <p>Depending on what you want to do, a workaround could be to use the -0 option
          241 with a single field per job together with the -n flag, and then split that
          242 field by a separator in each child program invocation.</p>
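<p>A minimal sketch of that workaround, not taken from the original scripts. It
assumes the parameters never contain the chosen separator (a pipe character
here) and uses an inline sh -c child instead of the script calling itself. The
names sep, SEP and workaround_demo are made up for the illustration.</p>
<pre><code>#!/bin/sh
# Assumption: the parameters never contain the separator chosen here.
sep='|'

# list: one NUL-terminated field per job, both parameters joined by $sep.
list() {
	for f in 1 2 3; do
		printf '%s%s%s\0' "$f" "$sep" "some thing"
	done
}

# workaround_demo: one field per invocation; the child splits it again.
workaround_demo() {
	list | SEP="$sep" xargs -r -0 -n 1 sh -c '
		first=${1%%"$SEP"*} # text before the first separator
		rest=${1#*"$SEP"}   # text after the first separator
		echo "[$first] [$rest]"
	' sh
}

workaround_demo
</code></pre>
<p>Each child receives a single argument, so the grouping no longer depends on
how an implementation counts fields.</p>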
          243 <h1>References</h1>
          244 <ul>
          245 <li>xargs: <a href="https://man.openbsd.org/xargs">https://man.openbsd.org/xargs</a></li>
          246 <li>printf: <a href="https://man.openbsd.org/printf">https://man.openbsd.org/printf</a></li>
          247 <li>ksh, wait: <a href="https://man.openbsd.org/ksh#wait">https://man.openbsd.org/ksh#wait</a></li>
          248 <li>wait(2): <a href="https://man.openbsd.org/wait">https://man.openbsd.org/wait</a></li>
          249 </ul>
          250 
          251                         </article>
          252                 </div>
          253         </main>
          254 </body>
          255 </html>