This describes a simple shell script programming pattern to process a list of
jobs in parallel. The example script is contained in a single file.


# Simple but less optimal example

    #!/bin/sh
    maxjobs=4

    # fake program for example purposes.
    someprogram() {
        echo "Yep yep, I'm totally a real program!"
        sleep "$1"
    }

    # run(arg1, arg2)
    run() {
        echo "[$1] $2 started" >&2
        someprogram "$1" >/dev/null
        status="$?"
        echo "[$1] $2 done" >&2
        return "$status"
    }

    # process the jobs.
    j=1
    for f in 1 2 3 4 5 6 7 8 9 10; do
        run "$f" "something" &

        jm=$((j % maxjobs)) # shell arithmetic: modulo
        test "$jm" = "0" && wait
        j=$((j+1))
    done
    wait


# Why is this less optimal

This is less optimal because it waits until all jobs in the same batch are
finished before starting the next batch (each batch contains $maxjobs items).

For example with 2 items per batch and 4 total jobs it could be:

* Job 1 is started.
* Job 2 is started.
* Job 2 is done.
* Job 1 is done.
* Wait: wait on process status of all background processes.
* Job 3 in new batch is started.


This could be optimized to:

* Job 1 is started.
* Job 2 is started.
* Job 2 is done.
* Job 3 in new batch is started (immediately).
* Job 1 is done.
* ...


It also does not handle signals such as SIGINT (^C). However, the xargs example
below does:


# Example

    #!/bin/sh
    maxjobs=4

    # fake program for example purposes.
    someprogram() {
        echo "Yep yep, I'm totally a real program!"
        sleep "$1"
    }

    # run(arg1, arg2)
    run() {
        echo "[$1] $2 started" >&2
        someprogram "$1" >/dev/null
        status="$?"
        echo "[$1] $2 done" >&2
        return "$status"
    }

    # child process job.
    if test "$CHILD_MODE" = "1"; then
        run "$1" "$2"
        exit "$?"
    fi

    # generate a list of jobs for processing.
    list() {
        for f in 1 2 3 4 5 6 7 8 9 10; do
            printf '%s\0%s\0' "$f" "something"
        done
    }

    # process jobs in parallel.
    list | CHILD_MODE="1" xargs -r -0 -P "${maxjobs}" -L 2 "$(readlink -f "$0")"


# Run and timings

Although the above example is kind of silly, it already shows that queueing the
jobs is more efficient.

Script 1:

    time ./script1.sh
    [...snip snip...]
    real 0m22.095s

Script 2:

    time ./script2.sh
    [...snip snip...]
    real 0m18.120s

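The two timings can be roughly predicted without running any jobs. The sketch
below is not part of the original scripts: it assumes each job takes exactly
as many seconds as its number (as with the `sleep "$1"` above) and ignores any
process startup overhead. The function names `batched` and `queued` are made
up for this illustration.

```shell
#!/bin/sh
maxjobs=4
durations="1 2 3 4 5 6 7 8 9 10"

# batched(): every batch takes as long as its slowest job.
batched() {
    total=0
    max=0
    j=0
    for d in $durations; do
        if test "$d" -gt "$max"; then
            max="$d"
        fi
        j=$((j + 1))
        if test "$((j % maxjobs))" = "0"; then
            total=$((total + max))
            max=0
        fi
    done
    # add the last, possibly incomplete, batch.
    echo "$((total + max))"
}

# queued(): each job starts as soon as any of the $maxjobs slots is free.
queued() {
    # time at which each slot becomes free, initially 0.
    slots=""
    i=0
    while test "$i" -lt "$maxjobs"; do
        slots="$slots 0"
        i=$((i + 1))
    done
    for d in $durations; do
        # find the slot that is free the earliest.
        min=""
        for s in $slots; do
            if test -z "$min" || test "$s" -lt "$min"; then
                min="$s"
            fi
        done
        # run the job in that slot: it is busy until min + d.
        new=""
        replaced=0
        for s in $slots; do
            if test "$replaced" = "0" && test "$s" = "$min"; then
                s=$((min + d))
                replaced=1
            fi
            new="$new $s"
        done
        slots="$new"
    done
    # the total time is when the last slot becomes free.
    end=0
    for s in $slots; do
        if test "$s" -gt "$end"; then
            end="$s"
        fi
    done
    echo "$end"
}

echo "batched: $(batched)s, queued: $(queued)s"
```

Under these assumptions the estimate comes out at 22 seconds for the batched
loop and 18 seconds for the queue, which is in line with the measured timings.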

# How it works

The parent process:

* The parent, using xargs, handles the queue of jobs and schedules each job to
  execute as a child process.
* The list function writes the parameters to stdout. These parameters are
  separated by the NUL byte. The NUL byte is used as the separator because
  this character cannot be used in filenames (which can contain spaces or even
  newlines) and cannot be used in text (the NUL byte terminates the buffer for
  a string).
* The -L option must match the number of arguments that are specified for the
  job. It splits the specified parameters per job.
* The expression "$(readlink -f "$0")" gets the absolute path to the
  shell script itself. This is passed as the executable to run for xargs.
* xargs calls the script itself with the parameters it is being fed.
  The environment variable $CHILD_MODE is set to indicate to the script itself
  that it is run as a child process of the script.

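To illustrate why the NUL byte is a safe separator, here is a small sketch
that is separate from the script above. The field values are made up, and it
uses `-n 2` instead of `-L 2` to group two fields per invocation, on the
assumption that `-n` together with `-0` behaves the same on the xargs
implementation at hand.

```shell
#!/bin/sh
# hypothetical job list: the first job has a field containing a space and a
# field containing a newline.
list() {
    printf '%s\0%s\0' "a b" "first
second"
    printf '%s\0%s\0' "c" "d"
}

# each invocation receives exactly two arguments, with whitespace intact.
out="$(list | xargs -0 -n 2 sh -c 'printf "1=<%s> 2=<%s>\n" "$1" "$2"' sh)"
printf '%s\n' "$out"
```

Each field arrives in the child as one argument, including the embedded space
and newline, which would not be the case with the default whitespace
splitting of xargs.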

The child process:

* The command-line arguments are passed by the parent using xargs.

* The environment variable $CHILD_MODE is set to indicate to the script itself
  that it is run as a child process of the script.

* The script itself (run in child mode) only executes the task and signals its
  status back to xargs and the parent.

* The exit status of the child program is signaled to xargs. This could be
  handled, for example to stop on the first failure (in this example it is not).
  For example if the program is killed, stopped or the exit status is 255 then
  xargs also stops running.

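The exit status 255 behaviour can be tried in isolation. The following is a
minimal sketch with a made-up inline job rather than the script above: the
second job exits with status 255, so xargs should stop and never start the
third job. The exact non-zero exit status of xargs itself differs between
implementations, so the sketch only prints it.

```shell
#!/bin/sh
# hypothetical job: fails hard (exit 255) on the second item.
out="$(printf '%s\n' 1 2 3 | xargs -n 1 sh -c '
    echo "job $1"
    test "$1" = "2" && exit 255
    exit 0' sh 2>/dev/null)"
status="$?"

printf '%s\n' "$out"
echo "xargs exit status: $status"
```

Exiting with any other non-zero status, for example 1, lets xargs continue
with the remaining jobs and only affects its final exit status.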

# Description of used xargs options

From the OpenBSD man page: <https://man.openbsd.org/xargs>

    xargs - construct argument list(s) and execute utility

Options explained:

* -r: Do not run the command if there are no arguments. Normally the command
  is executed at least once even if there are no arguments.
* -0: Change xargs to expect NUL ('\0') characters as separators, instead of
  spaces and newlines.
* -P maxprocs: Parallel mode: run at most maxprocs invocations of utility
  at once.
* -L number: Call utility for every number of non-empty lines read. A line
  ending in unescaped white space and the next non-empty line are considered
  to form one single line. If EOF is reached and fewer than number lines have
  been read then utility will be called with the available lines.

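A quick way to see some of these options in action is sketched below. It
assumes an xargs that supports -r, -0 and -P, and it uses `-n 1` (one argument
per invocation) instead of -L to keep the grouping simple. The item values are
made up.

```shell
#!/bin/sh
# -r: the input is empty, so the utility is not run and nothing is printed.
empty="$(printf '' | xargs -r echo "utility was run")"
echo "with -r: '$empty'"

# -0 and -P: three NUL-separated items, at most two invocations at once.
# The completion order is not guaranteed, so the output is sorted here.
items="$(printf '%s\0' "a b" "c" "d" | xargs -0 -P 2 -n 1 echo | sort)"
printf '%s\n' "$items"
```

Because of -0 the item "a b" stays a single argument and is printed on one
line.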

# xargs options -0 and -P, portability and historic context

Some of the options, like -P, are as of writing (2023) non-POSIX:
<https://pubs.opengroup.org/onlinepubs/9699919799/utilities/xargs.html>.
However, many systems have supported this useful extension for many years now.

The specification even mentions implementations which support parallel
operations:

"The version of xargs required by this volume of POSIX.1-2017 is required to
wait for the completion of the invoked command before invoking another command.
This was done because historical scripts using xargs assumed sequential
execution. Implementations wanting to provide parallel operation of the invoked
utilities are encouraged to add an option enabling parallel invocation, but
should still wait for termination of all of the children before xargs
terminates normally."


Some historic context:

The xargs -0 option was added on 1996-06-11 by Theo de Raadt, about a year
after the NetBSD import (over 27 years ago at the time of writing):

[CVS log](http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/xargs/xargs.c?rev=1.2&content-type=text/x-cvsweb-markup)

On OpenBSD the xargs -P option was added on 2003-12-06 by syncing the FreeBSD
code:

[CVS log](http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/xargs/xargs.c?rev=1.14&content-type=text/x-cvsweb-markup)


Looking at the imported git history log of GNU findutils (which has xargs), the
very first commit already had the -0 and -P options:

[git log](https://savannah.gnu.org/git/?group=findutils)

    commit c030b5ee33bbec3c93cddc3ca9ebec14c24dbe07
    Author: Kevin Dalley <kevin@seti.org>
    Date:   Sun Feb 4 20:35:16 1996 +0000

        Initial revision


# xargs: some incompatibilities found

* When using the -0 option, empty fields are handled differently in different
  implementations.
* The -n and -L options don't work correctly in many of the BSD implementations.
  Some count empty fields, some don't. Early implementations in FreeBSD and
  OpenBSD only processed the first line. In OpenBSD this was improved
  around 2017.

Depending on what you want to do, a workaround could be to use the -0 option
with a single field and use the -n flag. Then in each child program invocation
split the field by a separator.

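The workaround could look roughly like the sketch below. It is an
illustration, not a drop-in replacement for the script above: the TAB
separator is an arbitrary choice and assumes that no field contains a TAB,
and the inline `sh -c` child stands in for the script's child mode.

```shell
#!/bin/sh
# hypothetical separator: assumes no field contains a TAB character.
sep="$(printf '\t')"

# one NUL-terminated field per job: "arg1<TAB>arg2".
list() {
    for f in 1 2 3; do
        printf '%s%s%s\0' "$f" "$sep" "something $f"
    done
}

# child: split the single field back into two arguments.
out="$(list | xargs -0 -n 1 sh -c '
    sep="$(printf "\t")"
    a="${1%%"$sep"*}" # text before the first separator
    b="${1#*"$sep"}"  # text after the first separator
    printf "[%s] [%s]\n" "$a" "$b"' sh)"
printf '%s\n' "$out"
```

Since every job is exactly one field, xargs only needs `-n 1`, and the way
each implementation groups fields with -n or -L matters less.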

# References

* xargs: <https://man.openbsd.org/xargs>
* printf: <https://man.openbsd.org/printf>
* ksh, wait: <https://man.openbsd.org/ksh#wait>
* wait(2): <https://man.openbsd.org/wait>