Design for 'Execve' and 'Uselib'
Michael Elizabeth Chastain
<mec@duracef.shout.net>
Sat 18 Nov 1995

Copyright 1995 Michael Chastain
Licensed under the GNU General Public License, Version 2



Modelling these calls is an ongoing challenge.

'execve' and 'uselib' alter the address space of the target.  So I
cannot use fetch-smash-store; I have to use record-replay.  'uselib'
is much simpler than 'execve', so I will write only about 'execve'.

The arguments to these calls include a file name.  The tracer has to
record the contents of this file, and the replayer has to arrange for
the same file to be present.

The tracer and the target process do not have the same filesystem name
space: the target process can call 'chdir', 'chroot', and 'fchdir'
before calling 'execve' and 'uselib'.  The tracer must handle this,
in order to mark events unreplayable as needed.

The 'PrDir' class tracks directory changes in the target.  The tracer
records when the target process changes its root directory or its current
directory.  If no changes have been made, the tracer can read a target
file name directly.  If changes have been made, the tracer attempts to
use '/proc/$pid-child/root' or '/proc/$pid-child/cwd' to get the child's
root or current directory.
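
A minimal sketch of that decision, not the actual 'PrDir' code.  The
function name and the two boolean flags are hypothetical; the '/proc'
paths are written here as '/proc/<pid>/root' and '/proc/<pid>/cwd' for
the child's pid.  It assumes the tracer and the target started with the
same root and current directory.

```python
import posixpath


def tracer_path(pid, name, root_changed, cwd_changed):
    """Return the path the tracer would open to read the target's file.

    pid          -- pid of the target (child) process
    name         -- file name exactly as the target passed it
    root_changed -- True if the target has called 'chroot'
    cwd_changed  -- True if the target has called 'chdir' or 'fchdir'
    """
    if posixpath.isabs(name):
        if root_changed:
            # Absolute names are resolved from the target's root.
            return "/proc/%d/root%s" % (pid, name)
        return name
    if cwd_changed:
        # Relative names are resolved from the target's current directory.
        return "/proc/%d/cwd/%s" % (pid, name)
    # Simplification: nothing relevant changed, so read the name directly.
    return name
```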

Even without '/proc', the tracer can follow almost all target programs:

    Most targets never call 'chroot', so all absolute path names work.
    Most targets never call 'execve' (they call 'fork' and the child execs).
    Most targets give absolute path names to 'uselib'.
    Most targets call 'uselib' only at the beginning of execution.

The tracer fetches the file on the EvSco side, and only if the call
succeeded.  If the filename is strange, the call will fail in the target
and the tracer won't have to fetch it.  This avoids some problems with
calls such as 'execve .' or 'uselib /dev/audio'.

At replay time, the file must be written on the EvSci side.  The replayer
must look ahead in the event history to find the matching EvSco and pull
the dataset out of there.
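
The look-ahead might be sketched as below.  The 'Event' record and its
fields are hypothetical, and so is the matching rule (the next EvSco
from the same process for the same system call); the real event history
format is not described here.

```python
from collections import namedtuple

# Hypothetical event record: kind is "EvSci" or "EvSco".
Event = namedtuple("Event", "kind pid syscall dataset")


def find_matching_evsco(history, index):
    """Scan forward from the EvSci at history[index] for its EvSco."""
    sci = history[index]
    for ev in history[index + 1:]:
        if (ev.kind == "EvSco" and ev.pid == sci.pid
                and ev.syscall == sci.syscall):
            return ev
    return None  # no matching exit event was recorded
```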

When the replayer starts execution, it writes all the datasets from the
trace file into a stash directory, '/tmp/stash-$pid'.  It smashes the
dataset names by replacing '/' with '%' and appending a unique sequence
number to the name.  These files live for the life of the replayer.
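
As an illustration only, a stash file name could be built like this.
The '.' separator before the sequence number is an assumption; the text
above says only that a unique sequence number is appended.

```python
def stash_file_name(stash_dir, name, seq):
    """Build the data-stash path for a dataset name.

    The '/' -> '%' replacement follows the description above.  The
    '.<seq>' suffix format is a guess made for this sketch.
    """
    smashed = name.replace("/", "%")
    return "%s/%s.%d" % (stash_dir, smashed, seq)
```

For example, 'stash_file_name("/tmp/stash-123", "/bin/ls", 7)' gives
'/tmp/stash-123/%bin%ls.7' under these assumptions.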

The replayer also creates a stash directory for each target process.
These directories contain links into the data-stash directory.
The links have the dataset names smashed with '/' -> '%' replacement,
but without a sequence number.  These links are transient; they are
created and destroyed only for the duration of the target system call
which uses them.
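
A sketch of such a transient link, written as a context manager so that
the link cannot outlive the system call.  The function name is
hypothetical, and the use of a symbolic link is an assumption; the text
above does not say whether the links are hard or symbolic.

```python
import contextlib
import os


@contextlib.contextmanager
def transient_link(data_file, process_dir, name):
    """Link the smashed 'name' in process_dir to data_file, temporarily.

    data_file   -- file in the data-stash directory (with sequence number)
    process_dir -- stash directory of one target process
    name        -- file name as the target passed it to the system call
    """
    link = os.path.join(process_dir, name.replace("/", "%"))
    os.symlink(data_file, link)   # assumption: symbolic, not hard, link
    try:
        yield link                # the target system call runs here
    finally:
        os.unlink(link)           # destroy the link after the call
```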

The target process is cd'ed into its own stash directory, and it's not
leaving, because all of its 'chdir', 'chroot', and 'fchdir' calls are
getting annulled.

The '/' -> '%' transformation has several necessary properties.  First,
the smashed name has no '/' in it.  I need this because the name is often
an absolute path name, and I can't arrange for an arbitrary absolute path
to hold the recorded data.  A name with no '/' always resolves inside the
target's stash directory.

Second, the smashed name is exactly the same length.  I need this
because Linux 'execve' writes its filename into the target stack, so a
different length would give a different-sized stack!

Third, the name may be aliased with an 'argv' or 'envp' string; aliasing
with 'argv[0]' is common.  Again, the name has to remain the same length,
so that smashing overwrites the string in place and leaves the
surrounding data in the target process untouched.
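
The length-preserving property can be shown on a toy model of target
memory.  Here a 'bytearray' stands in for the target's address space;
this is an illustration, not how the replayer actually writes to the
target.

```python
def smash_in_place(memory, addr):
    """Replace '/' with '%' in the NUL-terminated string at memory[addr].

    Only bytes inside the string change, so the string keeps its length
    and every byte outside it is left alone.  Returns the string length.
    """
    end = memory.index(0, addr)       # position of the terminating NUL
    for i in range(addr, end):
        if memory[i] == ord("/"):
            memory[i] = ord("%")
    return end - addr
```

If 'argv[0]' points at the same address as the name, it sees the smashed
string too, but at the same address and with the same length.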

(What happens if the name is aliased with the 'argv' or 'envp' pointer
block?  Not the data itself, but the pointer block?  Then smashing it
would ruin some pointers and the target is unreplayable.  This is a
'nasty name overlap'.  Fortunately, a program with an NNO is sick, and
reporting the NNO is a sufficient diagnostic.)
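
Detecting an NNO comes down to an interval overlap test.  A minimal
sketch, with hypothetical function names, assuming 4-byte pointers and
a pointer block that ends with one NULL pointer:

```python
def ranges_overlap(a_start, a_len, b_start, b_len):
    """True if [a_start, a_start+a_len) and [b_start, b_start+b_len) meet."""
    return a_start < b_start + b_len and b_start < a_start + a_len


def has_nasty_name_overlap(name_addr, name_len, block_addr, n_pointers,
                           pointer_size=4):
    """True if the name string overlaps an 'argv' or 'envp' pointer block.

    name_len excludes the terminating NUL; the NUL is counted here.
    pointer_size=4 is an assumption (32-bit target).
    """
    block_len = (n_pointers + 1) * pointer_size   # +1 for the NULL pointer
    return ranges_overlap(name_addr, name_len + 1, block_addr, block_len)
```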

After the replay process has come out of 'execve', it has entirely new
memory areas and data.  The dataset has determined most of the new
process state, but not all of it.

The new process stack contains the name of the executing file, which is
the smashed name.  If 'argv[0]' or some other 'argv' or 'envp' string
was aliased to the name, it will have smashed data as well.  So the
tracer fetches the entire original stack as a segment, and the replayer
stores this stack, after checking that the size is the same on replay.

As of Linux version 1.3.42, 'execve' does -not- initialize the
registers, so they have contents left from before 'execve'.  Ordinarily
this is not a problem, as the replay process matches the target process,
including the register values on entrance to all system calls.  But
there is a special circumstance with the first 'execve' that starts
replaying.  To handle this, the tracer fetches all the registers after
'execve', and the replayer stores them.

Another problem is updating the map.  The target map depends upon the
contents of the dataset, and I have to parse this dataset.  Right now, I
have a parser for 'a.out' executables and shared libraries.  I need to
handle ELF format, which includes loading the ELF interpreter.  I need
to handle '#!' scripts.
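
The first step of any such parser is telling the formats apart.  The
sketch below only classifies a dataset by its leading bytes; it does not
compute the map.  The 'a.out' magic numbers are the values I recall from
the Linux <a.out.h> header (low 16 bits of 'a_info', little-endian on
i386) and should be checked against that header.

```python
import struct

ELF_MAGIC = b"\x7fELF"
# Assumed a.out magic numbers, as recalled from Linux <a.out.h>.
AOUT_MAGICS = {0o407: "OMAGIC", 0o410: "NMAGIC",
               0o413: "ZMAGIC", 0o314: "QMAGIC"}


def classify_dataset(data):
    """Return 'script', 'elf', 'a.out', or 'unknown' for a dataset."""
    if data[:2] == b"#!":
        return "script"
    if data[:4] == ELF_MAGIC:
        return "elf"
    if len(data) >= 4:
        (a_info,) = struct.unpack("<I", data[:4])
        if (a_info & 0xFFFF) in AOUT_MAGICS:
            return "a.out"
    return "unknown"
```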

It would be sooooo nice if the tracer could fetch the address map of
the child, and if the replayer could store the address map.  Then I
could dispense with directory following, dataset fetching, dataset
storing in the stash, name smashing, and diagnosing NNOs.  I would
simply fetch-smash-store as usual.
