
                         Howto/FAQ project vserver

1: contributions

  1.1: kernel

   A patch to have ext3 in the 2.4.13-ctx-3 kernel

   Contributed by Guillaume Dallaire (info@guillaum.org). The patch is
   available at [1]here.

   Are there any patches against the -ac kernel versions (Alan Cox dev
   tree)

   Some patches are maintained by Paul Kreiner [2]deacon@thedeacon.org.
   You can find them at [3]http://thedeacon.org/patches

   Is it working on kernel 2.2

   A patch is available at
   [4]http://vserver.digitalangel.com.au/patch-2.2.20ctx-8 .

   You will find some notes about the patch [5]here

2: File-systems

  2.1: Access

  2.1.1: Sharing

   Is it possible to share one area of a file system between vservers

   Vservers run in a chroot environment. As such, they can only see what
   is under their / directory. So it does not sound possible to share
   one area or one file-system between several virtual servers.

   There is an option. Kernel 2.4 allows one volume to be mounted several
   times at different mount points. Say you have a volume /dev/hda3 and
   you would like to share it between vservers v1 and v2. You can do the
   following:

        mkdir /vservers/v1/data
        mkdir /vservers/v2/data
        mount /dev/hda3 /vservers/v1/data
        mount /dev/hda3 /vservers/v2/data

   You can fill the /etc/fstab file so that /dev/hda3 is mounted at boot
   time.
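
   As a minimal sketch, the /etc/fstab lines for such a shared volume
   could be generated with a small shell function. The function name
   fstab_lines, the ext2 file-system type and the "defaults 0 0" fields
   are assumptions made for the illustration; adapt them to your setup.

```shell
#!/bin/sh
# fstab_lines DEVICE VSERVER...
# Print one /etc/fstab line per vserver, mounting DEVICE on
# /vservers/NAME/data. The "ext2" type and the "defaults 0 0" fields
# are assumptions; this function never writes to /etc/fstab itself.
fstab_lines() {
    device=$1
    shift
    for name in "$@"; do
        printf '%s\t/vservers/%s/data\text2\tdefaults\t0 0\n' "$device" "$name"
    done
}

# Example: review the output, then append it to /etc/fstab by hand.
# fstab_lines /dev/hda3 v1 v2
```

   Printing the lines instead of editing /etc/fstab directly leaves the
   final decision to the administrator.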

   This is not completely flexible since you can only share a full
   partition. If you want to share a smaller area (and potentially
   several of those small areas), you can loopback mount a file and share
   it. For example:

        dd if=/dev/zero bs=1024k count=10 of=/var/data
        /sbin/losetup /dev/loop0 /var/data
        /sbin/mke2fs /dev/loop0
        mount /dev/loop0 /vservers/v1/data
        mount /dev/loop0 /vservers/v2/data

   Kernel 2.4 also supports the mount --bind option. This allows one to
   connect a directory in multiple places in the file system, even if
   this directory is not a mount point. For example, you may want to
   create several vservers to test various distributions, yet you want
   to share the /home directory between them. The following command will
   do the job. This is probably the easiest way to share data between
   vservers.

        mount --bind /home /vservers/name/home
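
   To apply the same bind mount to several vservers, a loop such as the
   following sketch may help. The function name bind_home_cmds and its
   optional base directory argument are hypothetical. The function only
   prints the mount commands, so they can be reviewed before being run
   as root, and it assumes vserver names without spaces.

```shell
#!/bin/sh
# bind_home_cmds [BASE]
# Print a "mount --bind" command for every vserver directory found
# under BASE (default: /vservers). Nothing is mounted here.
bind_home_cmds() {
    base=${1:-/vservers}
    for dir in "$base"/*/; do
        # Skip the unexpanded pattern when BASE holds no directory
        [ -d "$dir" ] || continue
        # $dir already ends with a slash
        printf 'mount --bind /home %shome\n' "$dir"
    done
}

# Example: run the printed commands once they look right.
# bind_home_cmds | sh
```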

3: general

  3.1: administration

   Changing the host-name or the IP of a vserver

   For a vserver named xxx, do:

     * Stop the vserver
       vserver xxx stop
     * Edit the file /etc/vservers/xxx.conf and change the entries IPROOT
       and S_HOSTNAME.
     * Edit /vservers/xxx/etc/hosts and fix the IP and hostname as well.
     * Start the vserver:
       vserver xxx start
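
   The configuration edit in the second step could be scripted along
   these lines. This is only a sketch: the helper name set_conf_entry is
   hypothetical, and it assumes the configuration file holds one
   shell-style KEY=value assignment per line. The address and host name
   in the example are placeholders.

```shell
#!/bin/sh
# set_conf_entry FILE KEY VALUE
# Replace the line "KEY=..." in FILE with KEY="VALUE".
# Assumes one assignment per line and a VALUE without "|", "&" or
# backslash characters (they are special to the sed command below).
set_conf_entry() {
    file=$1
    key=$2
    value=$3
    tmp=$file.tmp.$$
    # Write to a temporary file first, then replace the original
    sed "s|^$key=.*|$key=\"$value\"|" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Example, with the vserver stopped:
# set_conf_entry /etc/vservers/xxx.conf IPROOT 192.168.1.10
# set_conf_entry /etc/vservers/xxx.conf S_HOSTNAME www.example.com
```

   Lines that do not start with the given key are left untouched.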

   How do I know in which security context process N is running

   You can tell this from the /proc/N/status file. Do this

        /usr/sbin/chcontext --ctx 1 cat /proc/N/status

   Check the s_context entry. You will see the context number (say X).
   Then do

        /usr/sbin/chcontext --ctx X /bin/sh

   You are now in the same security context as process N. You can kill
   it, trace it, whatever.
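
   The two steps can be combined with a small helper. This is a sketch
   that assumes the entry appears as "s_context: X" on its own line of
   the status file; the helper name context_of and the PID 1234 are
   placeholders.

```shell
#!/bin/sh
# context_of STATUS_FILE
# Print the value of the s_context entry found in a saved copy of
# /proc/N/status. Assumes the entry is formatted as "s_context: X".
context_of() {
    awk '$1 == "s_context:" { print $2; exit }' "$1"
}

# Example for process 1234; context 1 is used to read the status file
# because it can see all processes:
# /usr/sbin/chcontext --ctx 1 cat /proc/1234/status > /tmp/status.1234
# /usr/sbin/chcontext --ctx "$(context_of /tmp/status.1234)" /bin/sh
```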

   Is it possible to move a vserver from one physical server to another

   Yes. In fact, a vserver is fairly hardware independent. You can move
   it from an IDE + uniprocessor server to SCSI + multiprocessor server
   without any reconfiguration. Just copy /etc/vservers/XX.conf and
   /vservers/XX to the new physical server and start it there. To move
   /vservers/XX, you may want to use rsync. For example

        rsync -e ssh -avH /vservers/XX/ new-server:/vservers/XX/

   will do. You can use this to have fail-over. If a vserver is kept
   updated on a regular basis, using either rsync, or even network raid,
   a vserver may be started on a new machine without any fixes.

   Is it possible to run vservers based on different distros ?

   Yes, no problem. For now the vserver project is a little
   redhat/mandrake aware, but some are using it with other distros. Once
   a service is running in a vserver, it is talking directly to the
   kernel. So a debian vserver could be running on a redhat server or the
   reverse. The only issues are:

     * A vserver is normally created from the root server. If you intend
       to run a different distribution inside a vserver, you will have to
       copy it from another server, or find a way to install the package
       somehow.
     * The /usr/sbin/vserver utility is redhat-ish and assumes the
       services are configured in runlevel 3. We intend to solve this by
       having each vserver run its own /sbin/init. As such each vserver
       will have its own way of enabling the services. From
       /usr/sbin/vserver, the only thing to do would be to
       start/stop/signal /sbin/init in the vserver.

   Is it possible to see processes in other vservers

   A vserver can only see its own processes + init (to make pstree cute).

   The root server can only see its own processes as well, to make the
   root server less scary to use. "killall httpd" in the root server will
   kill httpd in the root server only.

   The security context number 1 is reserved. This context can see all
   processes. The vserver package provides 3 little wrappers to help
   manage all the processes:

     * vtop
     * vpstree
     * vps

   Those wrappers are simply doing

        # The vpstree wrapper
        /usr/sbin/chcontext --ctx 1 pstree $*

   Only root in the root server is allowed to "jump" into a specific
   context.

   May I rename a vserver ?

   To change the name of a vserver from oldname to newname, do the
   following:

        mv /vservers/oldname /vservers/newname
        mv /etc/vservers/oldname.conf /etc/vservers/newname.conf
        mv /etc/vservers/oldname.sh /etc/vservers/newname.sh

   To avoid problems, you must stop the vserver before doing so. If you
   want to rename the vserver while it is running, you will have to do
   the following:

     * Rename /var/run/vservers/oldname.ctx to newname.ctx. Unless you do
       so, you won't be able to control the vserver (stop, restart, enter
       it, ...).
     * Update /etc/mtab because /proc and /dev/pts are mounted in
       /vservers/oldname, not newname.

   Stopping the vserver first is a better idea :-)
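
   For a stopped vserver, the three mv commands can be wrapped in a
   function that refuses to overwrite an existing vserver. This is a
   sketch: the function name rename_vserver and its optional ROOT prefix
   (useful to try it on a scratch copy of the tree) are hypothetical.

```shell
#!/bin/sh
# rename_vserver OLD NEW [ROOT]
# Rename a stopped vserver by moving its directory and its
# configuration files. ROOT is an optional path prefix, empty by
# default, so the real /vservers and /etc/vservers are used.
rename_vserver() {
    old=$1
    new=$2
    root=${3:-}
    # Refuse to overwrite an existing vserver
    if [ -e "$root/vservers/$new" ]; then
        echo "rename_vserver: $new already exists" >&2
        return 1
    fi
    mv "$root/vservers/$old" "$root/vservers/$new" || return 1
    mv "$root/etc/vservers/$old.conf" "$root/etc/vservers/$new.conf" || return 1
    # The .sh script is only moved when it exists
    if [ -f "$root/etc/vservers/$old.sh" ]; then
        mv "$root/etc/vservers/$old.sh" "$root/etc/vservers/$new.sh"
    fi
}

# Example:
# vserver oldname stop && rename_vserver oldname newname
```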

  3.1.1: unification

   A unified vserver seems as big as the reference, how come ?

   You have created a new vserver using the /usr/sbin/newvserver command.
   You have selected the "unified mode" check-box. Once created, just to
   make sure, you run the du command on both the reference vserver and
   the new one

        du /vservers/ref
        du /vservers/new

   The du command produced the same result. Not saving much disk space ?

   The du command is not the right tool to test this. One easy way to
   test is to run df before and after the new vserver creation. This will
   show the exact amount of disk space allocated to the new vserver. Here
   is the explanation:

   Unification is made using hard links. A hard link is another name
   pointing to the same data. The entity controlling the mapping of a
   file on a Unix/Linux file system is called an inode. It contains
   information about the file location (the blocks making up the file),
   the access rights and ownership, and a few other flags. The name is
   not stored in the inode itself. A directory contains a list of names.
   Each name points to an inode. The inode also holds a reference counter
   so it knows how many directory entries point to it. From this
   explanation, you see that a file name points to an inode, and does not
   relate to any other name pointing to the same inode. All we can tell
   is how many names are pointing to the same inode. Finding which names
   point to a single inode involves a complete file system traversal,
   opening every directory to find a name pointing to the given inode.

   Here is a little demonstration:

        cd /tmp
        # We create a dummy file and see which inode number it has
        touch dummy
        ls -i dummy
        # The number printed is the inode number
        # Now we create a link to this file
        ln dummy dummy2
        # Now we check that the reference count is 2 since
        # both dummy and dummy2 point to the same inode
        ls -l dummy2
        # What is the inode of dummy2
        ls -i dummy2
        # The same as dummy.

   What is the point ? We have two files, dummy and dummy2, each pointing
   to the same inode. Which one is the real file ? Either one is. I can
   delete dummy and dummy2 will continue to exist unchanged. I can delete
   dummy2 and dummy will continue to exist. If I delete both, then the
   space allocated will be freed.

   Back to our du utility

        du dummy
        du dummy2

   We are getting the same result. The command does not care about the
   reference count. dummy2 is as real as dummy. Applied to a unified
   vserver, we get the same result on the original and the new one. For
   the du command, neither is more the owner of the files. This also
   shows how independent two unified vservers are. They are sharing the
   same data space, yet they are truly independent. Packages may be
   updated in one and it won't affect the other. The vserver ref may be
   deleted and this won't affect the new vserver.

   Sometimes, it is useful to find how much disk space is used by one
   vserver alone. The /usr/lib/vserver/vdu utility was written for this
   purpose. It works like du (a minimal one) except it ignores files with
   more than one link. Hard links are seldom used in a vserver, so the
   result is rather precise. vdu will indeed show that your new vserver
   is not so big after all. But if you apply it to the ref vserver, you
   will get the same (small) result.
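
   The idea of skipping files with more than one link can be
   illustrated with find. The following is a sketch of the principle
   only, not the actual vdu implementation: the function name vdu_sketch
   is hypothetical, and it adds up file sizes in bytes where du reports
   allocated blocks.

```shell
#!/bin/sh
# vdu_sketch DIR
# Print the total size, in bytes, of the regular files under DIR that
# have exactly one link, i.e. files not shared through hard links.
vdu_sketch() {
    total=0
    # "-links 1" keeps only files known under a single name
    for size in $(find "$1" -type f -links 1 -exec sh -c 'wc -c < "$1"' sh {} \;); do
        total=$((total + size))
    done
    echo "$total"
}

# Example:
# vdu_sketch /vservers/new
```

   A file shared by two unified vservers has a link count of 2, so it is
   left out of the total for both of them.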

   How to update 10 unified vservers and keep them unified ?

   You have 10 vservers and they are unified. So you are saving a good
   amount of disk space. Although unified vservers are sharing common
   files through hard links and special immutability flags, they can be
   updated independently. Well, this is in fact the only way. There is no
   magic way to update one package on one vserver (the reference one or
   not) and have the change inherited magically by the other vservers.
   The update operation has to be done 10 times.

   The /usr/sbin/vrpm utility has been created to ease those updates. For
   example, say you have 4 vservers v1 v2 v3 and v4 and 3 packages a.rpm,
   b.rpm and c.rpm to update. You do:

        vrpm v1 v2 v3 v4 -- a.rpm b.rpm c.rpm

   or

        vrpm ALL -- a.rpm b.rpm c.rpm

   The last command will apply the updates to all your vservers, one
   after the other.

   Now, after performing these steps, you end up with 4 vservers updated
   independently. The disk space is not unified any more for those 3
   packages. To regain unification, you do:

        /usr/lib/vserver/vunify v1 v2 v3 v4 -- a b c

   vunify may be used at any time to re-unify vservers. You may want to
   run it after you have performed major RPM updates.

   Is it possible to move a unified vserver without the reference vserver
   ?

   Yes, the unification (hard linking common files) does not establish a
   parent-hood relation with the reference server. They just end up
   sharing a common area on the disk drive (the hard linked files). A
   reference vserver may be updated without affecting vservers created
   from it. Once a vserver is created, unified or not, it is fully
   independent.

   Is it possible to use hard links between vservers or the root server

   Yes, hard links are low level and work across chroot(). This is
   exactly what the vunify command is doing to save disk space. Using the
   immutable ext2 file attribute, you can share files between virtual
   servers and be sure none can change them.

   In fact, newvserver defaults to creating unified vservers (vservers
   sharing common files using hard links). Using the new
   immutable-linkage-invert, vservers are sharing common files, using
   much less disk space (a common vserver is between 20-40 megs) yet they
   can be updated independently without side effects.

  3.2: misc

   How does this differ from the BSD jail system call

   It differs a little. It is somewhat more flexible because it uses 3
   system calls (chroot, set_ipv4root, new_s_context) to achieve the job.
   So each system call may be used independently.

   For example, if you want to limit xinetd service in the root server to
   a single IP, you can do

        /usr/sbin/chbind --ip eth0 /etc/rc.d/init.d/xinetd restart

   The package provides the v_xinetd for this purpose. So to get this
   going, you need very little reconfiguration. No fiddling in
   configuration files and so on.

   I am unsure about the jail system call and the new_s_context() I have
   implemented though. The latter is used to isolate the process in a
   private world where it can't see and interact with other processes in
   the box, except itself. The new_s_context is not privileged, so a
   normal user can use this to, for example, set up a personal security
   box before executing a not-so-trusted game.

   Also the new_s_context() syscall allows the root user in the root
   server to "enter" a running vserver, unlike the jail syscall (which
   can't add new processes to a running "jail"). On this side, the
   implementation is also more flexible. This is very useful, because it
   allows the root server to monitor the vservers and to start and stop
   them very easily, in a clean way.

   What about performance

   You can expect the exact same performance in a vserver as compared to
   the root server. There is no overhead. Processes running in the
   vserver are talking directly to the kernel. Only a few system calls
   (kill for one) have special checks to ensure process isolation.

  3.3: starting

   Is it possible to execute some tasks when a vserver is started

   The vserver utility checks if there is a file /etc/vservers/name.sh
   when it is operating a vserver called "name". This file is a script
   and is called in four cases: before starting a vserver, after starting
   it, before stopping it and after stopping it. The first argument is
   one of pre-start, post-start, pre-stop and post-stop. The second
   argument is the name of the vserver. A typical script looks like:

        #!/bin/sh
        case $1 in
        pre-start)
                mount --bind /home /vservers/$2/home
                ;;
        post-start)
                ;;
        pre-stop)
                ;;
        post-stop)
                umount /vservers/$2/home
                ;;
        esac

4: issues

  4.1: applications

   bind does not work in a vserver (capset failed)

   The bind package expects to have the capability CAP_SYS_RESOURCE. It
   expects this because it may need to increase its ulimit. By default, a
   vserver does not have this capability. A vserver starts with some
   ulimit values and can only reduce them, not enlarge them. The idea is
   to control what a vserver can use.

   To fix that, one can give the capability to the vserver running bind.
   Edit the vserver configuration file (/etc/vservers/*.conf) and modify
   the S_CAPS line like this

        S_CAPS="CAP_SYS_RESOURCE"

   Using DHCP server in a vserver

   Since 2.4.18ctx-9, this is possible, but there is a catch. The
   set_ipv4root system call assigns one IP and one broadcast address to a
   vserver. UDP services listening to 0.0.0.0 (bind any) in a vserver are
   in fact listening to the vserver IP and vserver broadcast. This is all
   they will get. This is fine for most services.

   Unfortunately, dhcpd is receiving special broadcasts. Its clients are
   unaware of their IP number, so they are using the special broadcast
   address 255.255.255.255.

   A vserver generally runs with the broadcast address of the network
   device (the one used to set up the IP alias). This network device has
   a broadcast address which is never 255.255.255.255. Those special
   broadcasts are not sent to the vserver. The solution is to set the
   IPROOTBCAST entry in the vserver configuration file like this:

        IPROOTBCAST=255.255.255.255

   Restart your vserver and dhcpd will work. There is a catch (at least
   with 2.4.18ctx-9). If you are using other services in the same
   vserver, also relying on broadcast for proper operation (samba for
   one), they won't operate properly.

   One solution would be to enhance the semantic a little: A vserver
   would listen for its IP address, its broadcast address and also for
   255.255.255.255. The dhcpd case is probably very specific though.

   Btw, we are running dhcpd in a vserver because we are using heartbeat
   to provide failover for this service as well.

5: security

  5.1: misc

   Vservers can write to /dev/random, is this a problem ?

   I found the following post on linux-kernel

   [6]
   
   which states:
   
          No, writing to /dev/random does not feed update entropy
          estimate. It does mix data into the pool, but the mixing
          algorithm is designed so that you can do no harm by mixing any
          data into the pool --- even nasty data chosen by an attacker.
          Hence, allowing someone to write into /dev/random is perfectly
          safe; it can cause no damage, and might improve things. That's
          why /dev/random should be world-writable. There is a separate
          ioctl which requires root privs to atomically mix data into the
          pool and update the entropy estimate. That's the interface
          which is supposed to be used by trusted daemons which pull data
          from various hardware devices, and feed them into /dev/random.

   So writing is safe. How about ioctls ? Some may indeed influence the
   entropy pool. But they are already protected by the CAP_SYS_ADMIN
   capability, so even root in a virtual private server can't use them.

  5.2: principles

   Is a chroot() environment really unbreakable

   Since the kernel 2.4.17ctx-6, all issues with chroot are now plugged.
   root inside a vserver, even with the CAP_SYS_CHROOT capability can't
   escape out.

   Here are the usual tricks used to escape a chroot environment.

     * Open the hard drive device
       There are two ways one may escape from a chroot environment. One
       is to set up a block device special file and open it with some
       user-land file system browser. At this point, you are bypassing
       both the chroot restriction and all file system access control.
       This does not work on a vserver because the special devices are
       not created in /dev and the CAP_MKNOD capability is disabled. Even
       root can't create special files in a vserver.
       So this hole is plugged.
     * Changing the root while keeping the current directory behind.
       This is a trick to exploit a flaw in the chroot() system call. The
       system call is changing the logical root directory for the calling
       process but does not change the current working directory of the
       process. So the current working directory is (generally) left
       behind the new root directory.
       The /usr/sbin/chroot command takes care of this flaw by changing
       the working directory, but the system call does not do that.
       So if your process now has a working directory behind (out of
       scope of) the new root, it is allowed to move up to the real root
       by doing multiple chdir("..") system calls. The process kind of
       escapes from the kernel radar.
       This is a common bug found on Unix chroot() and Linux had this
       flaw also. Kernel 2.4.13 has been tested and does not show this
       problem.

   So it seems chroot() is safe. Does anyone have more information about
   this ?

References

   1. ftp://ftp.solucorp.qc.ca/pub/vserver/patches/patch-2.4.13ctx-3-ext3
   2. mailto:deacon@thedeacon.org
   3. http://thedeacon.org/patches
   4. http://vserver.digitalangel.com.au/patch-2.2.20ctx-8
   5. http://vserver.digitalangel.com.au/
   6. http://www.uwsg.iu.edu/hypermail/linux/kernel/0012.2/0502.html
