
                         Howto/FAQ project vserver

1: contributions

  1.1: documentation

   Is there other documentation available ?

   There   is   the   excellent   FAQ   written   by   Paul   Sladen   at
   [1]http://www.paul.sladen.org/vserver/faq 

  1.2: kernel

   A patch to have ext3 in the 2.4.13-ctx-3 kernel

   Contributed by Guillaume Dallaire (info@guillaum.org). The patch is
   available [2]here.

   Are there any patches against the -ac kernel versions (Alan Cox dev
   tree) ?

   Some  patches  are maintained by Paul Kreiner [3]deacon@thedeacon.org.
   You can find them at [4]http://thedeacon.org/patches

   Does it work on kernel 2.2 ?

   A patch is available at
   [5]http://vserver.digitalangel.com.au/patch-2.2.20ctx-8 .

   You will find some notes about the patch [6]here

2: File-systems

  2.1: Access

  2.1.1: Sharing

   Is it possible to share one area of a file system between vservers

   Vservers run in a chroot() environment. As such, they can only see
   what is under their / directory, so it does not sound possible to
   share one area of a file-system between several virtual servers.

   There is an option, however. Kernel 2.4 allows one volume to be
   mounted several times at different mount points. Say you have a
   volume /dev/hda3 and you would like to share it between vservers v1
   and v2. You can do the following:

        mkdir /vservers/v1/data
        mkdir /vservers/v2/data
        mount /dev/hda3 /vservers/v1/data
        mount /dev/hda3 /vservers/v2/data

   You can add entries to the /etc/fstab file so that /dev/hda3 is
   mounted at boot time.
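   A sketch of the corresponding /etc/fstab entries on the root server
   (device name, mount points and file system type follow the example
   above; adjust them to your layout):

```
/dev/hda3  /vservers/v1/data  ext2  defaults  0 0
/dev/hda3  /vservers/v2/data  ext2  defaults  0 0
```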

   This is not completely flexible, since you can only share a full
   partition. If you want to share a smaller area (and potentially
   several of those small areas), you can loopback-mount a file and
   share it. For example:

        dd if=/dev/zero bs=1024k count=10 of=/var/data
        /sbin/losetup /dev/loop0 /var/data
        /sbin/mke2fs /dev/loop0
        mount /dev/loop0 /vservers/v1/data
        mount /dev/loop0 /vservers/v2/data

   Kernel 2.4 also supports the mount --bind option. This allows one to
   attach a directory at multiple places in the file system, even if
   this directory is not a mount point. For example, you may want to
   create several vservers to test various distributions, yet share the
   /home directory among them. The following command will do the job.
   This is probably the easiest way to share data between vservers.

        mount --bind /home /vservers/name/home

3: general

  3.1: administration

   Changing the host-name or the IP of a vserver

   For a vserver named xxx, do:

     * Stop the vserver
       vserver xxx stop
     * Edit the file /etc/vservers/xxx.conf and change the entries IPROOT
       and S_HOSTNAME.
     * Edit /vservers/xxx/etc/hosts and fix the IP and hostname as well.
     * Start the vserver:
       vserver xxx start

   How do I know in which security context process N is running

   You can tell this from the /proc/N/status file. Do this

        /usr/sbin/chcontext --ctx 1 cat /proc/N/status

   Check  the  s_context  entry. You will see the context number (say X).
   Then do

        /usr/sbin/chcontext --ctx X /bin/sh

   You  are  now  in the same security context as process N. You can kill
   it, trace it, whatever.

   Is it possible to have a different time zone for one vserver

   Yes. The timezone is really cosmetic. The timezone is handled by a
   file called /etc/localtime. Whenever a program has to present the
   system time, it requests the current time from the kernel and
   receives a GMT time. It then uses /etc/localtime to find out how to
   translate that time to local time. Each vserver may have a different
   one. Just copy the proper file from /usr/share/zoneinfo over
   /etc/localtime.
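   The copy can be scripted; here is a minimal sketch (the helper name
   is ours, not part of the vserver package):

```shell
# set_vserver_tz: install a zoneinfo file as a vserver's /etc/localtime.
# Copy rather than symlink: a symlink into /usr/share/zoneinfo would
# point outside the vserver's chroot and dangle.
set_vserver_tz() {
    vroot=$1 zone=$2
    cp "/usr/share/zoneinfo/$zone" "$vroot/etc/localtime"
}

# e.g.  set_vserver_tz /vservers/v1 Europe/Paris
```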

   A vserver is not allowed to change the system time unless it has the
   CAP_SYS_ADMIN capability. But that capability is not needed to have a
   different timezone.

   Is it possible to move a vserver from one physical server to another

   Yes. In fact, a vserver is fairly hardware-independent. You can move
   it from an IDE + uniprocessor server to a SCSI + multiprocessor
   server without any reconfiguration. Just copy /etc/vservers/XX.conf
   and /vservers/XX to the new physical server and start it there. To
   copy /vservers/XX, you may want to use rsync. For example

rsync -e ssh -avHl /vservers/XX/ new-server:/vservers/XX

   will do. You can use this for fail-over: if a vserver is kept up to
   date on a regular basis, using either rsync or even network RAID, it
   may be started on a new machine without any fixes.

   Is it possible to run vservers based on different distros ?

   Yes, no problem. For now the vserver project is a little
   redhat/mandrake aware, but some people are using it with other
   distros. Once a service is running in a vserver, it talks directly to
   the kernel. So a Debian vserver could be running on a Red Hat server
   or the reverse. The only issues are

     * A  vserver is normally created from the root server. If you intend
       to run a different distribution inside a vserver, you will have to
       copy  it from another server, or find a way to install the package
       somehow.
     * The /usr/sbin/vserver script is redhat-ish and assumes the
       services are configured in runlevel 3. We intend to solve this by
       having each vserver run its own /sbin/init. As such, each vserver
       will have its own way of enabling the services. From
       /usr/sbin/vserver, the only thing to do would be to
       start/stop/signal /sbin/init in the vserver.

   Is it possible to see processes in other vservers

   A vserver can only see its own processes + init (to make pstree cute).

   The root server can only see its own processes as well, to make the
   root server less scary to use. "killall httpd" in the root server
   will kill httpd in the root server only.

   The security context number 1 is reserved. This context can see all
   processes. The vserver package provides three little wrappers to help
   manage all the processes:

     * vtop
     * vpstree
     * vps

   Those wrappers simply do the following:

        # The vpstree wrapper
        /usr/sbin/chcontext --ctx 1 pstree $*

   Only  root  in  the  root  server is allowed to "jump" into a specific
   context.

   May I rename a vserver ?

   To change the name of a vserver from oldname to newname, do the
   following:

        mv /vservers/oldname /vservers/newname
        mv /etc/vservers/oldname.conf /etc/vservers/newname.conf
        mv /etc/vservers/oldname.sh /etc/vservers/newname.sh

   To  avoid  problems, you must stop the vserver before doing so. If you
   want  to  rename  the vserver while it is running, you will have to do
   the following:

     * Rename /var/run/vservers/oldname.ctx to newname.ctx. Unless you
       do so, you won't be able to control the vserver (stop, restart,
       enter it, ...).
     * Update  /etc/mtab  because  /proc  and  /dev/pts  are  mounted  in
       /vservers/oldname, not newname.
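   The /etc/mtab fix-up amounts to a search-and-replace; here is a
   minimal sketch, assuming the standard /vservers layout (the helper
   name is ours):

```shell
# rename_mtab: rewrite mount entries after renaming a vserver, so that
# "vserver newname stop" can still find and unmount /proc and /dev/pts.
rename_mtab() {
    mtab=$1 old=$2 new=$3
    sed -i "s|/vservers/$old/|/vservers/$new/|g" "$mtab"
}

# e.g.  rename_mtab /etc/mtab oldname newname
```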

   Stopping the vserver first is a better idea :-)

  3.1.1: unification

   A unified vserver seems as big as the reference, how come ?

   You have created a new vserver using the /usr/sbin/newvserver command.
   You  have selected the "unified mode" check-box. Once created, just to
   make  sure,  you  run the du command on both the reference vserver and
   the new one

        du /vservers/ref
        du /vservers/new

   The du command produced the same result. Not saving much disk space ?

   The du command is not the right tool to test this. One easy way to
   test is to run df before and after the new vserver creation. This
   will show the exact amount of disk space allocated to the new
   vserver. Here is the explanation:

   Unification is made using hard links. A hard link is another name
   pointing to the same data. The entity controlling the mapping of a
   file on a Unix/Linux file system is called an inode. It contains
   information about the file location (the blocks making up the file),
   the access rights and ownership, and a few other flags. The name is
   not stored in the inode itself. A directory contains a list of names,
   and each name points to an inode. The inode also holds a reference
   counter, so it knows how many directory entries point to it. From
   this explanation, you see that a file name points to an inode and
   does not relate to any other name pointing to the same inode. All we
   can tell is how many names point to the same inode. Finding which
   names point to a single inode involves a complete file system
   traversal, opening every directory to find a name pointing to the
   given inode.
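   That traversal can be sketched with find's -inum test (the helper
   name is ours; -xdev keeps the search on one file system, since inode
   numbers are only unique per file system):

```shell
# links_to: print every path under TREE that is a hard link to FILE.
links_to() {
    tree=$1 file=$2
    find "$tree" -xdev -inum "$(stat -c %i "$file")"
}

# e.g.  links_to /vservers /vservers/ref/bin/ls
```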

   Here is a little demonstration:

        cd /tmp
        # We create a dummy file and see which inode number it has
        touch dummy
        ls -i dummy
        # The number printed is the inode number
        # Now we create a link to this file
        ln dummy dummy2
        # Now we check that the reference count is 2 since
        # both dummy and dummy2 point to the same inode
        ls -l dummy2
        # What is the inode of dummy2
        ls -i dummy2
        # The same as dummy.

   What is the point ? We have two files, dummy and dummy2, each
   pointing to the same inode. Which one is the real file ? Either one
   is; neither is more real than the other. I can delete dummy and
   dummy2 will continue to exist unchanged. I can delete dummy2 and
   dummy will continue to exist. Only if I delete both will the
   allocated space be freed.

   Back to our du utility:

        du dummy
        du dummy2

   We are getting the same result. The command does not care about the
   reference count. dummy2 is as real as dummy. Applied to a unified
   vserver, we get the same result on the original and the new one. For
   the du command, neither is more the owner of the files. This also
   shows how independent two unified vservers are. They share the same
   data space, yet they are truly independent. Packages may be updated
   in one and it won't affect the other. The ref vserver may be deleted
   and this won't affect the new vserver.

   Sometimes it is useful to find how much disk space is used by one
   vserver alone. The /usr/lib/vserver/vdu utility was written for this
   purpose. It works like du (a minimal one) except that it ignores
   files with more than one link. Hard links are seldom used within a
   vserver, so the result is rather precise. vdu will indeed show that
   your new vserver is not so big after all. But if you apply it to the
   ref vserver, you will get the same (small) result.
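   The idea behind vdu can be sketched in a few lines of shell (the
   function name and output format are ours, not vdu's):

```shell
# vdu_sketch: sum the sizes of regular files with exactly one link,
# i.e. files not shared with any other vserver through unification.
vdu_sketch() {
    find "$1" -xdev -type f -links 1 -printf '%s\n' |
        awk '{ s += $1 } END { print s+0 " bytes" }'
}
```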

   How to update 10 unified vservers and keep them unified ?

   You have 10 vservers and they are unified, so you are saving a good
   amount of disk space. Although unified vservers share common files
   through hard links and special immutability flags, they can be
   updated independently. In fact, this is the only way. There is no
   magic way to update one package on one vserver (the reference one or
   not) and have the change inherited magically by the other vservers.
   The update operation has to be done 10 times.

   The /usr/sbin/vrpm utility has been created to ease those updates.
   For example, say you have 4 vservers v1, v2, v3 and v4, and 3
   packages a.rpm, b.rpm and c.rpm to update. You do:

vrpm v1 v2 v3 v4 -- a.rpm b.rpm c.rpm
or
vrpm ALL -- a.rpm b.rpm c.rpm

   The  last  command  will  apply  the updates to all your vservers, one
   after the other.

   Now, after performing these steps, you end up with 4 vservers updated
   independently. The disk space is not unified any more for those 3
   packages. To regain unification, you do:

/usr/lib/vserver/vunify v1 v2 v3 v4 -- a b c

   vunify may be used at any time to re-unify vservers. You may want to
   run it after you have performed major RPM updates.

   Is it possible to move a unified vserver without the reference vserver
   ?

   Yes. The unification (hard-linking common files) does not establish a
   parent-child relation with the reference vserver. They just end up
   sharing common areas on the disk drive (the hard-linked files). A
   reference vserver may be updated without affecting vservers created
   from it. Once a vserver is created, unified or not, it is fully
   independent.

   Is it possible to use hard links between vservers or the root server

   Yes, hard links are low-level and work across chroot(). This is
   exactly what the vunify command does to save disk space. Using the
   immutable ext2 file attribute, you can share files between virtual
   servers and be sure none of them can change the files.

   In fact, newvserver defaults to creating unified vservers (vservers
   sharing common files using hard links). Using the new
   immutable-linkage-invert flag, vservers share common files, using
   much less disk space (a common vserver is between 20-40 megs), yet
   they can be updated independently without side effects.

  3.2: misc

   Execution of commands with wild-cards

   I  would  like  to  execute  a  command  using  the  /usr/sbin/vserver
   front-end, but I would like to see the shell wild-card expanded on the
   other side (inside the vserver).

   If you do

/usr/sbin/vserver server exec command \*

   You  end  up  with  \*  passed  to the command directly, without shell
   expansion. The /usr/sbin/vserver front-end is preserving the arguments
   as  much  as  possible.  So  if you escaped something to prevent shell
   expansion, it will remain that way.

   The  trick  is  to use a shell on the other side (in the vserver). The
   command is simply rewritten like this:

/usr/sbin/vserver server exec /bin/sh -c "command *"
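   The difference is easy to reproduce with a plain shell, outside any
   vserver:

```shell
# Work in a scratch directory containing a.log and b.log.
cd "$(mktemp -d)" && touch a.log b.log

sh -c 'echo \*'    # the * is escaped: echo receives a literal *
sh -c 'echo *'     # the inner shell expands it: echo receives a.log b.log
```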

   How does this differ from the BSD jail system call

   It differs a little. It is somewhat more flexible because it uses 3
   system calls (chroot, set_ipv4root, new_s_context) to do the job, so
   each system call may be used independently.

   For example, if you want to limit the xinetd service in the root
   server to a single IP, you can do

        /usr/sbin/chbind --ip eth0 /etc/rc.d/init.d/xinetd restart

   The package provides the v_xinetd script for this purpose. So to get
   this going, you need very little reconfiguration. No fiddling in
   configuration files and so on.

   I am unsure about the difference between the jail system call and the
   new_s_context() I have implemented, though. The latter is used to
   isolate the process in a private world where it can't see and
   interact with other processes in the box, except itself. The
   new_s_context call is not privileged, so a normal user can use it to,
   for example, set up a personal security box before executing a
   not-so-trusted game.

   Also, the new_s_context() syscall allows the root user in the root
   server to "enter" a running vserver, unlike the jail syscall (which
   can't add new processes to a running "jail"). On this side, the
   implementation is also more flexible. This is very useful, because it
   allows the root server to monitor the vservers and to start and stop
   them very easily, in a clean way.

   How many vservers may run at once ?

   A vserver does not use any resources by itself. There is no
   "invisible" overhead for each vserver. The overhead comes from the
   tasks you are running inside the vserver. In general a vserver will
   minimally run

     * syslogd
     * crond
     * sometimes sshd

   So this is the overhead. Now each vserver will do something useful:
   run apache or mysql, for example. Running a task inside a vserver
   uses the same resources as running it outside (a vserver).

   Memory-wise, because of the unification, most tasks will share the
   text segment (program code), so this is fairly efficient.

   Now you may want to run very specialized vservers, potentially
   running a single task without cron and syslog. The overhead goes down
   accordingly.

   For sure, it also depends on the activity of the services. The real
   issue is probably there. If you run 50 vservers, each running apache
   and taking enough hits, you may have performance problems.

   Anyway, you will have to try. All I can say is that a vserver does
   not use resources by itself. It only depends on the apps you are
   running inside, and they use the same resources inside or outside a
   vserver.

   PS: If you run cron on a Red Hat distro, beware of tasks like
   updatedb. With 10 vservers, they will all wake up at 4 in the morning
   and the load will go up. You may want to disable this.

   What about performance ?

   You can expect the exact same performance in a vserver as in the root
   server. There is no overhead. Processes running in the vserver talk
   directly to the kernel. Only a few system calls (kill, for one) have
   special checks to ensure process isolation.

  3.3: starting

   Is it possible to execute some tasks when a vserver is started

   The vserver utility checks if there is a file /etc/vservers/name.sh
   when it is operating a vserver called "name". This file is a script
   and is called in four cases: before starting a vserver, after
   starting it, before stopping it, and after stopping it. The first
   argument is one of pre-start, post-start, pre-stop and post-stop. The
   second argument is the name of the vserver. A typical script looks
   like:

        #!/bin/sh
        case $1 in
        pre-start)
                mount --bind /home /vservers/$2/home
                ;;
        post-start)
                ;;
        pre-stop)
                ;;
        post-stop)
                umount /vservers/$2/home
                ;;
        esac

4: issues

  4.1: applications

   bind does not work in a vserver (capset failed)

   The bind package expects to have the CAP_SYS_RESOURCE capability. It
   expects this because it may need to increase its ulimit. By default,
   a vserver does not have this capability. A vserver starts with some
   ulimit values and can only reduce them, not enlarge them. The idea is
   to control what a vserver can use.

   To fix that, one can give the capability to the vserver running bind.
   Edit the vserver configuration file (/etc/vservers/*.conf) and modify
   the S_CAPS line like this:

        S_CAPS="CAP_SYS_RESOURCE"

   Using DHCP server in a vserver

   Since 2.4.18ctx-9, this is possible, but there is a catch. The
   set_ipv4root call assigns one IP and one broadcast address to a
   vserver. UDP services listening on 0.0.0.0 (bind any) in a vserver
   are in fact listening on the vserver IP and vserver broadcast
   address. This is all they will get. This is fine for most services.

   Unfortunately, dhcpd receives special broadcasts. Its clients are
   unaware of their own IP number, so they use the special broadcast
   address 255.255.255.255.

   A vserver generally runs with the broadcast address of the network
   device (the one used to set up the IP alias). This network device has
   a broadcast address which is never 255.255.255.255, so those special
   broadcasts are not delivered to the vserver. The solution is to set
   the IPROOTBCAST entry in the vserver configuration file like this:

IPROOTBCAST=255.255.255.255

   Restart your vserver and dhcpd will work. There is a catch (at least
   with 2.4.18ctx-9): if you are using other services in the same
   vserver that also rely on broadcasts for proper operation (samba for
   one), they won't operate properly.

   One solution would be to enhance the semantics a little: a vserver
   would listen on its IP address, its broadcast address and also on
   255.255.255.255. The dhcpd case is probably very specific though.

   Btw,  we are running dhcpd in a vserver because we are using heartbeat
   to provide failover for this service as well.

5: security

  5.1: misc

   Vservers can write to /dev/random, is this a problem ?

   I found the following post [7] on linux-kernel, which states:

          No, writing to /dev/random does not update the entropy
          estimate. It does mix data into the pool, but the mixing
          algorithm is designed so that you can do no harm by mixing any
          data into the pool --- even nasty data chosen by an attacker.
          Hence, allowing someone to write into /dev/random is perfectly
          safe; it can cause no damage, and might improve things. That's
          why /dev/random should be world-writable. There is a separate
          ioctl which requires root privs to atomically mix data into
          the pool and update the entropy estimate. That's the interface
          which is supposed to be used by trusted daemons which pull
          data from various hardware devices, and feed them into
          /dev/random.

   So writing is safe. How about ioctls ? Some may indeed influence the
   entropy pool. But they are already protected by the CAP_SYS_ADMIN
   capability, so even root in a virtual private server can't use them.

  5.2: principles

   Is a chroot() environment really unbreakable

   Since kernel 2.4.17ctx-6, all known issues with chroot are plugged.
   Root inside a vserver, even with the CAP_SYS_CHROOT capability, can't
   escape.

   Here are the usual tricks used to escape a chroot environment.

     * Open the hard drive device
       There are two ways one may escape from a chroot environment. One
       is to create a block device special file and open it with some
       user-land file system browser. At this point, you are bypassing
       both the chroot restriction and all file system access control.
       This does not work in a vserver because the special devices are
       not created in /dev and the CAP_SYS_MKNOD capability is disabled.
       Even root can't create special files in a vserver.
       So this hole is plugged.
     * Changing the root while keeping the current directory behind
       This is a trick exploiting a flaw in the chroot() system call.
       The system call changes the logical root directory for the
       calling process but does not change the current working directory
       of the process. So the current working directory is (generally)
       left behind the new root directory.
       The /usr/sbin/chroot command takes care of this flaw by changing
       the working directory, but the system call does not do that.
       So if your process now has a working directory behind (out of
       scope of) the new root, it is allowed to walk up to the real root
       by doing multiple chdir("..") system calls. The process kind of
       escapes from the kernel's radar.
       This is a common bug found in Unix chroot() implementations, and
       Linux had this flaw too. Kernel 2.4.13 has been tested and does
       not show this problem.

   So it seems chroot() is safe. Does anyone have more information about
   this ?

References

   1. http://www.paul.sladen.org/vserver/faq
   2. ftp://ftp.solucorp.qc.ca/pub/vserver/patches/patch-2.4.13ctx-3-ext3
   3. mailto:deacon@thedeacon.org
   4. http://thedeacon.org/patches
   5. http://vserver.digitalangel.com.au/patch-2.2.20ctx-8
   6. http://vserver.digitalangel.com.au/
   7. http://www.uwsg.iu.edu/hypermail/linux/kernel/0012.2/0502.html
