ISOLATE(1) ========== NAME ---- isolate - Isolate a process using Linux Containers SYNOPSIS -------- *isolate* 'options' *--init* *isolate* 'options' *--run* +--+ 'program' 'arguments' *isolate* 'options' *--cleanup* DESCRIPTION ----------- Run 'program' within a sandbox, so that it cannot communicate with the outside world and its resource consumption is limited. This can be used for example in a programming contest to run untrusted programs submitted by contestants in a controlled environment. The sandbox is used in the following way: * Run *isolate --init*, which initializes the sandbox, creates its working directory and prints its name to the standard output. Fails if the sandbox already existed. * Populate the directory with the executable file of the program and its input files. * Call *isolate --run* to run the program. A single line describing the status of the program is written to the standard error stream. * Fetch the output of the program from the directory. * Run *isolate --cleanup* to remove temporary files. Does nothing if the sandbox was already cleaned up. Please note that by default, the program is not allowed to start multiple processes of threads. If you need that, turn on the control group mode (see below). OPTIONS ------- *-M, --meta=*'file':: Output meta-data on the execution of the program to a given file. See below for syntax of the meta-files. *-m, --mem=*'size':: Limit address space of the program to 'size' kilobytes. If more processes are allowed, this applies to each of them separately. *-t, --time=*'time':: Limit run time of the program to 'time' seconds. Fractional numbers are allowed. Time in which the OS assigns the processor to different tasks is not counted. *-w, --wall-time=*'time':: Limit wall-clock time to 'time' seconds. Fractional values are allowed. This clock measures the time from the start of the program to its exit, so it does not stop when the program has lost the CPU or when it is waiting for an external event. We recommend to use *--time* as the main limit, but set *--wall-time* to a much higher value as a precaution against sleeping programs. *-x, --extra-time=*'time':: When a time limit is exceeded, wait for extra 'time' seconds before killing the program. This has the advantage that the real execution time is reported, even though it slightly exceeds the limit. Fractional numbers are again allowed. *-b, --box-id=*'id':: When you run multiple sandboxes in parallel, you have to assign each unique IDs to them by this option. See the discussion on UIDs in the INSTALLATION section. The ID defaults to 0. *-k, --stack=*'size':: Limit process stack to 'size' kilobytes. By default, the whole address space is available for the stack, but it is subject to the *--mem* limit. *-f, --fsize=*'size':: Limit size of files created (or modified) by the program to 'size' kilobytes. In most cases, it is better to restrict overall disk usage by a disk quota (see below). This option can help in cases when quotas are not enabled on the underlying filesystem. *-q, --quota=*'blocks'*,*'inodes':: Set disk quota to a given number of blocks and inodes. This requires the filesystem to be mounted with support for quotas. Please note that this currently works only on the ext family of filesystems (other filesystems use other interfaces for setting quotas). *-i, --stdin=*'file':: Redirect standard input from 'file'. The 'file' has to be accessible inside the sandbox. Otherwise, standard input is inherited from the parent process. *-o, --stdout=*'file':: Redirect standard output to 'file'. The 'file' has to be accessible inside the sandbox. Otherwise, standard output is inherited from the parent process and the sandbox manager does not write anything to it. *-r, --stderr=*'file':: Redirect standard error output to 'file'. The 'file' has to be accessible inside the sandbox. Otherwise, standard error output is inherited from the parent process. See also *--stderr-to-stdout*. *--stderr-to-stdout*:: Redirect standard error output to standard output. This is performed after the standard output is redirected by *--stdout*. Mutually exclusive with *--stderr*. *-c, --chdir=*'dir':: Change directory to 'dir' before executing the program. This path must be relative to the root of the sandbox. *-p, --processes*[*=*'max']:: Permit the program to create up to 'max' processes and/or threads. Please keep in mind that time and memory limit do not work with multiple processes unless you enable the control group mode. If 'max' is not given, an arbitrary number of processes can be run. By default, only one process is permitted. *--share-net*:: By default, isolate creates a new network namespace for its child process. This namespace contains no network devices except for a per-namespace loopback. This prevents the program from communicating with the outside world. If you want to permit communication, you can use this switch to keep the child process in parent's network namespace. *--inherit-fds*:: By default, isolate closes all file descriptors passed from its parent except for descriptors 0, 1, and 2. This prevents unintentional descriptor leaks. In some cases, passing extra descriptors to the sandbox can be desirable, so you can use this switch to make them survive. *-v, --verbose*:: Tell the sandbox manager to be verbose and report on what is going on. Using *-v* multiple times produces even more jabber. *-s, --silent*:: Tell the sandbox manager to keep silence. No status messages are printed to stderr except for fatal errors of the sandbox itself. The combination of *--verbose* and *--silent* has an undefined effect. ENVIRONMENT RULES ----------------- UNIX processes normally inherit all environment variables from their parent. The sandbox however passes only those variables which are explicitly requested by environment rules: *-E, --env=*'var':: Inherit the variable 'var' from the parent. *-E, --env=*'var'*=*'value':: Set the variable 'var' to 'value'. When the 'value' is empty, the variable is removed from the environment. *-e, --full-env*:: Inherit all variables from the parent. The rules are applied in the order in which they were given, except for *--full-env*, which is applied first. The list of rules is automatically initialized with *-ELIBC_FATAL_STDERR_=1*. DIRECTORY RULES --------------- The sandboxed process gets its own filesystem namespace, which contains only subtrees requested by directory rules: *-d, --dir=*'in'*=*'out'[*:*'options']:: Bind the directory 'out' as seen by the caller to the path 'in' inside the sandbox. If there already was a directory rule for 'in', it is replaced. *-d, --dir=*'dir'[*:*'options']:: Bind the directory +/+'dir' to 'dir' inside the sandbox. If there already was a directory rule for 'in', it is replaced. *-d, --dir=*'in'*=*:: Remove a directory rule for the path 'in' inside the sandbox. By default, all directories are bound read-only and restricted (no devices, no setuid binaries). This behavior can be modified using the 'options': *rw*:: Allow read-write access. *dev*:: Allow access to character and block devices. *noexec*:: Disallow execution of binaries. *maybe*:: Silently ignore the rule if the directory to be bound does not exist. *fs*:: Instead of binding a directory, mount a device-less filesystem called 'in'. For example, this can be 'proc' or 'sysfs'. Unless *--no-default-dirs* is specified, the default set of directory rules binds +/bin+, +/dev+ (with devices allowed), +/lib+, +/lib64+ (if it exists), and +/usr+. It also binds the working directory to +/box+ (read-write) and mounts the proc filesystem at +/proc+. *-D, --no-default-dirs*:: Do not bind the default set of directories. Care has to be taken to specify the correct set of rules (using *--dir*) for the executed program to run correctly. In particular, +/box+ has to be bound. CONTROL GROUPS -------------- Isolate can make use of system control groups provided by the kernel to constrain programs consisting of multiple processes. Please note that this feature needs special system setup described in the INSTALLATION section. *--cg*:: Enable use of control groups. This should be specified with *--init*, *--run* and *--cleanup*. *--cg-mem=*'size':: Limit total memory usage by the whole control group to 'size' kilobytes. This should be specified with *--run*. *--cg-timing*:: Use control groups for timing, so that the *--time* switch affects the total run time of all processes and threads in the control group. This should be specified with *--run*. This option is turned on by default, use *--no-cg-timing* to turn off. META-FILES ---------- The meta-file contains miscellaneous meta-information on execution of the program within the sandbox. It is a textual file consisting of lines of format 'key'*:*'value'. The following keys are defined: *cg-mem*:: When control groups are enabled, this is the total memory use by the whole control group (in kilobytes). *cg-oom-killed*:: Present when the program was killed by the out-of-memory killer (e.g., because it has exceeded the memory limit of its control group). This is reported only on Linux 4.13 and later. *csw-forced*:: Number of context switches forced by the kernel. *csw-voluntary*:: Number of context switches caused by the process giving up the CPU voluntarily. *exitcode*:: The program has exited normally with this exit code. *exitsig*:: The program has exited after receiving this fatal signal. *killed*:: Present when the program was terminated by the sandbox (e.g., because it has exceeded the time limit). *max-rss*:: Maximum resident set size of the process (in kilobytes). *message*:: Status message, not intended for machine processing. E.g., "Time limit exceeded." *status*:: Two-letter status code: * *RE* -- run-time error, i.e., exited with a non-zero exit code * *SG* -- program died on a signal * *TO* -- timed out * *XX* -- internal error of the sandbox *time*:: Run time of the program in fractional seconds. *time-wall*:: Wall clock time of the program in fractional seconds. Please note that not all keys have to be present. For example, no *status* nor *message* is reported upon normal termination. RETURN VALUE ------------ When the program inside the sandbox finishes correctly, the sandbox returns 0. If it finishes incorrectly, it returns 1. All other return codes signal an internal error. INSTALLATION ------------ Isolate depends on several advanced features of the Linux kernel. Please make sure that your kernel supports PID namespaces (+CONFIG_PID_NS+), IPC namespaces (+CONFIG_IPC_NS+), and network namespaces (+CONFIG_NET_NS+). If you want to use control groups, you need the cpusets (+CONFIG_CPUSETS+), CPU accounting controller (+CONFIG_CGROUP_CPUACCT+), and memory resource controller (+CONFIG_MEMCG+). If your machine has swap enabled, you should also enable the swap controller (+CONFIG_MEMCG_SWAP+). Debian 7.x and newer require enabling the memory and swap cgroup controllers by adding the parameters "cgroup_enable=memory swapaccount=1" to the kernel command-line, which can be set using +GRUB_CMDLINE_LINUX_DEFAULT+ in /etc/default/grub. Isolate is designed to run setuid to root. The sub-process inside the sandbox then switches to a non-privileged user ID (different for each *--box-id*). The range of UIDs available and several filesystem paths are set in a configuration file, by default located in /usr/local/etc/isolate. Before you run isolate with control groups, you need to ensure that the cgroup filesystem is enabled and mounted. Most modern Linux distributions already provide cgroup support through a tmpfs mounted at /sys/fs/cgroup, with individual controllers mounted within subdirectories. REPRODUCIBILITY --------------- The reproducibility of results can be improved by tuning some kernel parameters, listed below. Some of these parameters can be checked using the program isolate-check-environment. * Disable address space randomization: +sysctl kernel.randomize_va_space=0+. Address space randomization can affect timing, memory usage, and program behavior. This setting can be made persistent through /etc/sysctl.d/. * Disable dynamic CPU frequency scaling. This requires setting the cpufreq scaling governor to +performance+. The process for doing this varies between distributions. * Consider disabling Turboboost on CPUs that might support it (most i3/i5/i7 Intel CPUs). Approach this one with caution. Disabling a CPU that Turboboosts from 2.3 GHz to 2.6 GHz would have minimal impact on run-times in exchange for determinism, but the same on a CPU that Turboboosts from 1.6 GHz to 2.8 GHz will incur a much more dramatic slowdown. Perhaps if the ambient temperature is controlled and only one single-threaded task is keeping the CPU busy at 100%, then TB's behaviour may be reasonably deterministic; requires further experimentation to confirm. * Run evaluations on a single CPU (core). The Linux scheduler has a tendency to randomly migrate tasks between CPUs, incurring cache migration costs. You can use isolate's configuration file to pin the process to a specified CPU. * Disable automatic kernel support for transparent huge pages. Both /sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/defrag should be set to "madvise" or "never", and /sys/kernel/mm/transparent_hugepage/khugepaged/defrag to 0. * Disable swapping. If you really need swap space and you are using cgroups, make sure that you have the memsw controller enabled, so that swap space is properly accounted for. LICENSE ------- Isolate was written by Martin Mares and Bernard Blackham. It can be distributed and used under the terms of the GNU General Public License version 2 or any later version.