CFS Bandwidth Control (aka CPU hard limits)
===========================================

[ This document talks about CPU bandwidth control of CFS groups only.
  The bandwidth control of RT groups is explained in
  Documentation/scheduler/sched-rt-group.txt ]

CFS bandwidth control is a group scheduler extension that can be used to
control the maximum CPU bandwidth obtained by a CPU cgroup.

Bandwidth allowed for a group is specified using quota and period. Within
a given "period" (microseconds), a group is allowed to consume up to "quota"
microseconds of CPU time, which is the upper limit or the hard limit. When the
CPU bandwidth consumption of a group exceeds the hard limit, the tasks in the
group are throttled and are not allowed to run until the end of the period at
which time the group's quota is replenished.
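The ratio of quota to period determines how much CPU a group may use. As a
rough illustration, the sketch below converts a quota/period pair into a
percentage of one CPU. The helper name cfs_cpu_percent is hypothetical and is
not part of any kernel interface; it uses integer arithmetic and rounds down.

```shell
# cfs_cpu_percent QUOTA_US PERIOD_US
# Hypothetical helper: prints the limit as a percentage of one CPU
# (integer result, rounded down). Values above 100 mean more than one CPU.
cfs_cpu_percent() {
    echo $(( $1 * 100 / $2 ))
}

cfs_cpu_percent 100000 500000     # prints 20
cfs_cpu_percent 1000000 500000    # prints 200
```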

Runtime available to the group is tracked globally. At the beginning of
every period, the group's global runtime pool is replenished with "quota"
microseconds worth of runtime. Runtime is consumed locally on each CPU, which
fetches it in "slices" from the global pool.

Interface
---------
Quota and period can be set via cgroup files.

cpu.cfs_quota_us: the maximum allowed bandwidth (microseconds)
cpu.cfs_period_us: the enforcement interval (microseconds)

Within a period of cpu.cfs_period_us, the group as a whole will not be allowed
to consume more than cpu.cfs_quota_us worth of runtime.

The default value of cpu.cfs_period_us is 500ms and the default value
for cpu.cfs_quota_us is -1.

A group with cpu.cfs_quota_us as -1 indicates that the group has infinite
bandwidth, which means that it is not bandwidth controlled.

Writing any negative value to cpu.cfs_quota_us will turn the group into
an infinite bandwidth group. Reading cpu.cfs_quota_us for an infinite
bandwidth group will always return -1.
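A script can use the -1 convention to tell limited groups from unlimited ones.
The following is a minimal sketch, assuming the group's cgroup directory is
passed as an argument; the function name cfs_limit_of and the example path are
hypothetical.

```shell
# cfs_limit_of GROUP_DIR
# Hypothetical helper: prints "unlimited" for an infinite bandwidth group
# (negative quota), otherwise "<quota>/<period>" in microseconds.
cfs_limit_of() {
    cfs_quota=$(cat "$1/cpu.cfs_quota_us") || return 1
    cfs_period=$(cat "$1/cpu.cfs_period_us") || return 1
    if [ "$cfs_quota" -lt 0 ]; then
        echo unlimited
    else
        echo "$cfs_quota/$cfs_period"
    fi
}

# Example (the mount point and group name are assumptions):
#   cfs_limit_of /sys/fs/cgroup/cpu/mygroup
```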

System wide settings
--------------------
The amount of runtime transferred from the global pool each time a CPU needs
more of the group's quota locally is controlled by the sysctl parameter
sched_cfs_bandwidth_slice_us. The current default is 5ms. This can be changed
by writing to /proc/sys/kernel/sched_cfs_bandwidth_slice_us.
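To get a feel for the slice size, the sketch below estimates how many
slice-sized transfers are needed to drain a full quota from the global pool.
This is a simplified model: it assumes every transfer takes a whole slice
(except possibly the last) and ignores any other scheduler behaviour. The
helper name cfs_slices_per_period is hypothetical.

```shell
# cfs_slices_per_period QUOTA_US SLICE_US
# Hypothetical helper: number of slice-sized transfers needed to hand out
# QUOTA_US of runtime, rounded up (simplified model, see assumptions above).
cfs_slices_per_period() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

cfs_slices_per_period 100000 5000   # prints 20
cfs_slices_per_period 12000 5000    # prints 3
```

A smaller slice means more frequent transfers from the global pool; a larger
slice means fewer transfers but coarser distribution of runtime among CPUs.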

A quota hierarchy is defined to be consistent if the sum of child reservations
does not exceed the bandwidth allocated to its parent.  An entity with no
explicit bandwidth reservation (i.e. no limit) is considered to inherit its
parent's limits.  This behavior may be managed using
/proc/sys/kernel/sched_cfs_bandwidth_consistent.
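The consistency condition can be checked by hand for a single parent and its
children. The sketch below is only an illustration of the definition above,
not the kernel's implementation: it compares quota/period ratios scaled to
parts per million using integer arithmetic, and it assumes that only children
with an explicit reservation are passed in. The name cfs_is_consistent is
hypothetical.

```shell
# cfs_is_consistent PARENT_QUOTA PARENT_PERIOD [CHILD_QUOTA CHILD_PERIOD]...
# Hypothetical helper: succeeds (exit status 0) if the children's combined
# bandwidth does not exceed the parent's. Ratios are in parts per million
# and rounded down, so borderline cases may be off by a rounding step.
cfs_is_consistent() {
    cfs_parent=$(( $1 * 1000000 / $2 ))
    shift 2
    cfs_sum=0
    while [ $# -ge 2 ]; do
        cfs_sum=$(( cfs_sum + $1 * 1000000 / $2 ))
        shift 2
    done
    [ "$cfs_sum" -le "$cfs_parent" ]
}

# Parent: 1 CPU. Children: 20% and 40% of a CPU -> consistent.
if cfs_is_consistent 500000 500000  100000 500000  200000 500000; then
    echo consistent
fi
```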

Statistics
----------
The cpu.stat file lists three statistics related to CPU bandwidth control.

nr_periods: Number of enforcement intervals that have elapsed.
nr_throttled: Number of times the group has been throttled/limited.
throttled_time: The total time duration (in nanoseconds) for which the group
remained throttled.

These statistics are read-only.
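A common use of these counters is to see how often a group hits its limit.
The sketch below computes the percentage of elapsed periods in which the group
was throttled. It assumes cpu.stat contains one "name value" pair per line;
the helper name cfs_throttled_percent and the example path are hypothetical.

```shell
# cfs_throttled_percent STAT_FILE
# Hypothetical helper: prints nr_throttled as an integer percentage of
# nr_periods, or 0 if no period has elapsed yet.
cfs_throttled_percent() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0)
                printf "%d\n", throttled * 100 / periods
            else
                print 0
        }
    ' "$1"
}

# Example (the mount point and group name are assumptions):
#   cfs_throttled_percent /sys/fs/cgroup/cpu/mygroup/cpu.stat
```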

Hierarchy considerations
------------------------
Each group's bandwidth (quota and period) can be set independently of its
parent or child groups. There are two ways in which a group can get
throttled:

- it consumed its quota within the period
- it has quota left but the parent's quota is exhausted.

In the second case, even though the child has quota left, it will not be
able to run since the parent itself is throttled. Similarly, groups that are
not bandwidth constrained might end up being throttled if any ancestor
in their hierarchy is throttled.
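When a group is throttled unexpectedly, it helps to know which groups on the
path to the root carry a limit. The sketch below walks from a group directory
up to a given top directory and prints every directory whose quota is not
negative. The function name cfs_limiting_groups is hypothetical, and the
sketch assumes the cgroup hierarchy is visible as nested directories.

```shell
# cfs_limiting_groups GROUP_DIR TOP_DIR
# Hypothetical helper: prints GROUP_DIR and each ancestor up to TOP_DIR
# that has a bandwidth limit (cpu.cfs_quota_us >= 0). Directories without
# a readable cpu.cfs_quota_us file are skipped.
cfs_limiting_groups() {
    cfs_dir=$1
    while :; do
        if [ -r "$cfs_dir/cpu.cfs_quota_us" ]; then
            cfs_q=$(cat "$cfs_dir/cpu.cfs_quota_us")
            if [ "$cfs_q" -ge 0 ]; then
                echo "$cfs_dir"
            fi
        fi
        if [ "$cfs_dir" = "$2" ]; then
            break
        fi
        cfs_next=$(dirname "$cfs_dir")
        if [ "$cfs_next" = "$cfs_dir" ]; then
            break    # reached the filesystem root
        fi
        cfs_dir=$cfs_next
    done
}

# Example (the mount point and group names are assumptions):
#   cfs_limiting_groups /sys/fs/cgroup/cpu/parent/child /sys/fs/cgroup/cpu
```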

Examples
--------
1. Limit a group to 1 CPU worth of runtime.

	If period is 500ms and quota is also 500ms, the group will get
	1 CPU worth of runtime every 500ms.

	# echo 500000 > cpu.cfs_quota_us     # quota = 500ms
	# echo 500000 > cpu.cfs_period_us    # period = 500ms

2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.

	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
	runtime every 500ms.

	# echo 1000000 > cpu.cfs_quota_us    # quota = 1000ms
	# echo 500000 > cpu.cfs_period_us    # period = 500ms

3. Limit a group to 20% of 1 CPU.

	With 500ms period, 100ms quota will be equivalent to 20% of 1 CPU.

	# echo 100000 > cpu.cfs_quota_us     # quota = 100ms
	# echo 500000 > cpu.cfs_period_us    # period = 500ms
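
The quota values in the examples above follow directly from the desired
percentage and the period. As a small sketch, the hypothetical helper
cfs_quota_for_percent below does that arithmetic; it is not a kernel
interface, and it uses integer arithmetic that rounds down.

```shell
# cfs_quota_for_percent PERCENT PERIOD_US
# Hypothetical helper: prints the quota (microseconds) that corresponds to
# PERCENT of one CPU for the given period. 200 means two CPUs' worth.
cfs_quota_for_percent() {
    echo $(( $1 * $2 / 100 ))
}

cfs_quota_for_percent 100 500000   # prints 500000  (example 1)
cfs_quota_for_percent 200 500000   # prints 1000000 (example 2)
cfs_quota_for_percent 20 500000    # prints 100000  (example 3)

# The result could then be written to the group's quota file, e.g.:
#   cfs_quota_for_percent 20 500000 > cpu.cfs_quota_us
```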
