FairShare
FairShare allocates CPU cycles among groups and users according to policies defined by the administrators. FairShare is the dominant criterion in the scheduler, more important than job priorities. The fairness of CPU time allocation is computed using a multi-level FairShare tree. Each node in the tree is called a FairShare Group (abbreviated fsgroup) and is characterized by a name, a weight, and a time window.
Each job belongs to one and only one fsgroup. The contribution of each job to the FairShare mechanism is controlled by the parameter fstokens, which is 1 by default. If fstokens is 0, then the job does not contribute anything to the actual share (this is only rarely useful). If fstokens is 2, then the job contributes twice as much as a regular job with fstokens set to 1.
An fsgroup is "active" if some of its jobs are either running or queued, or, recursively, if any of its children is active.
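
The recursive definition of "active" can be sketched in a few lines of Python. The `FsGroup` class and its fields are hypothetical names used only for illustration; they are not part of the product's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FsGroup:
    """Hypothetical in-memory model of one FairShare tree node."""
    name: str
    running: int = 0  # jobs of this fsgroup that are currently running
    queued: int = 0   # jobs of this fsgroup that are currently queued
    children: list[FsGroup] = field(default_factory=list)


def is_active(group: FsGroup) -> bool:
    """Active = has running or queued jobs, or has at least one active child."""
    if group.running > 0 or group.queued > 0:
        return True
    return any(is_active(child) for child in group.children)


# Small example tree: only /time/users.joe has jobs.
joe = FsGroup("/time/users.joe", queued=3)
users = FsGroup("/time/users", children=[joe])
regr = FsGroup("/time/regr")
time_root = FsGroup("/time", children=[users, regr])
```

In this example `/time/users` and `/time` are active only because a descendant has queued jobs, while `/time/regr` is inactive.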
The FairShare tree can be surprisingly large. Some organizations have more than 16,000 nodes in the tree, on account of the large number of projects and users, although typically fewer than 100 fsgroups are active at any one time.
FairShare allocates resources to each active fsgroup so that the actual share of resources is as close as possible to the target share defined by the weights. To be clear, the fsgroups that are not active are not considered in the FairShare algorithm.
- The target share is computed from the fsgroup weights and allocated to all active fsgroups. The inactive fsgroups get a target of 0%.
- The actual share is computed from the contribution of each job, weighted by its fstokens parameter. It consists of two components: the "running actual share", based on the number of jobs currently running, and the "historical actual share", based on the overlap between the job execution time and the time window. Both components are weighted by fstokens.
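
As an illustration of the historical component, the sketch below sums, per fsgroup, the overlap between each job's execution interval and the time window, multiplied by fstokens. The function names, the tuple layout of `jobs`, and the final normalization into fractions are assumptions made for this example; the text above does not specify how the scheduler stores or normalizes these values.

```python
def overlap_seconds(start, end, window_start, now):
    """Seconds of the interval [start, end] that fall inside [window_start, now]."""
    return max(0.0, min(end, now) - max(start, window_start))


def historical_contributions(jobs, now, window):
    """Sum fstokens * overlap per fsgroup.

    jobs: iterable of (fsgroup, start, end, fstokens); end is None while running.
    """
    window_start = now - window
    totals = {}
    for fsgroup, start, end, fstokens in jobs:
        effective_end = now if end is None else end
        contribution = fstokens * overlap_seconds(start, effective_end, window_start, now)
        totals[fsgroup] = totals.get(fsgroup, 0.0) + contribution
    return totals


def to_shares(totals):
    """Normalize contributions into fractions of the whole (assumed convention)."""
    grand_total = sum(totals.values())
    return {g: (v / grand_total if grand_total else 0.0) for g, v in totals.items()}


# Times in seconds; a 2 hour window ending at now = 7200.
jobs = [
    ("/time/users", 0, 3600, 1),      # one hour inside the window
    ("/time/regr", 3600, None, 2),    # still running, counts double
    ("/time/users", -7200, -3600, 1), # finished before the window, contributes 0
]
totals = historical_contributions(jobs, now=7200, window=7200)
shares = to_shares(totals)
```

Here `/time/regr` ends up with twice the contribution of `/time/users` because its single job has fstokens set to 2.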
Each active fsgroup is assigned a "rank" computed from the difference between its target share and actual share. That rank is then assigned implicitly to all jobs that belong to that fsgroup. The fsgroup that has the highest deficit gets a rank of 0, while the fsgroup with the largest excess share gets a large rank (depending on the current number of active fsgroups). The rank determines which jobs are preferred for dispatch: the scheduler first considers dispatching the jobs that have lower rank, i.e. jobs from fsgroups under their target share. If those jobs cannot be dispatched because of other constraints, such as RAM or limits, then the scheduler considers jobs with higher rank.
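
The ordering described above can be sketched as a sort by deficit. This is a simplified illustration that uses a single deficit value (target minus actual) per fsgroup; the function name and the dictionaries are hypothetical, and the real scheduler combines a running and a historical distance.

```python
def rank_fsgroups(target, actual):
    """Return {fsgroup: rank}, with rank 0 for the largest deficit (target - actual)."""
    ordered = sorted(
        target,
        key=lambda g: target[g] - actual.get(g, 0.0),
        reverse=True,  # largest deficit first
    )
    return {fsgroup: rank for rank, fsgroup in enumerate(ordered)}


target = {"/time/regr": 0.50, "/time/users": 0.25, "/time/other": 0.25}
actual = {"/time/regr": 0.70, "/time/users": 0.25, "/time/other": 0.05}
ranks = rank_fsgroups(target, actual)
```

With these made-up numbers, `/time/other` is furthest below its target and gets rank 0, while `/time/regr` is over its target and gets the highest rank.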
The default FairShare window is 2 hours, meaning that only the time interval starting 2 hours before the present is considered. Normally all nodes in the FairShare tree have the same time window, but that is not a requirement. In particular, it is possible to set the time window of a node to zero to disable FairShare for the nodes under that node; in that case, all children receive the same rank.
Selecting the appropriate window size is a balance between responsiveness and accuracy. A wide time window requires more computation than a narrow window. The average job length and overall daily workload should be taken into account when selecting an appropriate window size.
As a rule of thumb, if your workload is small, i.e. under 100,000 jobs per day, do not worry about the FairShare window. If your workload exceeds 100,000 jobs per day, consider a shorter time window, such as 10 minutes. Workloads of millions of jobs per day can benefit from a time window of 2 minutes. Also relevant here is how often the actual shares are updated, which is controlled by the parameter fairshare.updatePeriod. The default value for this parameter is 0, meaning that the FairShare data is updated as frequently as needed, perhaps multiple times a second. For large workloads it may be a good idea to set that parameter to 3 or 5 seconds.
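
The rule of thumb can be written down as a small helper. The function name is hypothetical, and the exact cut-off of 1,000,000 jobs per day for "millions" is an assumption; treat the result as a starting point, not a tuned value.

```python
def suggested_window_seconds(jobs_per_day: int) -> int:
    """Map a daily workload to a FairShare window, following the rule of thumb."""
    if jobs_per_day >= 1_000_000:  # "millions of jobs per day" (cut-off assumed)
        return 2 * 60              # 2 minutes
    if jobs_per_day > 100_000:
        return 10 * 60             # 10 minutes
    return 2 * 60 * 60             # keep the default 2 hour window
```

For example, `suggested_window_seconds(500_000)` returns 600, i.e. a 10 minute window.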
FairShare Rank
An fsgroup's rank ranges from zero upward. Jobs from fsgroups ranked closest to zero are preferred for dispatch. The rank is computed by ordering the fsgroups by a combined "distance" between each fsgroup's target share and actual share. This distance has a "running" component and a "historical" component, the latter based on jobs in the window. The vovserver configuration parameters fairshare.relative and fairshare.relative_alpha control the influence of the historical versus running distance on the actual rank. Refer to Server Configuration for details.
- Multiple multi-level FairShare trees are supported. The default number of levels is 2.
- Each node in a FairShare tree has its own window size and weight.
- Ability to disable FairShare for a sub-tree by setting the window size to zero.
- Privileges are controlled with Access Control Lists (ACLs) for fine-grained control.
FairShare Tree Naming Conventions
Each fsgroup has a hierarchical name where the components are separated by a "/", similar to a file name. The default fsgroup is /time/users. The name can take one of the following three forms:
Type | Form | Example |
---|---|---|
FS-Group | HIERARCHICAL_GROUP_NAME | /time/users |
FS-User | HIERARCHICAL_GROUP_NAME.USER_NAME | /time/users.joe |
FS-Subgroup | HIERARCHICAL_GROUP_NAME.USER_NAME:SUBGROUP_NAME | /time/users.joe:myregression1 |
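Each component in the name has to be alphanumeric and can also contain the underscore character (_). The . character is not allowed except as the separator that introduces the user name in the FS-User form. The / and : characters are not allowed anywhere inside a component, since they serve as separators.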
Type | Example |
---|---|
FS-Group | /proj/sanjose/library/qa |
FS-User | /proj/sanjose/library/qa.john |
FS-Subgroup | /proj/sanjose/library/qa.john:mytest1 |
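
The three name forms can be checked with a regular expression. The following is a sketch based only on the naming rules stated above; the function name and the returned dictionary are hypothetical, and the root node "/" is deliberately not handled.

```python
import re

# One name component: letters, digits, and underscore only.
_COMPONENT = r"[A-Za-z0-9_]+"

# GROUP, optionally followed by .USER, optionally followed by :SUBGROUP.
_FSGROUP_RE = re.compile(
    rf"(?P<group>(?:/{_COMPONENT})+)"
    rf"(?:\.(?P<user>{_COMPONENT})(?::(?P<subgroup>{_COMPONENT}))?)?"
)


def parse_fsgroup_name(name: str) -> dict:
    """Split an fsgroup name into its group, user, and subgroup parts."""
    match = _FSGROUP_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid fsgroup name: {name!r}")
    user = match.group("user")
    subgroup = match.group("subgroup")
    if subgroup is not None:
        kind = "FS-Subgroup"
    elif user is not None:
        kind = "FS-User"
    else:
        kind = "FS-Group"
    return {"type": kind, "group": match.group("group"), "user": user, "subgroup": subgroup}


parsed = parse_fsgroup_name("/proj/sanjose/library/qa.john:mytest1")
```

For the name above, `parsed` has type `FS-Subgroup`, group `/proj/sanjose/library/qa`, user `john`, and subgroup `mytest1`. A name without a leading "/" raises `ValueError`.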
Each node in the FairShare tree has an owner who has the authority to set the weights for all the subnodes in the tree. For example, the owner of group /time/med can set the weights for /time/med/sanjose and any other nodes of the form /time/med/*.
Define FairShare Groups
The FairShare tree is dynamic and can be changed at any time. If you like a configuration, you can save it into a file and reload it at a later time. The main tool for these actions is vovfsgroup.
FairShare Command Line Utilities
ID GROUP OWNER WEIGHT WINDOW RUNNING QUEUED
000000016 / (server) 0 1h00m 0 0
000001012 /system cadmgr 100 0s 0 0
000001050 /system/taskers cadmgr 100 0s 0 0
000001053 /system/taskers/messages cadmgr 100 0s 0 0
000001056 /system/taskers/reservations cadmgr 100 0s 0 0
000001006 /time cadmgr 100 1h00m 0 0
000001009 /time/users cadmgr 100 1h00m 0 0
000001081 /time/users.cadmgr cadmgr 100 1h00m 0 0
% vovshow -groups
ID GROUP WEIGHT WINDOW
02223424 /system 100 1m00s
02223422 /time 100 1h00m
02223423 /time/users 1 2h00m
02223435 /time/users.cadmgr 100 1h00m
To show the details of a single fsgroup, you can use an additional argument to vovfsgroup show FSGROUPNAME:
% vovfsgroup show /time/users
Id: 000001009
FullName: /time/users
Owner: cadmgr
Weight: 100
Window: 1h00m
Rank: -1
ACL 1: OWNER "" ATTACH DETACH EDIT VIEW STOP FORGET DELEGATE EXISTS
ACL 2: EVERYBODY "" ATTACH VIEW
000001081 /time/users.cadmgr 100 cadmgr
You can also use vovfsgroup to create fsgroups and to change their weight. You can try the following commands as the ADMIN user for your Accelerator instance:
% vovproject enable vnc
% vovfsgroup create /app/primetime
% vovfsgroup create /app/spice
% vovfsgroup create /app/other
% vovfsgroup modify /app/primetime weight 300
% vovfsgroup modify /app/spice weight 100
% vovfsgroup modify /app/other weight 20
% vovfsgroup modrec /app window 1h
% vovfsgroup exists /app/other
% vovfsgroup exists /app/not_there
% vovfsgroup genconfig saved_my_cool_config.tcl ### Important to use the .tcl extension
% vovfsgroup delete /app
% vovfsgroup loadconfig saved_my_cool_config.tcl
For more information, refer to Configure FairShare via the vovfsgroup Utility.
Monitor FairShare
% nc monitor
From the browser interface, visit the Project Home page and then select the FairShare link.
Target Share Example
In the following example, it is assumed that two groups are defined: the default group /time/users and another group named /time/regr. The users maureen and murali are members of the /time/users group. User john is a member of the /time/regr group. It is also assumed that all users have jobs queued. The target shares are then determined using the two-tier method as follows.
/time/regr share = 100/(100+100) 50%
/time/users share = 100/(100+100) 50%
In the /time/regr group, only john has jobs. That user gets 100% of the group's share, or 50% of the overall cycles.
/time/regr.john share = 100/(100+100) 50% * 100% = 50%
/time/users share = 100/(100+100) 50%
In the /time/users group, two users have jobs as shown below:
/time/users.maureen share = 10/(10+10) 50% * 50% grp = 25%
/time/users.murali share = 10/(10+10) 50% * 50% grp = 25%
If another user, suresh, who is a member of the /time/users group, submits jobs that are queued, the target shares change as follows:
/time/users.maureen share = 10/(10+10+10) 33% * 50% grp = 16.7%
/time/users.murali share = 10/(10+10+10) 33% * 50% grp = 16.7%
/time/users.suresh share = 10/(10+10+10) 33% * 50% grp = 16.7%
Since suresh just entered the queue, his actual share will probably be much less than the target share. Therefore, his jobs will be launched ahead of those of the other users as the system tries to bring his actual share up to his target share. An example of the overall FairShare picture is shown below (target shares shown):
/time/regr.john share = 100/100 100% * 50% grp = 50.0%
/time/users.maureen share = 10/(10+10+10) 33% * 50% grp = 16.7%
/time/users.murali share = 10/(10+10+10) 33% * 50% grp = 16.7%
/time/users.suresh share = 10/(10+10+10) 33% * 50% grp = 16.7%
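
The two-tier arithmetic in this example can be reproduced with a short Python sketch. The function name and the dictionary layout are assumptions made for illustration; only active groups and users are included, since inactive ones get a target of 0%.

```python
def target_shares(groups):
    """Compute per-user target shares with the two-tier method.

    groups: {group_name: (group_weight, {user_name: user_weight})},
    listing only active groups and users.
    """
    total_group_weight = sum(weight for weight, _ in groups.values())
    result = {}
    for group_name, (group_weight, users) in groups.items():
        group_share = group_weight / total_group_weight  # first tier
        total_user_weight = sum(users.values())
        for user_name, user_weight in users.items():
            user_fraction = user_weight / total_user_weight  # second tier
            result[f"{group_name}.{user_name}"] = group_share * user_fraction
    return result


active = {
    "/time/regr": (100, {"john": 100}),
    "/time/users": (100, {"maureen": 10, "murali": 10, "suresh": 10}),
}
shares = target_shares(active)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The printed values match the table above: 50.0% for /time/regr.john and 16.7% for each of the three users in /time/users.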