Accelerator Plus User Guide
Accelerator Plus is a high-performance hierarchical scheduler designed for distributed High Performance Computing (HPC) environments.
Accelerator Plus is based on the patented concepts described in US Patent 9,658,893 about multi-layered resource scheduling.
The current implementation is designed to run in conjunction with a base scheduler such as Accelerator™ or Altair PBS Professional©.
With its sub-millisecond latency, Accelerator Plus improves the throughput of difficult workloads, especially those consisting of large numbers of short-duration jobs, possibly with complex dependencies, while offloading the base scheduler. Accelerator Plus allows any user or group to have their own high-performance scheduler without requiring the intervention of the IT department. Because all computing resources are negotiated through the base scheduler, Accelerator Plus always obeys the policies established by IT for sharing those resources.
Theory of Operation
During the initial setup, the Accelerator Plus host server (vovserver) establishes a main port for communication and additional ports for web access and read-only access. Afterwards, the vovserver waits for and responds to incoming connection requests from clients.
Clients consist of regular clients that request a particular service, taskers (server farms) that provide computing resources, and notify clients that listen for events.
A fresh instance of Accelerator Plus typically has only one persistent or permanent tasker, dedicated to launching requests to get more taskers from the underlying base scheduler, depending on the workload.
Regular clients can submit the workload, which consists of one or more jobs, or query data about jobs or system status. When a job is created, it is placed in a queued state. Queued jobs are sorted into buckets. Jobs that have the same characteristics go in the same bucket.
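The bucketing step can be illustrated with a minimal sketch. It is not the product's implementation: the `Job` fields, the resource strings, and the choice of (resources, user) as the bucket key are assumptions made only to show how jobs with the same characteristics end up in the same bucket.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; the field names are illustrative only."""
    job_id: int
    command: str
    resources: str  # illustrative resource string, not real product syntax
    user: str


def bucket_key(job):
    # Assumption: two jobs have "the same characteristics" when they
    # request the same resources and belong to the same user.
    return (job.resources, job.user)


def sort_into_buckets(jobs):
    """Group queued jobs so that each bucket holds jobs with equal keys."""
    buckets = defaultdict(list)
    for job in jobs:
        buckets[bucket_key(job)].append(job)
    return dict(buckets)


queued = [
    Job(1, "make lib", "cpu=1 mem=4G", "alice"),
    Job(2, "make app", "cpu=1 mem=4G", "alice"),
    Job(3, "run sim", "cpu=4 mem=16G", "alice"),
]
buckets = sort_into_buckets(queued)
for key, jobs in buckets.items():
    print(key, [job.job_id for job in jobs])
```

In this sketch, jobs 1 and 2 share a bucket because their keys are equal, while job 3 gets a bucket of its own.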
Each job bucket is analyzed by an external daemon called vovwxd. If a bucket is waiting for hardware resources, the daemon issues a request to the underlying base scheduler for resources that match that job bucket. In other words, Accelerator Plus asks the base scheduler for a tasker that can run the jobs in a specific bucket. Once the base scheduler grants the request by running a proxy job, the submitted wx-tasker connects back to the Accelerator Plus instance, advertising the available resources. Jobs from the matching bucket then begin executing without any further intervention from the base scheduler. Multiple buckets, and multiple jobs from each bucket, can be serviced concurrently. With a large base scheduler and a significant workload, thousands of jobs can run concurrently.
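The request cycle described above can be sketched as a small simulation. Everything here is hypothetical: `Bucket`, `FakeBaseScheduler`, `daemon_pass`, and the `max_taskers` cap are illustrative names and policies, not the actual vovwxd logic or any base scheduler API.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Hypothetical view of a job bucket as seen by the daemon."""
    resources: str
    queued_jobs: int
    taskers: int = 0  # wx-taskers already connected for this bucket
    pending: int = 0  # requests submitted but not yet granted


class FakeBaseScheduler:
    """Stand-in for the base scheduler; it only records requests."""

    def __init__(self):
        self.requests = []

    def submit_tasker_request(self, resources):
        self.requests.append(resources)


def daemon_pass(buckets, base, max_taskers=2):
    """One pass over all buckets: request taskers where hardware is missing."""
    for bucket in buckets:
        # Assumed policy: never want more taskers than queued jobs or the cap.
        wanted = min(bucket.queued_jobs, max_taskers)
        shortfall = wanted - bucket.taskers - bucket.pending
        for _ in range(max(shortfall, 0)):
            base.submit_tasker_request(bucket.resources)
            bucket.pending += 1


base = FakeBaseScheduler()
buckets = [
    Bucket("cpu=1 mem=4G", queued_jobs=5),
    Bucket("cpu=4 mem=16G", queued_jobs=0),
    Bucket("cpu=2 mem=8G", queued_jobs=1, taskers=1),
]
daemon_pass(buckets, base)
first_pass = list(base.requests)
daemon_pass(buckets, base)  # nothing new: pending requests are counted
```

Tracking pending requests keeps a repeated pass from flooding the base scheduler while earlier requests are still waiting to be granted.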
When a job completes, the wx-tasker notifies the vovserver. The resources, both tasker-based and central, are recovered, allowing subsequent jobs (queued in the buckets) to be dispatched. The status of the completed job is updated to either VALID or FAILED.
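The completion handling can be sketched as follows. This is a simplified model under stated assumptions: the `QUEUED` and `RUNNING` state names, the slot-based `Tasker`, and the rule that exit code 0 maps to VALID are illustrative choices; only the VALID and FAILED outcomes come from the description above.

```python
from collections import deque
from enum import Enum


class Status(Enum):
    QUEUED = "QUEUED"    # assumed name for the queued state
    RUNNING = "RUNNING"  # assumed name for the executing state
    VALID = "VALID"
    FAILED = "FAILED"


class Tasker:
    """Hypothetical tasker that offers a fixed number of job slots."""

    def __init__(self, slots):
        self.free_slots = slots


def dispatch(bucket, tasker, status):
    """Start queued jobs while the tasker still has free slots."""
    started = []
    while bucket and tasker.free_slots > 0:
        job_id = bucket.popleft()
        tasker.free_slots -= 1
        status[job_id] = Status.RUNNING
        started.append(job_id)
    return started


def on_job_complete(job_id, exit_code, bucket, tasker, status):
    """Record the result, recover the slot, and dispatch follow-on work."""
    # Assumption: a zero exit code means VALID, anything else FAILED.
    status[job_id] = Status.VALID if exit_code == 0 else Status.FAILED
    tasker.free_slots += 1
    return dispatch(bucket, tasker, status)


bucket = deque([1, 2, 3])
status = {job_id: Status.QUEUED for job_id in bucket}
tasker = Tasker(slots=2)
first = dispatch(bucket, tasker, status)               # jobs 1 and 2 start
after = on_job_complete(1, 0, bucket, tasker, status)  # job 3 takes the freed slot
```

The point of the sketch is the ordering: the freed slot is returned before the next dispatch, so a queued job from the same bucket can start without another round trip to the base scheduler.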
In addition to dispatching jobs and processing their status, the vovserver responds to queries about system and job requests, publishes events to notify clients, and continues to process incoming job requests.
Examples of Modes of Operation
- Single User Mode, Persistent
- Here, an Accelerator Plus instance is started on a dedicated compute node using a role account. Another application, for example a Jenkins build server, is used to create the workload. In this scenario, Accelerator Plus is used primarily as an efficient distributed build engine that interfaces with the base scheduler. Multiple Accelerator Plus instances can be deployed concurrently to accelerate multiple flows in the form of execution "lanes." The underlying scheduler is used to balance the resource allocation across the Accelerator Plus instances.
- Single User Mode, On-Demand
- This mode is similar to the first, but the Accelerator Plus instance itself also runs on the underlying batch system. Upon completion of the workload, the Accelerator Plus instance is halted and all compute resources are returned to the farm. This model is useful for occasional, self-contained, resource-intensive workloads.
- Multi User Mode, Persistent
- This mode implements full hierarchical scheduling. The Accelerator Plus instance runs on a dedicated node with a publicly known host name and port number. Multiple Accelerator Plus instances can be used concurrently to provide each team with its own scheduler. While it is possible to allocate Accelerator Plus instances on a per-project basis, the preferred allocation method is on a functional or workload basis. For example, providing an Accelerator Plus instance for each of the Design Verification, Circuit Design, and Physical Design teams allows similar workflows to be grouped together on a single Accelerator Plus instance. Commonality of workflow within an Accelerator Plus instance allows better tuning while sharing a common base scheduler.