skill-tree:k:4:1:b
K4.1 Basic principles of Job Scheduling
This skill provides an overview of the scheduling of jobs on a supercomputer. It covers generic and widely used concepts that serve the purpose to maximize the efficiency of a supercomputer.
Batch jobs submitted to a job queue define the workloads in batch systems. A workload manager of a cluster system typically deals with:
- Job Control to provide a user interface for submitting jobs to job queues, monitoring their state during processing (e.g. to check their estimated starting time), and intervening in their execution (e.g. to abort them manually)
- Scheduling and Resource Management to select a waiting job for execution and to allocate nodes to the job meeting all its other demands for computing resources (memory, special processing elements like GPUs, etc.)
- Accounting to record historical data about how many computing resources (e.g. computing time) have been consumed by a job
Learning Outcomes
- Comprehend the exclusive and shared usage model in HPC.
- Differentiate batch and interactive job submission.
- Comprehend the generic concepts and architecture of resource manager, scheduler, job and job script.
- Explain environment variables as a means to communicate.
- Comprehend accounting principles.
- Explain the generic steps to run and monitor a single job.
- Comprehend scheduling principles (first come first served, shortest job first, backfilling) to achieve objectives like minimizing the averaged elapsed program runtimes, and maximizing the utilization of the available HPC resources.
- Comprehend the differences between Batch Systems and Time-Sharing Systems.
- Explain the concepts and procedures for resource allocation and job execution in an HPC environment.
- Run interactive jobs and batch jobs.
- Comprehend and describe the expected behavior of job scripts.
- Change provided job scripts and embed them into shell scripts to run a variety of parallel applications.
- Analyze the output generated from a job scheduler and describe the cause of typically generated errors.
- Comprehend accounting principles (billing for the jobs).
- Comprehend the set of terms for performance criteria like:
- Resource Utilization.
- Throughput.
- Waiting Time.
- Execution Time.
- Turnaround Time.
- Comprehend scheduling strategies that increase productivity.
- Comprehend that typical goals of job scheduling are:
- Maximization of resource utilization.
- Maximization of throughput.
- Minimization of waiting time.
- Minimization of turnaround time.
- Comprehend that there is a variety of scheduling algorithms from rather simple to more complex like:
- First-Come-First-Served (FCFS).
- Shortest-Job-First (SJF).
- Priority-based.
- Fair-Share.
- Backfilling.
- Apply advanced scheduling principles (e.g. backfilling) to achieve objectives like minimizing the averaged elapsed program runtimes, and maximizing the utilization of the available HPC resources.
- Discuss sophisticated scheduling principles (e.g. fair share) to achieve objectives like treating the users fair, and maximizing the utilization of the available HPC resources.
skill-tree/k/4/1/b.txt · Last modified: 2025/04/16 18:30 by 127.0.0.1