User Tools

Site Tools


skill-tree:k:4:1:b

K4.1 Basic principles of Job Scheduling

This skill provides an overview of the scheduling of jobs on a supercomputer. It covers generic and widely used concepts that serve the purpose to maximize the efficiency of a supercomputer.

Batch jobs submitted to a job queue define the workloads in batch systems. A workload manager of a cluster system typically deals with:

  • Job Control to provide a user interface for submitting jobs to job queues, monitoring their state during processing (e.g. to check their estimated starting time), and intervening in their execution (e.g. to abort them manually)
  • Scheduling and Resource Management to select a waiting job for execution and to allocate nodes to the job meeting all its other demands for computing resources (memory, special processing elements like GPUs, etc.)
  • Accounting to record historical data about how many computing resources (e.g. computing time) have been consumed by a job

Learning Outcomes

  • Comprehend the exclusive and shared usage model in HPC.
  • Differentiate batch and interactive job submission.
  • Comprehend the generic concepts and architecture of resource manager, scheduler, job and job script.
  • Explain environment variables as a means to communicate.
  • Comprehend accounting principles.
  • Explain the generic steps to run and monitor a single job.
  • Comprehend scheduling principles (first come first served, shortest job first, backfilling) to achieve objectives like minimizing the averaged elapsed program runtimes, and maximizing the utilization of the available HPC resources.
  • Comprehend the differences between Batch Systems and Time-Sharing Systems.
  • Explain the concepts and procedures for resource allocation and job execution in an HPC environment.
  • Run interactive jobs and batch jobs.
  • Comprehend and describe the expected behavior of job scripts.
  • Change provided job scripts and embed them into shell scripts to run a variety of parallel applications.
  • Analyze the output generated from a job scheduler and describe the cause of typically generated errors.
  • Comprehend accounting principles (billing for the jobs).
  • Comprehend the set of terms for performance criteria like:
    • Resource Utilization.
    • Throughput.
    • Waiting Time.
    • Execution Time.
    • Turnaround Time.
  • Comprehend scheduling strategies that increase productivity.
  • Comprehend that typical goals of job scheduling are:
    • Maximization of resource utilization.
    • Maximization of throughput.
    • Minimization of waiting time.
    • Minimization of turnaround time.
  • Comprehend that there is a variety of scheduling algorithms from rather simple to more complex like:
    • First-Come-First-Served (FCFS).
    • Shortest-Job-First (SJF).
    • Priority-based.
    • Fair-Share.
    • Backfilling.
  • Apply advanced scheduling principles (e.g. backfilling) to achieve objectives like minimizing the averaged elapsed program runtimes, and maximizing the utilization of the available HPC resources.
  • Discuss sophisticated scheduling principles (e.g. fair share) to achieve objectives like treating the users fair, and maximizing the utilization of the available HPC resources.
skill-tree/k/4/1/b.txt · Last modified: 2025/04/16 18:30 by 127.0.0.1