Every night, while the bank's tellers have gone home and the CICS transaction load has dropped to near zero, a different kind of work begins. Millions of account records are updated with interest. Every transaction from the trading day is reconciled. Customer statements are generated. Risk models are recalculated. None of this work has a user waiting for a response. None of it runs in real time. It runs in batch.
Batch processing is the other half of the mainframe workload, as fundamental as CICS and far older. It predates interactive computing entirely: the original IBM mainframes processed nothing but batch jobs. The system that manages this workload on z/OS is JES: the Job Entry Subsystem. JES accepts jobs, queues them, schedules them for execution, and handles their output. It has done this continuously, in some form, since the early 1960s.
This post assumes you have read Architecture: Mainframe. It covers how JES2 is structured, how jobs move through the system from submission to completion, how the spool works, and how batch processing integrates with the rest of the z/OS environment.
The Big Picture
JES is a privileged z/OS subsystem that runs in its own address space. It starts early in the IPL sequence, before most other subsystems, because almost everything else depends on it. CICS started tasks are managed through JES. TSO sessions are managed through JES. Even system messages written to the operator console pass through JES. But the primary reason JES exists is batch job processing.
There are two variants: JES2 and JES3. JES2 descends from HASP (the Houston Automatic Spooling Priority system), developed at NASA's Manned Spacecraft Center in Houston (now Johnson Space Center) in the 1960s. JES3 descends from ASP (Attached Support Processor). As of z/OS 3.1 (released September 2023), IBM no longer ships JES3 as part of z/OS: JES2 is now the standard. JES3 continues as a separately licensed product from Phoenix Software International for sites that have not yet migrated. This post covers JES2.
The core concept is the spool. Rather than reading job input directly from a device and writing output directly to a printer, JES routes everything through a set of DASD datasets called the spool. Job input is written to spool on arrival. Job output is written to spool as it is produced. The spool is the queue: it holds every job and its associated data, at every stage of processing, until the job is purged. This decoupling of input, execution, and output is what makes parallel job processing possible.
The Spool
The spool is a set of dedicated datasets on DASD. Everything JES2 needs to track about a job lives on the spool: the JCL that was submitted, the SYSIN data (inline input data), the SYSOUT output written by each step, job control blocks, and output class assignments. A large production spool might hold thousands of jobs simultaneously at various stages of processing.
The spool uses JES2's own internal format, not any standard access-method organization such as VSAM. JES2 allocates space on the spool in fixed-size units called track groups, each spanning a configurable number of DASD tracks. Each job occupies one or more track groups. JES2 maintains an in-memory spool directory that maps job IDs to their spool locations.
Spool space is a finite resource. In a busy production environment, operators monitor spool utilization closely. If the spool fills up, JES2 cannot accept new jobs. Spool housekeeping (purging output that has been printed or released) is a routine operational task. Automated spool management tools are standard in most shops.
The checkpoint dataset is a companion to the spool. JES2 periodically writes its in-memory control blocks to the checkpoint dataset. If JES2 fails and must be restarted, it reads the checkpoint to reconstruct its state: which jobs are on the spool, what stage they are at, which initiators were running. The checkpoint interval is a tuning parameter: too frequent and it adds I/O overhead; too infrequent and a restart takes longer to recover state.
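Both the spool and the checkpoint are defined in the JES2 initialization deck. A sketch of the two relevant statements, with dataset names, volume serials, and sizes that are purely illustrative rather than recommendations:

```jcl
/* Spool: datasets named SYS1.HASPACE on volumes whose serials   */
/* begin with SPOOL; TGSPACE caps the total number of track      */
/* groups JES2 will manage.                                      */
SPOOLDEF DSNAME=SYS1.HASPACE,VOLUME=SPOOL,TGSPACE=(MAX=16288)
/* Checkpoint: two copies kept in duplex mode, so a media error  */
/* on one checkpoint dataset does not lose JES2's state.         */
CKPTDEF  CKPT1=(DSNAME=SYS1.JES2.CKPT1,VOLSER=CKPTV1,INUSE=YES),
         CKPT2=(DSNAME=SYS1.JES2.CKPT2,VOLSER=CKPTV2,INUSE=YES),
         MODE=DUPLEX
```

Exact parameter syntax varies by JES2 release; consult the initialization and tuning reference for the level you run.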
Job Classes and Priorities
Every batch job has two scheduling attributes that determine when it runs: a job class and a priority.
A job class is a single character (A through Z, 0 through 9) specified on the `CLASS=` parameter of the JCL `JOB` statement. Job classes are the primary routing mechanism. A site might define class `A` for short-running production jobs, class `B` for long-running production jobs, class `T` for test jobs, and class `X` for very large jobs that should only run during the weekend window. Initiators are configured to process specific classes, so the class determines which initiators can pick up a job.
Within a class, jobs are ordered by priority: a numeric value from 0 to 15 specified on `PRTY=`. Higher numbers execute first. A priority 15 job in class A will always be selected before a priority 8 job in class A. Priority can be changed while a job is waiting in the queue using a JES2 operator command.
Jobs can also be placed in held status: `TYPRUN=HOLD` on the JCL `JOB` statement submits the job but prevents it from being selected until an operator releases it. This is used for jobs that need manual verification before running, or for pre-staging jobs that will be released at a specific time.
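All three scheduling attributes live on the JOB statement. A sketch, with a hypothetical job name, accounting field, and class assignment:

```jcl
//* Class B (long-running production), priority 10, submitted in
//* held status; an operator later releases it with a JES2 $A command.
//NIGHTUPD JOB (ACCT123),'NIGHTLY UPDATE',
//             CLASS=B,PRTY=10,TYPRUN=HOLD,
//             MSGCLASS=X
```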
Initiators
An initiator is a z/OS address space whose sole purpose is to run batch jobs. It waits for work; when JES2 selects a job for it, the initiator loads the first step's program and drives execution through each step of the job. When the job finishes, the initiator returns to JES2 and asks for the next one.
Initiators are configured with a list of job classes they will process, in priority order. An initiator configured for `C=(A,B,T)` will first look for class A jobs. If none are queued, it looks for class B. If none there, it takes class T. If nothing is available in any of its classes, the initiator waits. This class hierarchy gives operators a simple mechanism for managing throughput: to drain class T (test) jobs during a busy period, configure all initiators away from class T.
The number of initiators running determines the degree of batch parallelism. Ten initiators mean up to ten batch jobs can execute simultaneously. In modern z/OS environments, WLM (Workload Manager) can manage initiators dynamically rather than relying on static operator configuration. WLM starts additional initiators when the job backlog grows and stops them when it shrinks, subject to the performance goals defined in the WLM service policy.
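Initiator class lists are set in the JES2 initialization deck and can be adjusted with operator commands while the system runs. A sketch of the initialization statements (exact syntax varies by JES2 release):

```jcl
/* Ten JES2-managed initiator partitions; initiator 3 looks for  */
/* class A work first, then class B, then class T.               */
INITDEF  PARTNUM=10
INIT(3)  CLASS=(A,B,T)
```

At runtime, operator commands of the form `$P I3` (drain) and `$S I3` (start) let operators adjust the initiator mix without an IPL.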
Job Steps and Return Codes
A batch job is not a single program. It is a sequence of steps, each defined by an `EXEC` statement in the JCL. Each step runs one program, called via its load module name. Steps execute sequentially. A step can reference output datasets produced by earlier steps, creating a data pipeline within a single job.
Each step ends with a return code: a numeric value set by the program before it terminates. By convention, 0 means success, 4 means warnings, 8 means errors, 12 or higher means serious failure. The JCL `COND` parameter (or the modern `IF/THEN/ELSE` JCL construct) allows steps to be conditionally skipped based on the return code of a previous step. A step that produces return code 8 can cause all subsequent steps to be bypassed, effectively aborting the job at that point.
Return codes are the primary mechanism for batch job chaining and error handling. A well-designed batch job uses return codes throughout: the extract step signals whether data was found, the sort step signals whether the sort completed cleanly, the update step signals how many records were processed. Monitoring tools and schedulers read return codes to determine whether a job completed successfully and whether dependent jobs should be released.
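This conditional logic is expressed directly in JCL. A sketch using the `IF/THEN` construct, with hypothetical program names:

```jcl
//* Run the sort and update steps only if the extract step ended
//* with return code 4 or less; otherwise both are bypassed.
//EXTRACT EXEC PGM=EXTRPGM
//CHKRC   IF (EXTRACT.RC LE 4) THEN
//SORTSTP EXEC PGM=SORT
//UPDATE  EXEC PGM=UPDTPGM
//        ENDIF
```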
Output: SYSOUT and Output Classes
Any output a batch program produces that is not written to a permanent dataset goes to a SYSOUT dataset, defined in the JCL with `DD SYSOUT=`. The character after `SYSOUT=` is the output class: a routing mechanism analogous to job classes, but for output rather than execution.
Output class `A` might route to the local printer. Class `B` might route to an email distribution system. Class `X` might hold output for manual review. The system programmer defines what each output class means in the JES2 configuration. A COBOL batch program writing to `DD SYSOUT=*` sends its output to the job's message class (the `MSGCLASS` parameter on the JOB statement).
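In JCL these routings are just one character per DD statement. A sketch using the hypothetical class meanings above:

```jcl
//REPORT   DD SYSOUT=A      local print
//AUDIT    DD SYSOUT=X      held for manual review
//SYSPRINT DD SYSOUT=*      routed to the job's MSGCLASS
```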
SYSOUT data written during a job's execution accumulates on the spool. After the job finishes, output writers process the spool output: they read each SYSOUT dataset and deliver it to its destination (print, file, email, or hold). Until an output writer has processed the output and it has been released, the spool space it occupies remains allocated. Output management is therefore directly tied to spool capacity.
SDSF (System Display and Search Facility) is the primary tool for interacting with JES2 output. Operators and developers use SDSF to view job output, check job status, release held output, and purge completed jobs. SDSF runs as an ISPF application under TSO, presenting a panel-based interface to the JES2 spool.
JES2 in a Sysplex: Multi-Access Spool
In a Parallel Sysplex, multiple z/OS LPARs each run their own JES2 instance. By default each JES2 is independent. But for production environments that want batch jobs to execute on any available LPAR, JES2 supports a MAS (Multi-Access Spool) configuration: multiple JES2 members sharing the same spool datasets and job queue.
In a MAS, a job submitted on LPAR 1 can be executed by an initiator on LPAR 2 or LPAR 3, whichever has capacity. All members see the same queue, the same job status, and the same spool data. JES2 uses the checkpoint dataset (shared in a MAS) to coordinate state between members. The result is a single logical batch processing environment spanning multiple physical systems.
MAS is the batch equivalent of CICSPlex dynamic routing: work is distributed across the sysplex without the submitting application needing to know which system will run it. WLM is aware of the MAS and can distribute initiators across members based on each system's current load.
The Job Scheduler: TWS and CA7
JES2 handles the mechanics of job execution. It does not handle job scheduling in the business sense: deciding which jobs should run, in what order, at what times, and with what dependencies between them. That is the job of a batch scheduler.
The two dominant batch schedulers in the mainframe world are IBM Workload Scheduler for z/OS (formerly TWS, Tivoli Workload Scheduler) and Broadcom CA 7. Both work the same way at a high level: they hold a catalog of jobs and their dependencies, trigger jobs at scheduled times or when predecessor jobs complete, submit the JCL to JES2 via an internal reader, and monitor return codes to determine whether successor jobs should run.
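The internal reader is not reserved for schedulers: any job can submit another job by writing JCL records to it. A common sketch uses IEBGENER to copy a member of a JCL library to the internal reader (dataset and member names are hypothetical):

```jcl
//SUBMIT   EXEC PGM=IEBGENER
//SYSPRINT  DD SYSOUT=*
//SYSIN     DD DUMMY
//SYSUT1    DD DSN=PROD.JCLLIB(NIGHTUPD),DISP=SHR
//SYSUT2    DD SYSOUT=(A,INTRDR)
```

JES2 treats records arriving on `INTRDR` exactly like a job read from any other source: they are converted, queued, and scheduled by class and priority.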
A production batch schedule at a large bank might contain thousands of jobs with complex dependency chains. The nightly interest calculation cannot start until the transaction extract has finished. The statement generation cannot start until both the interest calculation and the fee calculation are complete. The final reconciliation cannot run until all update jobs have succeeded. The scheduler enforces these dependencies automatically, without any human intervention.
When a job in the chain fails (non-zero return code that exceeds the acceptable threshold), the scheduler holds all dependent jobs and raises an alert. An operator investigates, corrects the problem, and resubmits the failed job. The scheduler then releases the dependent jobs in the correct order. This is the operational model that runs overnight at virtually every bank, insurance company, and government agency that operates a mainframe.
How a Batch Job Moves Through the System
A scheduler submits a JCL job stream to JES2 via an internal reader at 11pm. JES2 assigns it job ID `JOB04217`, reads the JCL onto the spool, and runs the converter to check syntax and expand any cataloged procedures referenced in the JCL. If the conversion succeeds, the job is placed on the input queue with its class and priority.
Initiator 3, configured for class B, picks up the job. It reads the converted JCL from the spool, allocates the VSAM files specified in the first step's DD statements, and invokes the program named on the first `EXEC PGM=` statement. The COBOL program runs, reads its input file, processes records, writes updated records, and terminates with return code 0.
The initiator checks the `COND` parameter: return code 0 from step 1 means proceed. It allocates the second step's datasets and invokes the second program. This continues through each step. After the final step, the initiator returns the job to JES2 and marks it complete.
JES2 moves the job to the output queue. Output writers process each SYSOUT dataset: the job log goes to held output for the submitting user; the report file is written to a permanent DASD dataset; the summary report is sent to the class B output printer. JES2 notifies the scheduler via a return code event. The scheduler sees return code 0, marks the job successful, and releases its dependent jobs. The job is eventually purged from the spool. A type 30 SMF record is written with the job's resource consumption: CPU time, elapsed time, DASD I/O counts, storage used.
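The job stream in this walkthrough would look something along these lines; program and dataset names are invented for illustration:

```jcl
//NIGHTUPD JOB (ACCT123),'NIGHTLY POST',CLASS=B,MSGCLASS=X
//* Step 1: posting program reads and updates the VSAM master file.
//POST     EXEC PGM=POSTPGM
//MASTER    DD DSN=PROD.ACCT.MASTER,DISP=SHR
//SYSOUT    DD SYSOUT=*
//* Step 2: summary report to a permanent DASD dataset and to print
//* class B, run only if the posting step ended cleanly.
//CHKPOST  IF (POST.RC EQ 0) THEN
//REPORT   EXEC PGM=RPTPGM
//RPTFILE   DD DSN=PROD.NIGHTLY.REPORT,DISP=(NEW,CATLG,DELETE),
//             UNIT=SYSDA,SPACE=(CYL,(5,1)),RECFM=FB,LRECL=133
//SUMMARY   DD SYSOUT=B
//         ENDIF
```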
Failure Modes
The most common batch failure is a JCL error: a syntax mistake in the JCL that the converter detects before the job runs. The job ends flagged `JCL ERROR` and no steps execute. The diagnostic messages appear in the job log on the spool; the JCL must be corrected before resubmission.
An abend (abnormal end) occurs when a running program encounters an error it cannot handle: a program check, a dataset allocation failure, or an unhandled condition. The step terminates with a system abend code (e.g., `S0C7` for a data exception, `S322` for CPU time limit exceeded, `S806` for a load module not found). Subsequent steps may be bypassed depending on the COND parameter. Abend diagnosis involves reading the job log and, for complex program failures, the system dump.
Spool full is an operational emergency. No new jobs can be submitted and existing jobs cannot write SYSOUT. The immediate remedy is to purge old output from the spool. The root cause is usually a backlog of unprocessed output, a job that wrote an abnormally large amount of SYSOUT, or a printer that has been down and not releasing its output queue.
A job stuck in execution happens when a program loops or hangs waiting for a resource. The JCL `TIME=` parameter (and installation SMF defaults) provide time-based cancellation: a job that runs past its CPU time limit is abended by the system with `S322`. Operators can also cancel jobs manually with the JES2 `$C` command.
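The time limit in JCL is a sketch-simple parameter; step and program names here are hypothetical:

```jcl
//* Cap this step at five minutes of CPU time. Exceeding the
//* limit ends the step with a S322 abend.
//RISKCALC EXEC PGM=RISKPGM,TIME=(5,0)
```

The operator cancel takes the form `$C` followed by the job identifier, e.g. `$CJ4217` for the job in the walkthrough above.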
Summary
JES2 is the logistics engine of the mainframe. It accepts jobs from any source, stages them on the spool, routes them to the right initiators by class and priority, drives step-by-step execution through each program, collects output, and delivers it to its destination. The spool is the central shared storage that makes all of this parallel, asynchronous, and recoverable.
Batch processing is not a legacy concern on the mainframe: it is half the workload. Every bank's overnight interest run, every insurer's premium recalculation, every government benefit disbursement runs through JES2. The combination of JES2, a batch scheduler, WLM, and a MAS-configured sysplex gives production mainframe environments a batch processing capability that handles thousands of jobs per night with full dependency management, automatic failure recovery, and cross-system load distribution.
The next post covers JCL: the language used to define every job that JES2 processes.
Part of the Mainframe Decoded series — IBM Z and z/OS, clearly explained for engineers.