Training Module 19: Parallel Processing

The slides for the following video are shown below:

Slide 1

Welcome to the training course on IBM Scalable Architecture for Financial Reporting, or SAFR.  This is Module 19, Parallel Processing.

Slide 2

Upon completion of this module, you should be able to:

  • Describe SAFR’s Parallel Processing feature
  • Read a Logic Table and Trace with multiple sources
  • Set Trace parameters to filter out unneeded rows
  • Configure shared extract files across parallel threads
  • Set thread governors to conserve system resources
  • Debug a SAFR execution with multiple sources

Slide 3

In a prior module we noted that the Extract Engine, GVBMR95, is a parallel processing engine, while the Format Phase engine, GVBMR88, is not.  It is possible to run multiple Format Phase jobs in parallel, one processing each extract file, yet these jobs are independent of one another.  They share no resources, such as look-up tables for join processing.  In this module we will highlight the benefits of GVBMR95 multi-thread parallelism, and how to enable, control, and debug it.

Slide 4

Parallelism allows more resources to be used to complete a task in a shorter period of time.  For example, if a view requires reading 1 million records to produce the appropriate output, and the computer can process 250,000 records per second, producing the view will take at least 4 seconds.  If the file is divided into 4 parts processed in parallel, the output could be produced in 1 second.

Doing so requires adequate system resources, in the form of I/O channels, CPUs, and memory.  If, for example, our computer has only 1 CPU, then each file will effectively be processed serially, no matter how many files we have.  All parallel processing is resource constrained in some way.
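The arithmetic on this slide can be sketched directly (a minimal illustration; the record count and processing rate are the slide's figures):

```python
# Throughput arithmetic from the slide: 1 million records at 250,000
# records/second, scanned serially versus split into 4 parallel parts.
RECORDS = 1_000_000
RATE = 250_000                                  # records per second per reader
PARTS = 4

serial_seconds = RECORDS / RATE                 # one reader scans everything
parallel_seconds = (RECORDS / PARTS) / RATE     # four readers, one part each

print(serial_seconds)    # 4.0
print(parallel_seconds)  # 1.0
```

Note that with only 1 CPU, the four "parallel" readers would still execute serially, and the 4-second floor would remain.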

Slide 5

GVBMR95 is a multi-task parallel processing engine.  It does not require multiple jobs to create parallelism.  Using the Logic Table and VDP, GVBMR95 generates individual programs and then asks the operating system to execute each program as a subtask, one for each input Event File to be scanned for the required views.  This is all done within one z/OS job step.  Each of these subtasks independently reads data from its Event File and writes extract records to the extract files for all views reading that event file.

In this example, the combination of views contained in the VDP and Logic Table requires reading data from four different event files.  These views write output data to six different extract files.  The main GVBMR95 program will generate four different subtasks, corresponding to the four different input event files to be read.
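The thread model described above can be sketched in miniature (illustrative only, not SAFR code; the file names and record contents are invented):

```python
# A "mother task" starts one subtask (thread) per input event file; each
# subtask independently scans its file and writes extract records to a
# shared, thread-safe output.
import queue
import threading

def subtask(event_records, extract_queue):
    for record in event_records:
        # Write one extract record per event record read (stand-in logic).
        extract_queue.put(record.upper())

event_files = {                      # four event files, as in the example
    "ORDER001": ["a1", "a2"],
    "ORDER002": ["b1"],
    "ORDER003": ["c1"],
    "ORDER004": ["d1", "d2"],
}
extracts = queue.Queue()             # thread-safe shared extract output

threads = [threading.Thread(target=subtask, args=(records, extracts))
           for records in event_files.values()]
for t in threads:
    t.start()
for t in threads:                    # the mother task waits until all finish
    t.join()

print(extracts.qsize())  # 6 extract records from 6 event records
```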

Slide 6

Multi-threaded parallelism can provide certain benefits over multi-job parallelism.  These benefits include:

  • Shared Memory.  Only one copy of the reference data to be joined to must be loaded into memory.  Sharing memory across different jobs is much less efficient than within a single job.  Thus all sub tasks can efficiently access the memory resident tables
  • Shared I/O.  Data from multiple input files can be written to a single extract file for that view, allowing for shared output files
  • Shared Control.  Only one job needs to be controlled and monitored, since all work occurs under this single job step
  • Shared Processing.  In certain cases, data may be shared between threads when required, through the use of SAFR Piping or exit processing

Slide 7

Recall from an earlier lesson that:

  • A physical file definition describes a data source like the Order Files
  • A logical file definition describes a collection of one or more physical files
  • Views specify which Logical Files to read

The degree of potential parallelism is determined by the number of Physical Files.  GVBMR95 generates subtasks to read each Physical File included in every Logical File read by any View included in the VDP.

Slide 8

The SAFR metadata component Physical File contains the DD Name to be used to read the input event file.  This DD Name is used throughout the GVBMR95 trace process to identify the thread being processed.

Slide 9

Remember that GVBMR95 can also perform dynamic allocation of input files.  This means that even if a file is NOT listed in the JCL, if the file name is listed in the SAFR Physical File entity, GVBMR95 will instruct the operating system to allocate this named file.  GVBMR95 will first test whether the file is listed in the JCL before performing dynamic allocation; thus the SAFR Physical File entity file name can be overridden in the JCL if required.  Dynamic allocation can allow an unexpected view included in the VDP to run successfully even though no updates were made to the JCL.

Slide 10

The same Physical File can be included in multiple Logical Files.  This allows the SAFR developer to embed meaning in files that may be useful for view processing, such as a partition for stores within each state.  Another Logical File can be created which includes all states in a particular region, and another to include all states in the US.

In this module’s example, we have the ORDER001 Logical File reading the ORDER001 physical file, the ORDER005 Logical File reading the ORDER005 physical file, and the ORDERALL Logical File reading the ORDER001, 002, 003, 004 and 005 physical files.

Slide 11

In this module we have three sample views, each an Extract-Only view with the same column definitions (3 DT columns).  The only difference is the Logical File each reads.  View 148 reads only the ORDER001 file, view 147 reads only the ORDER005 file, and view 19 reads all the Order files, 1 through 5.

Slide 12

In the first run, each view in our example writes its output to its own extract file.  Output files are selected in the view properties tab.  The SAFR Physical File metadata component lists the output DD Name for each, similar to the input Event Files.

Slide 13

As GVBMR90 builds the logic table, it copies each view into the “ES Set” for the unique combination of source files it reads.  Doing so creates the template GVBMR95 needs to generate the individualized program that reads each source.  Note that output files are typically a single file for an entire view, and thus are typically shared across ES Sets.

In our example, three ES Sets are created.  The first includes views 148 and 19, each with their NV, DTEs, and WRDT functions.  The second ES Set is only view 19.  And the third is views 19 and 147.  The Output04 file will be written to by the multiple ES Sets that contain view 19.

Slide 14

This is the generated logic table for our sample views.  It contains only one HD Header function at the beginning and one EN End function at the end.  But unlike any of the previous Logic Tables, it has multiple RENX Read Next functions and ES End of Set functions.  Each RENX is preceded by a File ID row, with a generated File ID for the unique combination of files to be read.  Between the RENX and the ES is the logic for each view; only the GoTo True and False rows differ for the same logic table functions between each ES Set.

Our example begins with the HD function.  ES Set 1, reading file “22” (a generated file ID for the ORDER001 file), contains the view logic for views 148 and 19.  The second ES Set, for file “23” (Order files 2, 3, and 4), contains only view 19, and the third set, for file ID “24” (the ORDER005 file), contains the view logic for views 147 and 19 and ends with the EN function.

Next we’ll look at the initial messages for the Logic Table Trace.

Slide 15

When Trace is activated, GVBMR95 prints certain messages during initialization, before parallel processing begins.  Let’s examine these messages.

  • MR95 validates any input parameters listed in the JCL for proper keywords and values
  • MR95 loads the VDP file from disk to memory
  • MR95 next loads the Logic Table from disk to memory
  • MR95 clones ES Sets to create threads, discussed more fully on the next slide
  • MR95 loads the REH Reference Data Header file into memory, and from this table then loads each RED Reference Data file into memory.  During this process it checks for proper sort order of each key in each reference table
  • Using each function code in the logic table, MR95 creates a customized machine code program in memory.  The data in this section can assist SAFR support in locating in memory the code generated for specific LT Functions
  • Next MR95 opens each of the output files to be used.  Opening of input Event files is done within each thread, but because threads can share output files, they are opened before beginning parallel processing
  • MR95 updates various addresses in the generated machine code in memory
  • Having loaded all necessary input and reference data, generated the machine code for each thread, and opened output files, MR95 then begins parallel processing.  It does this by instructing the operating system to execute each generated program.  The main MR95 program (sometimes called the “Mother Task”) then goes to sleep until all the subtasks are completed.  At this point, if no errors have occurred in any thread, the mother task wakes up, closes all extract files, and prints the control report.

Slide 16

The GVBMR90 logic table only includes ES Sets, not individual threads.  When GVBMR95 begins, it detects if multiple files must be read by a single ES Set.  If so, it clones the ES Set logic to create multiple threads from the same section of the logic table.

In our example views, the single ES Set for the Order 2, 3 and 4 files is cloned during MR95 initialization to create individual threads for each input file.
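A minimal sketch of this cloning step (illustrative; the ES Set labels are invented):

```python
# Each ES Set in the logic table lists the files it reads; at startup the
# engine clones any set reading multiple files so every file gets a thread.
es_sets = {
    "ES1": ["ORDER001"],                          # views 148 and 19
    "ES2": ["ORDER002", "ORDER003", "ORDER004"],  # view 19 only
    "ES3": ["ORDER005"],                          # views 147 and 19
}

# One (ES Set, file) pair per thread: ES2 is cloned into three threads.
threads = [(es, f) for es, files in es_sets.items() for f in files]
print(len(threads))  # 5 threads from 3 ES Sets
```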

Slide 17

During parallel processing GVBMR95 prints certain messages to the JES Message Log:

  • The first message shows the number of threads started, and the time parallel processing began
  • As each thread finishes processing, it prints a message showing the time it completed, the thread number completed, the DD Name of the input file being read, and the count of input records read for that thread

Each thread is independent of (asynchronous with) every other thread, and of the sleeping Mother Task during thread processing.  The order and length of work each performs is under the control of the operating system.  A thread may process one or many input event records in bursts, and then be swapped off the CPU to await some system resource or higher-priority work.  Thus the timing of starting and stopping for each thread cannot be predicted.

Slide 18

The Trace output is shared by all processing threads.  Because threads process independently, under the control of the operating system, the order of the records written to the trace is unpredictable.  There may be long bursts of trace records from one thread, followed by a long burst from another thread.  This would be the case if only two threads were processed (for two input event files) on a one-CPU machine (in which case there would be no true parallel processing either).  How long each thread remains on the CPU is determined by the operating system.

Alternatively, records from two or more threads may be interleaved in the trace output as they are processed by different CPUs.  Thus the EVENT DDNAME column becomes critical to determining which thread a specific trace record belongs to.  The DD Name value is typically the input Event file DD name.  The Event Record count (the next column) always increases for a particular thread; thus record 3 is always processed after record 2 for one specific DD Name.

In this example,

  • The ORDER002 thread begins processing, and processes input event records 1 – 13 for view 19 before any other thread processes. 
  • It stops processing for a moment, and ORDER001 begins processing, but completes only 1 input event record for views 148 and 19 before stopping.
  • ORDER002 picks back up and processes record 14 for its one view. 
  • ORDER005 finally begins processing, completing input record 1 for both views 147 and 19. 
  • ORDER001 begins processing again, record 2 for both views 148 and 19.

Note that this portion of the trace does not show any processing for threads ORDER003 and ORDER004.
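The attribution rule above can be sketched as follows (the trace lines are invented for illustration, mirroring the example):

```python
# Group interleaved trace lines by the EVENT DDNAME column; within one
# DD name the event-record count only ever increases.
trace = [                       # (EVENT DDNAME, event record count)
    ("ORDER002", 13), ("ORDER001", 1), ("ORDER002", 14),
    ("ORDER005", 1), ("ORDER001", 2),
]

by_thread = {}
for ddname, record_no in trace:
    by_thread.setdefault(ddname, []).append(record_no)

# The combined trace interleaves threads unpredictably, but each thread's
# record numbers are monotonically increasing.
assert all(nums == sorted(nums) for nums in by_thread.values())
print(by_thread["ORDER001"])  # [1, 2]
```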

Slide 19

In a prior module we showed examples of Trace parameters which control what trace outputs are written to the trace file.  One of the most useful is the TRACEINPUT parameter, which allows tracing a specific thread or threads.  It uses the thread DD Name to restrict output to that thread.

Slide 20

The heading section of the GVBMR95 control report provides information about the GVBMR95 environment and execution.  It includes:

  • SAFR executable version information
  • The system date and time of the execution of the SAFR process.  (This is not the “Run Date” parameter, but the actual system date and time)
  • The GVBMR95 parameters provided for this execution, including overall parameters, environment variables, and trace parameters
  • zIIP processing status information

Slide 21

The remainder of the control report shows additional information when running in parallel mode:

  • The number of threads executed, 5 in this example, is shown
  • The number of views run is not simply a count of the views in the VDP, but of the replicated ES Set views.  In this example, because view 19 reads 5 sources it counts as 5, which plus views 147 and 148 equals 7
  • The greater the number of views and threads run, the larger the size, in bytes, of the generated machine code programs
  • The input record count for each thread is shown, along with its DD name.  Because GVBMR95 ran in full parallel mode, the record counts for the threads are the same as the record counts for each input file.  Later we’ll explain how these numbers can be different
  • The results of each view against each input DD Name (thread, in this case) are shown, including lookups found and not found, records extracted, the DD Name those records were written to, and the number of records read from that DD Name.  Certain additional features, explained in a later module, can cause the record counts read for a view to differ from the record counts for the input file

Slide 22

Because SAFR allows sharing output files across threads, it is possible to have many views writing to a single output file.  The outputs may either all be the same record format, or differing formats as variable-length records.  This allows complex view logic to be broken up into multiple views, each containing a portion of the logic for a particular type of record output.

Slide 23

This control report shows the results of changing our views to write their outputs to one file.  The total record count of all records written remains the same.  They are now consolidated into one single file, rather than 3 files.

Similar to the Trace output, the order of the records written to the extract file by two or more threads is unpredictable. 

As we’ll discuss in a later module, there is a performance impact for multiple threads writing to a single output file.  Threads ready to write a record may have to wait, and the coordination of which thread should wait consumes resources.  Therefore it can be more efficient to limit the amount of sharing of output files when possible, and combine files (through concatenation, etc.) after SAFR processing.
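A sketch of why shared output files cost coordination (illustrative Python, not SAFR internals): every writer must serialize on a lock, so threads ready to write may wait.

```python
# Three writer threads appending to one shared "extract file" through a lock.
import threading

lock = threading.Lock()
shared_file = []                       # stands in for one shared extract file

def writer(records):
    for r in records:
        with lock:                     # only one thread may write at a time;
            shared_file.append(r)      # the others wait here

threads = [threading.Thread(target=writer, args=([i] * 100,))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared_file))  # 300: nothing is lost, but writes were serialized
```

Giving each thread its own output file removes the waiting entirely, which is the efficiency argument made above for combining files after SAFR processing.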

Slide 24

In this execution, we have included additional views, reading the same input event files, but which are Format Phase views.  The rows shown in red are added to the prior control report as additional outputs from the single scan of the input event files.  This also demonstrates how the standard extract files can be shared across multiple threads.

Slide 25

SAFR provides the ability to control the overall number of threads executed in parallel without changing the views.  The GVBMR95 Disk Threads and Tape Threads parameters specify how many threads GVBMR95 should execute in parallel.  Disk threads control the number of threads reading input Event Files on disk, as specified in the SAFR Physical File metadata; tape threads control those for input Event Files on tape.

The default value for these parameters is 999, meaning effectively all threads should be executed in parallel.  If one of the parameters is set to a number less than the number of input Event Files of that type, SAFR will generate only that many threads.  GVBMR95 will then process multiple event files serially within those threads: as one event file completes processing, the thread examines whether more event files remain to be processed and, if so, processes another event file under the same thread.
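The governor behavior can be sketched as a small worker pool (illustrative only; the parameter and file names follow this module's example):

```python
# With the thread limit below the number of event files, each thread takes
# the next unprocessed file when it finishes one, so remaining files are
# processed serially within the limited set of threads.
import queue
import threading

DISK_THREADS = 1                       # governor set to 1: a single thread
files = queue.Queue()
for name in ["ORDER001", "ORDER002", "ORDER003", "ORDER004", "ORDER005"]:
    files.put(name)

processed = []
def thread_main():
    while True:
        try:
            f = files.get_nowait()     # take the next remaining event file
        except queue.Empty:
            return                     # no files left: the thread finishes
        processed.append(f)            # stand-in for scanning the file

workers = [threading.Thread(target=thread_main) for _ in range(DISK_THREADS)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(processed))  # 5: one thread read all five event files serially
```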

Slide 26

The GVBMR95 control reports show the results of the thread governor.  In the top example, the disk governor parameter is shown, along with the number of records read by each thread.  Because each thread processed only one event file, and all ran in parallel, the records read from each event file and by its thread are the same.

In the bottom example, the thread governor has been set to 1, meaning only one thread will be run.  In this instance the control report shows that the total records read for the thread equal the total records read for all event files, because this one thread processed each event file serially until all were complete.

Slide 27

The above highlights the key z/OS statistics often tracked for GVBMR95 runs.  These include CPU time, elapsed time, memory usage, and I/O counts.

The top set of statistics are from the run with parallelism, the bottom from a run using the Thread Governor with no parallelism.  The impact of parallelism can be seen in the difference between the elapsed time for a job with parallelism and one without. Parallelism cut the elapsed time in half. 

The Extract Program, GVBMR95, is subject to all standard z/OS control features, including job classes, workload class, priority, etc.  In other words, even with parallel processing, the z/OS job class or priority may be set so low, and the other workloads in higher classes on the machine may be so significant, that effectively no parallel processing occurs even though GVBMR95 instructs z/OS to execute multiple threads.  Each of these threads receives the same job class, priority, etc. as the entire Extract Phase job.  In this way the operator and systems programmer remain in control of all SAFR jobs.

Slide 28

This module described SAFR Parallel Processing. Now that you have completed this module, you should be able to:

  • Describe SAFR’s Parallel Processing feature
  • Read a Logic Table and Trace with multiple sources
  • Set Trace parameters to filter out unneeded rows
  • Configure shared extract files across parallel threads
  • Set thread governors to conserve system resources
  • Debug a SAFR execution with multiple sources

Slide 29

Additional information about SAFR is available at the web addresses shown here. This concludes Module 19, Parallel Processing.

Slide 30

This page does not have audio.

Practical s/390 Parallel Processing Option for Data Warehouses: 1995

by Richard K. Roth, Principal Price Waterhouse LLP, Sacramento, California

November 8, 1995

[More about this time can be read in Balancing Act: A Practical Approach to Business Event Based Insights, Chapter 20. Parallelism and Platform.]

Why Parallel Processing?

Data warehouse applications differ in a basic way from traditional transaction processing and reporting applications:

  • Transaction processing applications usually process a day’s or a month’s activity, then archive the detail and clear out the files to prepare for the next day’s or month’s activity.  Reporting functions for these applications are primarily limited to supporting the operational and clerical aspects of the narrow functional area addressed by the application for just the days or months that are open.  Even for “high” volume applications (e.g., call detail billing) the volume never really gets “large” because of the archiving process.
  • Data warehouse applications tend to deal with “large” volumes of data since their primary purpose is to perform a retrospective analysis of the cumulative effects of transaction processing over time.  Some uses for data warehouses have similar characteristics to transaction processing applications in that specific requests may only require access to small amounts of data
    (e.g., an individual customer’s profile and particular month of call detail activity).  However, most analytical uses of data warehouse information require scanning large partitions of the warehouse to identify events, or customers, or market segments that meet particular criteria.   It is in the scanning kinds of uses where the basic architecture for building transaction processing systems breaks down when applied to data warehouse applications.

The basic architecture for most transaction processing applications relies on indexed access to a few records in large files for purposes of validation or update (DB2 primarily is an indexed access method).  The advantage is that if you only need to access or update a few records, and you can find the records needed through an index, a lot of time can be saved by touching only the records required rather than searching through all the records just to find the few that you want.  In general, indexed access works at about 200,000 bytes per second (about an hour and a half per billion).  So, as long as the transaction processing or reporting involved requires only single billions of bytes to be touched once per processing cycle, indexed access is sufficiently fast for the purpose.

Data warehouses, however, can have tens, hundreds, or thousands of billions of bytes (gigabytes).  In addition, the more detailed the data warehouse, the more difficult (impractical) it is to keep indexes on all the fields that might be needed to find the records.  Consequently, using indexed access methods to process data warehouses could take hours or days, depending on size and arrival rate of processing requests.  And the practical effect in many cases is that, under the covers, the indexed access method ends up doing a sequential scan of the files.

Sequential access methods are the basic alternative to indexed access methods.  The advantage of sequential access is that it is very fast (1-20 million bytes/second depending on hardware and software configuration) compared to indexed access.  Since much of the processing that goes on with a data warehouse is sequential anyway, it makes sense to take advantage of this speed where possible.  Further, if the processing can be organized so that multiple sequential table scans can be initiated in parallel[1] rather than serially, enormous amounts of data can be scanned in a relatively short period of time (at 4 million bytes/second, a gigabyte can be read in about 4 minutes; if 40 of these sequential threads were started in parallel, half a trillion bytes could be read in an hour).  The mathematics of parallelism really add up for data warehouses.
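The paper's arithmetic checks out; here is a quick verification using the rates quoted above:

```python
# Rates quoted in the text: indexed access ~200,000 bytes/s; one sequential
# scan ~4,000,000 bytes/s; 40 sequential threads started in parallel.
GB = 1_000_000_000

indexed_rate = 200_000
seq_rate = 4_000_000
threads = 40

hours_indexed_per_gb = GB / indexed_rate / 3600
minutes_seq_per_gb = GB / seq_rate / 60
bytes_per_hour_parallel = seq_rate * threads * 3600

print(round(hours_indexed_per_gb, 1))  # 1.4  ("about an hour and a half")
print(round(minutes_seq_per_gb, 1))    # 4.2  ("about 4 minutes")
print(bytes_per_hour_parallel)         # 576000000000 (~half a trillion bytes)
```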

What are the practical alternatives for implementing parallel processing applications?

There are two basic approaches to achieving parallelism in processing:

  • Specialized hardware architectures (i.e., massively parallel processing “MPP” computers) can be used that initiate parallel scans on separately paired CPU and disk assemblies (known as “shared nothing environments”).  Teradata and Tandem are examples of two companies that manufacture computers in this class.
  • Software can be used to initiate multiple parallel scans on computers that share a common pool of disk and tape devices (two groups known as mainframes and symmetric multi-processor “SMP” machines).  IBM, AMDAHL, and Hitachi are companies that manufacture mainframes.  Hewlett Packard, Silicon Graphics, and Sequent are examples of companies that manufacture SMP machines.

Until recently, the MPP hardware approach was almost always the most cost-efficient way to achieve parallelism for data warehouse processing.  MPP machines are manufactured using CMOS processor technology and work with disk drives that historically have been less costly than a corresponding mainframe configuration.  In addition, the only software available for the mainframe that really exploited its parallel processing, multi-tasking, high data bandwidth capabilities was CICS, DB2, and JES, none of which is particularly useful for initiating multiple parallel sequential table scans.  Implementing the software alternative on a mainframe platform required developing custom systems-level software for the purpose, which some large organizations facing this problem have done.  SMP machines continue to be designed primarily for on-line processing and characteristically demonstrate very low data bandwidth in practice, which makes them generally unsuitable for any large application of this type.

MPP hardware is a workable solution, but it also presents some technical and economic challenges.  In many cases, the high volume data that will populate the warehouse already is resident on the mainframes where it was captured in the first place.  Duplicate copies of all event, reference, and code-set files must be maintained in both environments.  Interactive applications often have to be maintained in both environments to access the same basic data.  Businesses with the fixed cost of a large mainframe infrastructure behind them are reluctant to take on the burden of a second environment, including new skills requirements.  And it costs millions of dollars to buy capacity sufficient for making batch processing windows, which for the most part sits idle because of the limited purpose nature of the technology.

Over the last several years, however, mainframe CPU, disk, tape, and software prices have come down dramatically.  In the early 1990s, mainframe CPU power cost over $100,000 per MIPS (million instructions per second).  Today, mainframes are being made using the cheaper CMOS technology, and mainframe CPUs can be purchased in the $14,000 per MIP range.  High performance tape and disk devices are available at comparable prices regardless of platform.

Price Waterhouse also has developed parallel processing software that permits parallel sequential table scans to be initiated in mainframe data warehouse applications without requiring systems programmer intervention.  This software is known as Geneva V/T and generally offers 10-50 fold CPU consumption and throughput gains over baseline COBOL and DB2 techniques.  The combination of a more than five-fold reduction in the price of the basic hardware and a ten-plus factor increase in throughput and compute efficiency from the Geneva V/T parallel software has turned around the basic cost/performance calculations in favor of the mainframe for these types of data warehouse applications.

How can you tell which approach is best for a given situation?

Evaluating a mainframe software approach first for exploiting parallel processing offers some substantial advantages.  Geneva V/T can support processing from tape as well as disk, which means that historical event data can be accessed directly from existing datasets without the need to buy additional hardware just to get started.  This permits conducting a mainframe proof-of-concept demonstration to verify that the data being considered for warehousing truly is valuable for resolving pressing problems and to prove the processing metrics for the case at hand in the existing production environment.

Assuming that access to the data proves to be truly valuable, use in an operational context may necessitate the purchase or allocation of additional compute and disk/tape resources.  But, because new resources simply will be added to the existing SYSPLEX configuration in most cases, peak load requirements probably can be accommodated through scheduling rather than outright purchase of excess capacity necessary to get through the humps.  This most likely will result in a substantially lower overall net hardware purchase compared with what would be required if an MPP hardware approach were adopted.  In addition, during periods when high volume data warehouse files are not being processed, whatever additional hardware is purchased is available for other applications experiencing peak-load demands.

What does a top-down picture of this mainframe environment look like?

A graphic view of what we refer to as the MVS Operational Data Store Architecture is provided in the diagram below:

DB2 provides a complete relational database environment for transaction processing applications and intermediate volume on-line and batch extract/reporting processes.  Where high volume parallel processing techniques are appropriate, Geneva V/T provides the mechanism for accessing on-line DB2 and VSAM files in place and sequential files on either disk or tape.  Where interactive queries are required, DB2 provides the framework for interactive execution.  Where volume problems dictate minimizing the number of times datasets are accessed, Geneva V/T provides for a generalized single pass architecture where multiple queries are resolved simultaneously in one pass of the data.

The Geneva V/T disk or tape option is especially important since new tape technology is now available with 30 gigabyte (compressed) capacity per volume and I/O transfer rates in the 20 megabytes per second range (most high performance disk transfer rates are in the 6 megabytes per second range).  This means that for high volume event files where sequential processing is the preferred method, dramatically cheaper tape becomes the preferred medium for storage, not just a compromise forced by economics.  This can offer significant potential for dramatic savings over an MPP hardware approach that would dictate DASD-based secondary storage.

Recently released new features of MVS also make the mainframe environment the most attractive choice as the backbone server for a client/server network.  Especially important facilities are the LANRES and LANSERVER MVS functions.  Files generated by selection functions, either directly from DB2 or Geneva V/T, can be stored in a common workstation-oriented file repository on mainframe attached DASD or automatically downloaded to designated servers on the WAN/LAN.  EBCDIC/ASCII translations are performed automatically and NetWare users can access mainframe storage devices just like any other NetWare disk volume.  Options for accessing mainframe files through the Internet are even available.

*          *          *

In summary, the mainframe world has changed.  Instead of being something to run away from because of cost, absence of software, or lack of connectivity and GUI options, it now is in principle the server platform of choice.  No doubt there will be many situations where other server or specialty processor platform choices will be appropriate.  But, instead of “getting off the mainframe” being the going in position, “exploiting the already existing mainframe infrastructure” should be the common sense point of departure when evaluating technical platform alternatives for parallel processing problems.

[1] DB2 does take advantage of some parallel processing techniques under certain circumstances.

Original Image

The following are images of the original paper.