Analytics Archives - GenevaERS

Consolidated Data Supply Chain: 2016

By Business, Event Based Principles, Technical

The following slides were developed while consulting at 12 of top 25 world banks about financial systems from 2013-2016. Although not specific to GenevaERS, it delineates the causes of the problems in our financial systems, and the potential impact of GenevaERS supported systems and the Event Based Concepts.

Additional details can be read in this white paper: A Proposed Approach to Minimum Costs for Financial System Data Maintenance, and further thinking on the topic in this blog and video presentation from 2020: Sharealedger.org: My Report to Dr. Bill McCarthy on REA Progress.

SAFR and A Smarter Planet for Financial Reporting: 2011

By Business, Event Based Principles, History

The following was written as part of a renewed sales initiative for SAFR, building upon the latest financial ledger projects.

This write-up written for IBM sales efforts, was used as the summary of Balancing Act: A Practical Approach to Business Event Based Insights, Chapter 2: The Problem and Chapter 3: The Solution.

The Problem

The recent financial crisis has exposed the systemic problem that the world’s largest financial institutions cannot adequately account for and report on liquidity, positions, currency exposure, credit, market, and interest rate risk and product, customer and organizational performance.  The CFO plays a critical role in correcting this problem by leveraging the financial data they already control, as well as leveraging scale to take out cost.  But even industry insiders do not realize that financial institutions suffer a unique set of domain problems when it comes to financial reporting.

Current financial reporting systems are antiquated and very inefficient. They were designed decades ago to simply track flow of capital, revenue and expenses at the company and departments levels.  The lack of transparency is evident in the increasing costs of the finance function with few benefits to show for the investment.  Sarbanes Oxley and other regulations have proven ineffective at getting at the root of the problem and the resulting financial meltdown regulations may well prove similarly ineffective.  These pressures create diseconomies of scale which affect the largest institutions the most.

For the most part, existing systems deliver accurate results in summary, but the increase in transparency requires line of site to the underlying causes of those results. Consider if your personal bank account statement or credit card bill only presented the change in balance between periods, but provided no list of transactions. When the statement is as expected, further detail may not be needed. But when the balance is in question, your first response is ‘why’ and you immediately want to see the transaction detail.  The same issues are at stake when managing the finances of the enterprise – with the associated cost and consequences considerably higher!  A single instance of financial restatement has cost some organizations hundreds of millions of dollars to correct, not counting lost market valuation.

Currently 90% of the money supply in mature markets is represented by digital records of transactions and not hard currency. It’s no wonder that that the volume of electronic finance records being kept has exploded compared to when the systems were first created. Yet our approach to these demands has not been to automate the process of keeping and accessing the details of the transactions. Almost all employees in today’s financial institutions are involved in capturing and coding financial details in some way, and a large number of non-finance employees are involved in the investigative process to locate the additional detail so often required. The effort for this manual intervention is incredibly inefficient and costly.

As we see all around us, computing capacities have increased by several orders of magnitude since these finance systems were designed. However, reporting processes have grown organically as a system of transaction summaries in order to continue to bridge multiple financial systems – but have lacked a single unified approach. This has meant that for the most part the business of financial reporting has not benefited from the increase of computing capacities available today.

The Solution

A Smarter Planet is founded on financial markets that provide for greater transparency and comprehension of the financial reporting by bank and non-bank entities, allowing the markets to react to conditions in more informed, less reactionary ways. IBM has spent 25 years refining an approach to this for financial institutions. The IBM® Scalable Architecture for Financial Reporting™ (SAFR) system provides financial reporting that is built bottoms up from the individual business event transactions to provide the finest grained views imaginable.

By harnessing today’s computing power and straight through processing approach, the details behind summary data can be made available in seconds rather than days or weeks. Providing nearly instant access to the highest quality financial data at any level of granularity will eliminate the duplicative reporting systems which tend to capture and produce summaries of the same business events for many stakeholders and reporting requirements.

More importantly, it will automate the hidden work of armies of people who are required to investigate details and attempt to explain results, or attempt to reconcile the disparate result of these many reporting systems—a truly wasteful activity caused by the systems themselves. Keeping the details in a finance system that can serve these needs allows for increased control, quality and integrity of audit efforts rather than dissipating them.

Some may question how much detail is the right level of detail?  Others may suggest this is too radical a change in a mature, understood and tested set of systems.  IBM experience with some of the largest financial services companies suggests that building a finance system.  based on the requirement to instrument the most granular level of transaction detail immediately stems the tide of increasing costs, lowers a variety of risks and can be a key driver of transformation of the banks ability to become more agile.  In time this approach begins to provide economies of scale for reporting.

SAFR is: (1) an information and reporting systems theory, (2) refined by 25 years of practical experience in creating solutions for a select group of the world’s largest businesses, (3) distilled into a distinctive method to unlock the information captured in business events, (4) through the use of powerful, scalable software for the largest organization’s needs, (5) in a configurable solution addressing today’s transparency demands.

The Theory

Companies expend huge sums of money to capture business events in information systems.  Business events are the stuff of all reporting processes.  Yet executives report feeling like they are floating in rafts, crying “Data, data everywhere and no useful information.”  Focusing reporting systems on exposing business events combinations can turn data into information.

The Experience

Although analysis of business events holds the answers to business questions, they aren’t to be trifled with, particularly for the largest organizations.  Reporting processes—particularly financial reporting processes—accumulate millions and billions of business events.  In fact, the balance sheet is an accumulation of all the financial business events from the beginning of the company!  Such volumes mean unlocking the information embedded in business events requires fundamentally different approaches.  The 25 years of experience of building SAFR in services engagements has exposed, principle by principle, piece by piece, and layer by layer the only viable way.

The Method

This experience has been captured in a method of finding and exposing business events, within the context of the existing critical reporting processes.  It uses today’s recognized financial data like a compass pointing north to constrain, inform, and guide identification of additional reporting details.  It facilitates definition of the most important questions to be answered, and to configuring repositories to provide those answers consistently.  It also explains how to gradually turn on the system without endangering existing critical reporting processes.

The Software

The infrastructure software, a hard asset with hundreds of thousands of lines of source code and feature set rivaling some of the best known commercial software packages, is most often what is thought of when someone refers to SAFR. 

The Scan Engine is the heart of SAFR, performing in minutes what other tools require hours and days to do.  The Scan Engine is a parallel processing engine, generating IBM z/OS machine code.  In one pass through a business event repository it creates many business event “views,” providing rich understanding.  It categorizes, through join processes, the business events orders of magnitude more efficiently than other tools.  Built for business event analysis, it consistently achieves a throughput of a million records a minute.  It is highly extensible to complex problems. 

SAFR Views are defined in the SAFR Developer Workbench or rule based processes in the SAFR Analyst Workbench or in custom developed applications.  The Scan Engine executed as a scheduled process, scans the SAFR View and Metadata Repository selecting views to be resolved at that time.

The Indexed Engine, a new SAFR component, provides one at a time View resolution through on-line access to Scan Engine and other outputs.  It uses Scan Engine performance techniques.  Reports structure and layout are dynamically defined in the Analyst Workbench.  The Indexed Engine creates reports in a fraction of the time required for other tools.  Its unique capabilities allow for a movement based data store, dramatically reducing data volumes required both in processing and to fulfill report request.

Upon entering Managed Insights, users select parameters to drill down to increasing levels of business events, and perform multidimensional analysis through the Viewpoint Interfaces.  The Insight Viewer enables discovery of business event meaning in an iterative development mode.

The Solution

The SAFR Infrastructure Software has been configured over 10 years for number of clients to provide an incredibly scalable Financial Management Solution (FMS) for the largest financial services organizations. 

The heart of FMS is the Arrangement Ledger (AL).  An “arrangement” is a specific customer/contract relationship.  The AL, a customer/contract sub-ledger, maintains millions of individual customer/contract level balance sheet and income statements.  This incredibly rich operational reporting system supports a nearly unbelievable swath of information provided by scores of legacy reporting systems in summary, with the added benefit of being able to drill down to business event details if needed.  Doing so allows reporting high quality financial numbers by customer, product, risk, counterparty and other characteristics, all reconciled, audited, and controlled.  

AL is fed daily business events typically beginning with legacy general ledger entries and then transitioning to detailed product systems feeds over time.  The business events become complete journal entries at the customer-contract level, including reflecting the impact of time in the Accounting Rules Engine.  Rules are under control of finance rather than embedded in programs in source systems, enabling Finance to react to changes in financial reporting standards, including International Financial Reporting Standards (IFRS).

The business event journal entries are posted by the Arrangement Ledger on a daily basis, while it simultaneously generates additional point in time journal entries based upon existing balances, including those for multi-currency intercompany eliminations, GAAP reclassification and year-end close processing.  It accepts and properly posts back-dated entries to avoid stranded balances, and summarizes daily activity to pass to the General Ledger.  The General Ledger provides another control point for the traditional accounting view of the data.  The Arrangement Ledger detects and performs reclassification keeping the arrangement detail aligned with the summary General Ledger

AL also accepts arrangement descriptive information with hundreds of additional attributes to describe each customer-contract, and counterparty or collateral descriptive attributes, enabling producing trial balances by a nearly unlimited set of attributes, not just the traditional accounting code block.  Extract processes produces various summaries, perhaps ultimately numbering in the hundreds or thousands, to support information delivery for not only traditional accounting but also statutory, regulatory, management, and risk reporting.  The SAFR one pass multiple view capability allows AL to load data, generate new business events, and create extracts all in one process, including loading the incredibly information rich Financial Data Store. 

Information Delivery includes multiple ways of accessing the Arrangement Ledger and Financial Data Store.  The major window is through SAFR Managed Insights.  This parameter-driven Java application provides thousands of different permutations of the data.  It allows drill-down from summaries to lower and lower levels of data without impacting on-line response time.  It allows dynamic creation of new reports and multi-dimensional analysis of Financial Data Store data.  Extract facilities provide the ability to feed other applications with rules maintained by finance.  Other reports provide automated reconciliation and audit trails.

FMS can be tailored to work within an existing environment, including working within the existing security and reference data frameworks.  FMS is often can be a sub-component of an ERP implementation.


This is a financial system architecture for the 21st century. This is the reporting system architecture for the 21st century. Finance transformation starts with finance systems transformation.  Finance systems transformation starts with rejecting the legacy finance systems architecture that provides only summary results.  It is transforming the financial systems—the original enterprise data warehouse—into a system capable of supporting today’s information demands.


Copyright ©2010, 2011, 2015, 2018 by Kip M. Twitchell.

All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the author except for the use of brief quotations in a book review or scholarly journal. Posted by permission.

Digineer Solution Blueprint: 2002

By Business, History, Technical

The following are two solution blueprints by an additional attempted start-up to license Geneva. In this instance, the platform was to be ported to the Intel Itanium processor, putting a very powerful data analysis tool right under the desk of the analyst.

The vision was far sighted, anticipating what happened with cloud and data analytics. But it was ravaged by the technology collapse post Internet bubble.

More about this time can be read in Balancing Act: A Practical Approach to Business Event Based Insights, Chapter 53. Spin-offs and Chapter 1: Introduction


Business Challenge:

Consolidating massive amounts of data from multiple sources to gain an integrated view

Technology Solution:

  • OI5 DataEngine™ from Digineer™
  • Enterprise Hardware Platform
  • Scaleable Intel® Architecture servers

Advanced Transaction Analysis for Business and Government

Crucial data that can help businesses and government operate more efficiently often lies buried within an overwhelming mass of transaction details scattered among numerous computers. Extracting this data using most data mining solutions presents many difficulties, since these solutions lack the flexibility, scope, and sheer processing power necessary to convert uncorrelated data into useful information. The Intelligence Engine solution combines technology expertise from PricewaterhouseCoopers (PwC Consulting), Digineer™, and Intel to help solve this issue. The Intelligence Engine solution spans data repositories across the world, using query-processing techniques to generate data intelligence from large numbers of events and transactions.


Without the ability to look deep into operational processes, large-scale businesses are hindered in their decision-making and organizational goals––essentially running blind. Where data exists, it often resides on disparate servers in a variety of formats, resisting easy access or useful analysis. Collecting, consolidating, and correlating this data requires tools that can be scaled for massive volumes of data, and adapted easily to expose a wide range of information. Challenges to this end include:

  • Huge volumes of transaction data––Many conventional data warehousing and data mining solutions collapse under the weight of millions of daily event-level transactions. A robust solution must marry intelligent query processing with sufficient processing power to meet high levels of activity.
  • Data scattered throughout many repositories––Large organizations frequently store data in a variety of formats in data repositories spread throughout the country––or the world. Finding the key data—the proverbial needle in a haystack—demands a solution scaled to the magnitude of the task.
  • Need to quickly spot trends and patterns––The accelerating pace of modern business makes it necessary to detect and respond to trends and patterns in days rather than weeks. Knowledge-discovery tools must be flexible and adaptable for this purpose.
  • Evolving data requirements––Creating data models and report formats can be a very time-consuming and expensive proposition, and updating them can also require excessive time and effort. Tools to reduce this design effort and the associated maintenance help ensure that changing data requirements can be accommodated in minimal time.


Being able to accurately detect emerging patterns and trends in daily operations is invaluable to any organization seeking to remain competitive under the pressures of rapid change. The Intelligence Engine solution incorporates Digineer™ OI5 DataEngine™ deployed on Intel® Itanium® processor family-based servers to provide fast and flexible analytical data handling across multiple systems. By providing consolidated views of dispersed data in a variety of formats, this solution lets large organizations make fully informed decisions within their industry by assimilating data from every group or department––regardless of platform. The OI5 DataEngine™ successfully bridges dissimilar data structures and geographically separated servers to turn raw data into useful knowledge, providing a robust solution to help guide leaders of large organizations.

To accelerate the process of collecting and consolidating data, Digineer’s toolset simplifies query-processing setup and report generation. PricewaterhouseCoopers has successfully deployed this solution in a wide range of industries, including retail, financial, healthcare, and government.


The Intelligence Engine solution lets any large organization benefit from improved visibility into the heavy volume of daily transactions and events specific to their business––whether retail sales tallies per product for a nationwide distributor, the weather conditions compared to yields at a juice producer’s orange orchards, or quality control data tracked by a steel manufacturer looking for ways to improve processes.

Clients adopting this solution enjoy improved operations management through rapid analysis of transaction data. They also gain deeper insights into key processes within their organization that may have been inaccessible in the past because of the difficulty in collecting and correlating the relevant data.


Supports multiple data formats: The OI5 Data Engine™ provides support for a wide range of data formats, allowing flexible handling of data stored in repositories across many different platforms.

Designed for faster, high-volume transaction processing: The Intelligence Engine solution combines the intelligent pre-processing capabilities of the OI5 DataEngine™ with the architecture benefits of the Intel® Itanium® processor family. The result is a solution that scales well for the extremely large volumes of data involved in transaction processing––producing rapid responses to complex data queries.

Adapts easily to changing data requirements: The solution includes tools that simplify query design and report generation, encouraging companies to create new models for data collection and analysis, while meeting the core data extraction requirements for the organization.

Advanced Business Intelligence and Analytics Solution

Retailers, financial institutions, healthcare companies, and government agencies are collecting more data than ever before. This includes sales data from stores, warehouses, e-Commerce outlets, and catalogs, as well as clickstream data from Web sites. In addition, there is data gathered from back office sources––merchandising, buying, accounting, marketing, order processing, and much more. To pull together information from all these different sources and integrate, process, and analyze it, companies require advanced business intelligence and analytical solutions powered by a robust, high-performance infrastructure.

The Intelligence Engine solution addresses this very need, by combining technological expertise from PricewaterhouseCoopers (PwC Consulting), Digineer™, and Intel®. The query-processing capabilities of the Digineer™ OI5 DataEngine™ combined with the processing power of the Intel® Itanium® processor family, enables PwC Consulting to provide companies with the ability to quickly consolidate and integrate massive amounts of data from multiple sources––giving key decision makers a single view of vital business information.

Optimized to run on Intel® Itanium® processor family-based platforms, the Digineer™ OI5 DataEngine™ delivers a reliable, high-performance mechanism for processing and querying vast quantities of raw, event-level data through a software algorithm that employs intelligent pre-processing for improved efficiency. The Explicitly Parallel Instruction Computing (EPIC) technology and 64-bit addressing capabilities found in Intel® Itanium® architecture, deliver new levels of performance when applied to data mining and knowledge discovery operations. Powerful Intel® Itanium® processor familybased servers provide the massive computing power needed to run and consolidate memory-intensive databases across channels. In addition, integrating different processes and organizations is easier and more cost effective with a modular, open environment based on industry standards and Intel® Architecture.


Today’s competitive advantage will go to companies that make the right decisions at the right time. To succeed, these companies must be able to quickly consolidate and integrate massive amounts of data from multiple sources, into a single view of their business that can increase marketing and operational efficiency.

Turning Data into Dollars, a May 2001 Forrester report, indicates that executives charged with setting the strategic direction of their organization understand the value of business intelligence and analytics, yet are faced with the following challenges:

  • Mountains of data create processing bottlenecks—The sheer volume of data collected–– often involving terabyte-scale databases––creates the potential for processing bottlenecks in those systems not designed to effectively cope with this quantity of information. The massive data quantities produce such a heavy processing load that many existing solutions rapidly reach throughput limitations. Without sufficient processing power and highly optimized analytical tools, organizations must often relinquish access to important subsets of their data, or must add expensive hardware and software to increase system performance.
  • Analyzing information from scores of data repositories presents a challenge— Organizations cannot easily access and analyze their operational information because of the difficulty in extracting large quantities of data from multiple data repositories spread throughout the organization. Existing solutions lack the flexibility and robustness for effective data access and extraction.
  • Inflexible models fail to support evolving data streams—Organizations spend significant amounts of development time creating data models, collecting data, defining reports, and constructing consolidated data warehouses. In highly dynamic environments, these systems may rapidly lose relevancy and require ongoing adjustments at the core software or hardware level. Updating these core data model components to correspond with evolving knowledge discovery requirements necessitates a significant investment in time and effort.
  • New analytical views must be developed quickly and on demand—Different departments and groups within an organization have unique needs for the information mined from available data, and these needs change frequently, often on an urgent basis. A practical solution must have the capability of easily providing new analytical views from the data to respond to immediate needs from diverse groups.
  • Extracting key information is inordinately difficult—The ability to drill down through individual transactions and extract the useful patterns and trends is critical to knowledge discovery and analysis. Most existing systems do not adequately provide this capability when there are large quantities of data stored on multiple systems, or in one large system where data access is cumbersome.

The retail environment faces several challenges that can take advantage of a business intelligence solution like Intelligence Engine. For example, in a typical retail environment, data gathered from stores, Web sites, catalogs, as well as suppliers and warehouses provide valuable kernels of knowledge that can increase profitability. Consolidating and analyzing massive amounts of memory-intensive databases requires a robust infrastructure. As a result, many retailers mine only a subset of their available information. To recognize buying patterns over time, massive volumes of sales transactions must be analyzed and correlated by sales channel, region, individual store, product, and customer preference. Within the typical retail environment, this data resides in a number of separate systems.


The Intel® Architecture-based Intelligence Engine solution provides companies with a fast, affordable, and flexible way to consolidate, integrate, and analyze massive, memoryintensive data across multiple systems.

The Digineer™ OI5 DataEngine™, operating on a platform powered by the Intel® Itanium® processor family, performs high-speed analytical processing of massive amounts of event-level data. The data engine can simultaneously write large numbers of data extracts to downstream data warehouses, online and offline reporting tools, and real-time analytics applications. The 64-bit addressing capabilities of the Intel® Itanium® architecture provides the robust computing power needed to run and consolidate memory-intensive databases across channels. The extraordinary performance of the Intel® Itanium® processor family, coupled with the EPIC features, provide an architecture fully capable of adding capacity as needed, and providing heightened performance.

Historically, retailers have addressed the scale and performance problem by creating large data stores using conventional technology. As the volume of data that must be captured and processed increases––often exponentially––processing the massive volumes of data creates a bottleneck and prevents timely access to the detailed information required to run the business. Once this critical point has been reached, retailers must choose between adding expensive hardware to boost performance, or sacrificing the breadth and depth of captured data.

The OI5 DataEngine™ offers an attractive alternative to the scale and performance problem. Retailers use the solution to analyze key retail metrics, such as customer profitability, recency, frequency, volume analytics, supplier performance, and enterprise financials. The OI5 DataEngine™ derives these analytics from a collection of customer relationship management (CRM), supply chain management (SCM), and point-of-sale (POS) databases–– even if the data is distributed among several different machines. The performance provided in these kinds of implementations costs much less than equivalent systems.


The Digineer™ OI5 DataEngine™ delivers:

  • Speed—Performs fast and efficient processing of massive volumes of transactional data, resulting in a consolidated data store that accurately represents the organization’s operational dynamics.
  • Scale—Integrates data from multiple, disparate silos without requiring any modifications or enhancements to the existing data sources.
  • Depth—Includes powerful data query capabilities that efficiently and economically enable data mining of massive amounts of transaction-level event data.
  • Breadth—Provides an outstanding price and performance ratio, making the solution accessible to medium- and large-sized organizations.

PwC Consulting fully integrates the OI5 DataEngine™ into the client’s business processes and technologies. PwC Consulting’s background and experience in business intelligence and analytics produces a solution that enables organizations to rapidly collect, aggregate, manage, analyze, filter, customize, and distribute information throughout their value chains. By providing consolidated views of the dispersed data––rapidly and cost-effectively–– organizations can identify trends and quickly respond to changing needs and situations, thus improving overall business performance.


The Intelligence Engine solution runs efficiently on a 4-way Intel® Itanium® processorbased server. Supported operating systems include (Win64) Microsoft* Windows* 2000 server (Q3’02), Linux* (Red Hat*, version 7.2 for the Itanium® processor), and HP-UX* version 11 (Q4’02).

Components of the Digineer™ OI5 DataEngine™ rely on the following technologies:

  • Core Data Engine™—C++ with multi-threading
  • Result Set Viewer—Java* using Swing*
  • MetaData Manager—Visual Basic*
  • MetaData Database—Any ODBC-compliant database

The OI5 DataEngine™ runs optimally using Intel® EPIC architecture, taking maximum advantage of the Itanium® processor family’s multiple memory registers, and massive on-chip resources. The 128-bit general and floating-point registers excel at supporting the complex operations involved in analytics. Tuning the implementation to achieve maximum performance through parallelism, and tapping into the power of the Itanium® processor family’s multiple execution units makes it possible to rapidly process and analyze the large volumes of data necessary to accomplish this solution.

PwC Consulting solutions employ a variety of technologies using Digineer™ OI5 DataEngine™. By being able to create data stores through the importing and exporting of data files, this solution makes it unnecessary to devise complex, custom integration schemes. Support for each of the following sources and targets is provided:

  • Source databases
    • Oracle*
    • BM* DB2*
    • Sybase*
    • Microsoft* SQL Server*
    • IMS*
    • IDMS*
  • Target analysis engines
    • EMC* Symmetrix Manager*
    • SAP* portals
    • SAS*
    • IBM* Intelligent Miner*
    • Cognos* Powerplay* and Improptu*
    • Custom applications built in Visual Basic, Java, C++, and HTML using Microsoft Internet Information Server* (IIS) and Transaction servers
  • Digineer™ OI5 DataEngine™ has been deployed for use with SAP* and PeopleSoft* ERP packages as sources for data and targets for additional analysis. Future plans include development of interfaces for MQSeries*, and clickstream data sources, scheduled for release by mid-2003.


The Intelligence Engine solution provides rapid analysis of data from multiple sources to gain a better picture of emerging trends and patterns. For a strong competitive advantage, companies will need to develop a quantitative, as well as qualitative, understanding of their business. The Intelligence Engine solution complements existing data warehouse and business intelligence solutions. It can also maximize the performance and extend the functionality of systems already in place.

The key vertical integration areas in which the Intelligence Engine solution can provide significant value include:

  • Retail—Domestic and international (Western European) retailers with annual sales of more than $1 billion, and large-scale customer, supplier, point-of-sale, and operational databases.
  • Manufacturers—Mid-market manufacturers with annual sales of more than $500 million and one or more large silo databases.
  • Financial Services—Financial services organizations with multiple geographic locations and stringent operational and regulatory reporting requirements.
  • Healthcare—Pharmaceutical and biotechnology companies that depend on data from multiple systems to manage and optimize their operations.
  • Government—Agencies under pressure to provide more extensive reporting of data, over which the organization has little governance. In addition, data sources continue to propagate, grow in size, and are increasingly complex and divergent from each other. 5


The PwC Consulting Intelligence Engine solution using Digineer™ OI5 DataEngine™ adapts well to environments where operational data is distributed among several different systems––sometimes in different data formats––and where the report requirements change frequently. Implementing the Digineer™ OI5 DataEngine™ on servers powered by the Intel® Itanium® processor family offers these benefits:

  • Provides faster high-volume transaction processing. Supplies the massive computing power needed to run and consolidate memory-intensive databases across channels. This enables key decision makers and business owners to obtain accurate business intelligence and analytics, and quickly gain deeper insights into the dynamics of their operations.
  • Supports multiple data formats. Supports multiple applications and loads required to consolidate and analyze data throughout the company. This enables the solution to easily access existing data stores and business intelligence solutions without modifications or retrofitting.
  • Adapts easily to changing data needs. Accommodates an organization’s evolving data needs and models so that new informational requirements can be met with minimal investment of time and money. Integrates different processes and organizes these in an easier and more cost-effective manner with a modular, open environment based on industry standards and Intel® Architecture. Solutions based on the OI5 DataEngine™ as deployed on servers powered by the Intel® Itanium® processor family provide a highly competitive price and performance ratio. 6


Maintaining a current and accurate view of an organization’s operational status requires the capability of locating and analyzing information from a very large data store in a short period of time. This problem can be compounded when data resides in disconnected data silos, or has been extracted from a variety of non-correlated reports. Reducing the scope of the data analyzed in order to save time or expedite processing can create an inaccurate view of current operations. Processing bottlenecks and inefficiencies often result from solutions implemented on platforms without a tuned architecture capable of handling data quantities that reach terabyte levels. While increasing the quantity and complexity of a solution’s hardware and software may improve performance, it also raises costs. A better approach is to deploy an implementation on a platform designed to accommodate massive volumes of data with processing power that is suitable to the task. The Intelligence Engine solution using the OI5 DataEngine™ on a platform powered by the Intel® Itanium® processor family performs very high-speed, analytical processing of massive amounts of event data. This event data can be dispersed across any number of data sources, including enterprise resource planning (ERP), SCM, CRM, and legacy systems–– without reducing the effectiveness of the solution. Processed data can then feed downstream data repositories, analytical and business intelligence tools, and executive reporting systems, as shown in the following diagram.

The OI5 DataEngine™ uses a patented software approach that leverages the latest advances in microprocessor technologies featured in the Intel® Itanium® architecture, including 64-bit addressing and parallel processing capabilities.

The open architecture of the OI5 solution allows data to be extracted and transformed directly from legacy systems. OI5 provides a massively scaleable, high-performance solution for the most demanding intelligence and analytics applications


Define Metadata

The first step to using OI5 is to define the metadata, including record layouts, files or tables, and relationships between tables. The OI5 design removes the complexities of understanding table relationships. Common keys and other relationships between tables only need to be defined once. Instead of having to understand the “where” statement in SQL to add department names to a sales report, OI5 lets users select the department name and then add it to their report. OI5 automatically locates the correct data.

Define Views

Next, define the views (queries):

  • Select the transaction files to read
  • Specify the appropriate output format
  • Indicate the appropriate filters to apply
  • Indicate the data to sort or summarize
  • Design columns, and indicate any calculations to perform
  • OI5 does the rest

Run Processes

During data file output, OI5 produces multiple files in one execution of the process. This feature reduces the time needed to supply data to other systems. OI5 views can reformat data,perform logic operations, and make calculations. With the open API, custom functions can be created for other types of transformation. These features make OI5 a powerful data extraction and transformation tool, as well as a flexible application engine.


The OI5 DataEngine™ helps large organizations meet the scaleability and performance challenge of assessing key organizational operations when the number of events being measured and analyzed becomes too great. The solution supports queries directed to extremely large volumes of event-level data in a single pass.

Data exposed by the OI5 DataEngine™ can reside in different physical locations on multiple systems, including ERP, CRM, SCM, and POS systems. The front end of the OI5 DataEngine™ maps these “in-place” data sources into the data engine’s MetaData definition modules. The engine then uses the MetaData definition to intelligently pre-process the queries before generating the data views and outputting data stores for analytical processing. The data output can either be viewed with the OI5 Viewer or directed to other business intelligence and reporting systems.

The core, patented technology of OI5 is configured for refreshing and managing a largescale operational data store (ODS) derived from multiple data sources. As shown in the following diagram, it includes three integrated modules that deliver the full power of the OI5 technology and its massive data throughput: OI5 MetaData Manager, OI5 View Manager, and OI5 Core Processor. The OI5 DataEngine™, as implemented on servers powered by the Intel® Itanium® processor, runs on Microsoft* Windows* 2000 Server (64-bit) and HP-UX systems.

The parallel query-processing capabilities of the system follow a defined software algorithm that is optimized for speed. The algorithm first pre-processes all the data queries based on the data engine’s knowledge of where and how the event data is structured and formatted (determined by the MetaData definition in combination with Read Exits), as well as the required output data stores (known as Views). Optimized code generated by this query pre-processing step gets sequenced into a logic table. When the logic table is processed, the OI5 DataEngine™ spawns parallel query instructions that minimize instruction path lengths to the available microprocessors. This explicitly parallel query-processing directly takes advantage of the EPIC architecture of the Itanium® processor family. The OI5 DataEngine™, using instruction path optimization and single-pass data queries, offers unmatched query-processing and exceptional handling of the data view output.


The OI5 DataEngine™––the basis of this data warehouse analytics solution––is optimized for the Intel® Itanium® processor family. The following diagram shows how information from existing data warehouses is delivered to the OI5 DataEngine™ for advanced analytics processing. The OI5 MetaData and View Manager applications, running on a workstation powered by the Intel® Pentium® 4 processor, define the relations and views for the associated processing and analytics. Processed data can be directed to Web portal solutions, as well as custom applications and reports, or supplied in a form for re-entry into an existing data warehouse.

The Web portal is powered by the Intel® XeonTM processor family. The OI5 DataEngine™ is powered by a 4-way Intel® Itanium® processor-family based server.


PwC Consulting has used the OI5 DataEngine™ in numerous scenarios that have been designed and deployed for clients. This solution achieves favorable results in knowledge discovery applications that involve massive numbers of transactions. Running on servers powered by the Intel® Itanium® processor family, the underlying OI5 software data engine performs very high-speed, batch-analytical processing of large volumes of event data–– while simultaneously writing multiple data views to downstream data warehouses, online and offline reporting tools, and real-time analytical processing applications. When used in combination with Intel® Itanium® processor family-based platforms, the OI5 DataEngine™ offers an exceptional price and performance value.


For general information about the products described in this data sheet, visit: www.digineer.com


If you have a specific question about implementing this solution within your organization, contact your Intel representative or e-mail us at:




Digineer, the Digineer logo and OI5 DataEngine are trademarks or registered trademarks of Digineer, Inc.

Copyright ©2002 Digineer. All Rights Reserved.

Intel, the Intel and Intel Inside logos, Pentium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Copyright ©2002 Intel Corporation. All Rights Reserved.

*Other names and brands may be claimed as the property of others. Information regarding third party products is provided solely for educational purposes. Intel is not responsible for the performance or support of third party products and does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these devices or products.

250790—001 and 250789—001

Global Investment Bank Analyzes Trade Level Detail: 1999

By Business, History, Technical
[An investment bank arm], a division of [a major Swiss bank], chose Geneva ERS to manage its Detail Repository, a central component of the Global General Ledger SAP implementation.  This repository provides trade-level detail analysis to for internal and external global reporting.  [Bank] management has referred to the global general ledger as “a cornerstone of [the bank]’s strategy and its successful realization globally is vital to all future plans.”

Approximately 1.3 million detailed transaction data records from twenty-two different feeder systems are loaded into the Detail Repository nightly. These transactions are trade level detail records from Europe, Asia Pacific, and North America.  Geneva ERS scans the repositories’ 51 million records in 22 entities and 269 physical partitions.  It extracts 20 million records that are aggregated into approximately 480,000 summary balances.  These summary records are sent to SAP for balance sheet and summary profit and loss reporting.  This process runs in approximately 3 hours of elapsed time and 5 and ½ hours of CPU time and produces 30 different outputs.

A second Detail Repository process uses Geneva ERS and custom programs to satisfy the intricate regulatory requirements.  This system consists of 65 Geneva “Views” or programs, 4 custom programs, and 5 PERL scripts.  Geneva is executed 19 times with each execution handling a subset of the core business requirements.  During this nightly process Geneva reads 71 million records in 40 gigabytes, extract 59 million records in 30 gigabytes, and performs 229 million table joins.  The output is created in 12 CPU hours and 8 wall clock hours.  In comparison, legacy applications required 24 hours to complete a limited number of these processes.

Outputs from these processes are used for US tax and regulatory, Asia specific regulatory management, and Swiss regulatory reporting.  They include information on:

  • Capital Allocations
  • Average Balancing and Multi-Currency Revaluation
  • Syndicated Loan Netting
  • Federal and Swiss Regulatory Collateral Allocation
  • Residual Maturity
  • Interest Rate Selection
  • Product Risk Weighting
  • Specific Reserve Allocation
  • Unutilized Facility Calculation.

The view outputs include files used in additional processing or to feed other systems, delimited files, tabular reports, and inputs to a sophisticated executive information system.  The executive information system allows users to select which report to view and for what period.  The user is presented with the highest level summaries.  The user can then drill down into specific areas of interest, select ranges of data, sort columns, view other dimensions of the same data, graph the data, and export to spreadsheets.  The executive information system is accessed by as many as 50 users throughout the world.

The Geneva Views are maintained in Sybase tables accessed by Geneva ERS Visual Basic ViewBuilder front-end.  The software maintains various types of metadata including record layouts, field types, join relationships between record structures, logical and physical file partition information, as well as the actual query selection, summarization, and formatting logic.  The business logic contained within the views ranges from simple transform logic to the sophisticated globally defined business rules that make up the global general ledger accounting model.

More about this time can be read in Balancing Act: A Practical Approach to Business Event Based Insights, Chapter 20. Parallelism and Platform.

SAP High-Volume Operational Reporting/Data Warehousing: 1996

By Business, History, Technical

Summary of Sizing Concepts and Architectural Alternatives

Richard K. Roth

September 15, 1996

[More about this time can be read in Balancing Act: A Practical Approach to Business Event Based Insights, Chapter 21. ERP Reporting.]


Reasons for performance problems being encountered with operational reporting and data warehouse functions in medium- and high-volume SAP implementations are described.  A generalized approach for estimating reporting workload early in an SAP project is provided along with throughput metrics that can be used to assess whether performance problems are unlikely, probable, or certain.  The integrated ABAP/Geneva solution that was implemented to resolve the performance problems for Keebler is described, including comparative processing metrics for the alternative architectures.

This paper should be used in conjunction with the Data Warehouse Size Estimation (DWSE) model and instructions available for download from the IT Knowledge Net under the topic “Estimating Report Process Requirements.”


Experience inside and outside PW is demonstrating that the LIS, FIS, HIS, and EIS facilities of SAP are not adequate to support the full range of operational reporting or open information warehouse requirements for medium-to-large SAP implementations.  Operational reporting and open information warehouse requirements in practice mean the ability to “slice and dice” the granular operational data captured by SAP applications for a variety of business purposes.  Since our client base for SAP implementations consists of larger organizations, it is important for Price Waterhouse to identify up front when problems are likely to be encountered and be ready with solid recommendations on how requirements in this area should be accomplished. 

“Data warehousing” frequently is assumed to be the solution for the functional void in reporting that needs to be filled.  A problem with this terminology is that “data warehousing” has come to mean different things to different people.  And in some quarters, lumping the problem of slicing and dicing granular SAP data under “data warehousing” has resulted in an inadvertent underscoping of the general reporting problem that needs to be solved.

Common concepts or topics that frame notions of data warehousing include:  varying levels of data summarization, limitations on data content to the needs of particular business purposes (subject areas), time variant snapshots, focus on interactive usage and GUI presentation,  performance measurement, limited scope pilot applications as a way to get started, and data warehousing software and hardware most widely discussed in the technology press.  All of these things have their place under the right circumstances.  But, approaching the problems being faced by big businesses with new SAP applications in these terms can be misleading with respect to the fundamental reporting problem that needs to be addressed.

Reporting Problem in General

Implementing a new SAP application means that the corresponding legacy system reporting framework being used to run the business is being replaced (presumably improved).  Depending on the application, the old framework typically consists of hundreds to low single thousands of reports (on-line and hard copy).  These reports were developed in response to business problems and ways of managing the enterprise that evolved over time.  Certainly, a large portion of the legacy reports are obsolete and a new concept and framework for reporting will consist of many fewer more targeted reports.  But, what is important here is that the mission critical reporting framework must be replaced as part of the SAP implementation.  This is a general problem that should not be approached inadvertently with a set of limiting assumptions about what data warehousing is or is not.

Reporting problems come in varying sizes depending on the data basis and size of the reports in the mix.  The data basis for a report means the number and size of data records that must be manipulated in order to produce the report.  Report size means the number and length of the lines that constitute the final report output.[1]  By far the largest single factor in determining overall reporting load is data basis. 

The most important aspect of data basis usually relates to the number and size of records that must be extracted then further manipulated to produce the report output.  However, under certain circumstances, the number and size of records that must be scanned and evaluated as candidates for extract can be the primary determination of data basis.  In cases where the extract data basis is small, but large sequential table scans must be executed in order to identify the extract records, input table size becomes the relevant data basis for projecting reporting load[2].

In general, the size of an overall reporting load can be determined by aggregating the data basis and report sizes across the number of reports and frequency with which they will be executed in a given period of time.  By stratifying reporting requirements in categories of on-demand, daily, weekly, monthly, quarterly, and annual, an accurate picture of report processing load over time can be projected[3].   It generally is a good practice to project detail reporting load separately from the summary reporting load.

Until an attempt is made to project the reporting load in this way, it almost certainly will be underscoped.  Fortunately, most reports (largest number of reports) are low data-basis operational reports that typically are small in terms of data basis and report size.  As a class, these reporting requirements do not contribute greatly to overall reporting load and do not need to be evaluated in detail.  However, high data-basis reports tend to be comparatively few in number but account for the bulk of the overall reporting load.  Underscoping risk primarily is in the area of high data-basis operational reporting requirements.

Low data-basis operational reports tend to deal with business problems involving open orders or transactions for individual customers, or other small subsets of data constrained to a limited period of time.  Very often the indices that support SAP transaction processing also support the access requirements of operational reporting.  Consequently, limited table scans are required to access the few records in the data basis for these kinds of reports.  In general, the majority of low data-basis operational reporting requirements can be satisfied with a data basis under a hundred records and most with under a thousand, making them good candidates for on-demand processing.  Even a large number of people executing these kinds of reports frequently will not create an unmanageable reporting load.

High data-basis operational reports tend to include requirements for retrospective analyses of transaction processing over time.  High data-basis operational reporting requirements come from all major aspects of a business whether they be marketing, sales, product development, financial, production, human resources, procurement or legal to name some major constituencies.  And all of these constituencies are interested in the basic transactions captured and processed by the SAP modules for many of their analytical purposes. 

Keebler High Data-Basis Reporting Problem

Although the number of high data-basis operational reports tends to be dramatically smaller than the number of low data-basis reports, the data basis and size for these reports tends to be dramatically larger.  The experience at Keebler clearly bears this out:  in the original analysis of high data-basis operational reporting requirements, only 45 sales analysis reports were identified, but the data basis for these reports turned out to be approximately 23 gigabytes per week on average[4] (peak loads at end of period are much higher than the average indicates).  Examples of these reports are included below:

  • FS Sales Force Ranking
  • FS Item Customer Pch:  SIUs
  • FS Item Customer Pch:  Cases
  • Food Service Ranking – ITEM
  • Food Service Ranking – Customer
  • Food Service Comparison CASES
  • Food Service Comparison $$$
  • Food Service Budget – PTD to HYTD
  • Item Sales to Budget (Div.)
  • Diversified Retail Sales – Final
  • SBU Item Sales for Period:  (by Sales Org)
  • SBU Item Sales for Period: (SBU-Brand-Item) Corp Type
  • Item Sales to Budget (Store Door)
  • Budget Buster RANKED
  • Final Sales BT
  • Budget Buster
  • Military Account Item
  • Account Item – Trend & YTD
  • Customer Purchase (Store Detail)
  • Gain & Loss, w/ Allow.
  • FS Cat. Grouping

As should be evident from the titles, these do not represent “nice to have” reporting requirements, but are blocking and tackling period-to-period comparisons of key volume statistics needed to run the company.  The 23 gigabytes also does not include Keebler-specific requirements for data reorganization due to material, sales and other account hierarchy changes, legacy systems data integration, replacement of  reporting currently outsourced to Synectics and substantial ad hoc requirements that are anticipated.  Unless a detailed analysis of the high data-basis operational reporting requirements is done, it is easy to dismiss these kinds of basic reporting requirements as something that will be handled in background processes when time is available.

In order to understand whether a given reporting load could be handled in background when time is available, it is necessary to have some basic metrics about how fast this kind of processing can be done in a given environment.  Assuming an HP T-500 8-Way world with EMC disk (the Keebler configuration), extract processing can be done at a rate of about 6 megabytes/minute (if table joins are required, calculations need to be adjusted to reflect processing at a rate of about 3,000 joins/minute).  These numbers are consistent with information from SAP on performance in reporting and general Price Waterhouse experience with UNIX SMP machines running third-party relational data bases.  In the Keebler case, this would mean approximately 65 hours per week of continuous processing just to complete the initial extract, assuming no table joins were required.  Taking into account table joins would mean the basic extract processing would take about 260 hours of continuous single-threaded processing.  Further, according to SAP, a general rule of thumb is that reports with a data basis of several hundred thousand records will require several hours to process (this is consistent with our Keebler observations).  Considering that the 23 weekly Keebler gigabytes represent about 210 million extract records, something on the order of 2,000 processing hours per week would be a reasonable estimate (albeit a practical impossibility to implement) of the end-to-end processing load, given the Keebler environment. 

Another consideration is that, according to SAP, report sizes in excess of 50 megabytes cannot be produced at all because of application server memory limitations (no Keebler attempt at this kind of reporting got far enough to verify this limitation).  This would mean that reports would have to be split up (stacked), which in turn would increase the extract scanning load substantially.

*          *          *

To recap the Keebler experience, it was assumed at the outset that most reporting could be handled as a data warehouse problem through the SAP SIS and EIS facilities.  Early experience with using these facilities highlighted the fundamental throughput problems and other alternatives were explored, including popular data warehouse tools.  However, it was when the detailed analysis of the basic high data-basis operational reporting load was completed that the scope of the problem became apparent.  No rationalization about data warehousing being limited to performance measurement or summary level numbers could eliminate the need to produce the reports or reduce the data basis required to generate them.  Understanding the scope of the problem in terms of data basis was the key to focusing attention on architectural solutions that would work at scale.

Projecting Data Basis

Data basis is a function of:

  • Transaction volume;
  • Data structure complexity; and
  • Number of reports, level of detail, and content requirements.

Transaction volume.  Transaction detail represents the richest source of information about financial and other events captured by a system.  As part of transaction processing, information entered about individual events is augmented with data from master tables according to business rules that constitute the accounting model in place at that point in time.  Depending on SAP module and configuration options implemented, detail transaction files (e.g., SD COPA CE1 table) are posted in addition to various summary balances in the SAP data model[5].

In principle, a cumulative inception-to-date transaction file plus the related effective-dated reference tables could be the source data for all types of operational reporting.  Projecting data basis for this reporting architecture would be relatively easy:  total gigabytes of transaction files that would have to be scanned for each report, taking into account the number of reference table joins that would have to be performed as part of processing, extended by the frequency for reports over a period of time, would be the data basis for this kind of reporting load.

Taking the Keebler SD case as an example, 120,000 1,608-byte CE1 transaction records per day comes out to about 4.5 gigabytes per month or 111 gigabytes for the two years of history.  The 45 basic reports actually represent an average of about 5 alternate sequences per report, so taking into account the repetitive aspects of daily, weekly, and monthly cycles, there are about 700 elemental reports that have to be produced to accomplish the basic high data-basis monthly reporting requirement.  Producing one set of all reports directly off the detail would mean a data basis of something like 2.8 terabytes, clearly an impractical load to process at a rate of several hours per gigabyte, or even at a rate of several gigabytes per hour. 

For audit trail or archive reporting purposes, there is no way around facing up to the volumes associated with cumulative transaction files, unless these requirements simply are ignored[6].  However, for most summary reporting purposes, effectively designed summary files can dramatically reduce the data basis required to accomplish these requirements.  The degree to which data basis reductions can be realized though use of summary file schemes depends primarily on data structure complexity.

Data Structure Complexity.  The simple physics of routinely reprocessing raw transaction data as the method for producing reports was the driver behind development of the manual accounting cycle.  By determining the important summary totals required in advance, then classifying all transactions accordingly as they were journalized, detail could be archived period-by-period leaving a practical size data basis for report processing over time.  In the manual world, this required maintaining a separate set of account structures and subsidiary ledgers for each summarization structure.  General ledger accounting, cost accounting, revenue accounting, tax accounting, fixed asset accounting, inventory accounting, regulatory accounting, receivable accounting, payable accounting, and so on, all had needs for different sets of summary totals derived from various subsets and supersets of the same basic transactions.  As automation of these reporting processes advanced through the ‘60s and ‘70s, the basic subsidiary ledger structure of the manual world was preserved in the subsystem architecture of computer-based applications.

As computers became faster at re-sequencing the same basic transaction set a variety of different ways, more granular account structures started to emerge in subsystem implementations.  Product, organizational, and project dimensions started to be included in general ledger account structures by the early ‘80s simply because it became practical to extract, sort, and summarize a few hundred thousand to a million or so fully qualified summary accounts several different ways, something that was strictly impractical in a manual world.  As a result, separate subsystems were no longer required for each business function that required a different roll-up of transaction detail.  But, the determination of what could be consolidated and what had to be broken out in a separate subsystem continued to be arbitrated primarily based on the volume of fully qualified accounts implied by the dimensions included in the account classification structure. 

SAP presents a highly integrated profile for transaction processing[7], which leaves the impression that highly integrated cross-dimensional views of the data captured also should be available.  However, there are virtually an unlimited number of ways transaction-level data from SAP could be summarized taking into account all the possible cross-structural and time-delimited combinations.  Fortunately, for a given implementation, patterns in the use of cross-structural combinations usually emerge that permit reporting data basis to be minimized by taking advantage of “locality” in the transactions with respect to account classification structures (i.e., maintaining summary files).  Locality means that a large number of transactions create a much smaller set of combinations when summarized by a set of cross-structural dimensions.  How substantial the locality effect is depends on transaction coding patterns, the cross-structural dimensions that are required given the patterns of report usage, and the number, breadth, and depth of the structures — data structure complexity.

The benefits of locality are usually substantial.  But summary files are often assumed to be the “big” answer that makes the high data-basis operational reporting problem go away.  The truth is that summary files quickly get larger than the detail files if a new summary file is created for each new use of the data that requires a different slice, which means that the summary files become a significant volume and maintenance problem unto themselves.  In practice, summary files should be maintained to the point where the data basis reduction related to report production is greater than the data basis created by the need for summary file maintenance functions.  In addition, summary files should be employed as sparingly as possible due to the substantial system administration requirements that attend their maintenance and use.  In the Keebler case, only three (3) different cross-dimensional summary files maintained in multiple partitions by time-slice were required to satisfy these conditions, which is typical for data models of this type.

Modeling the degree of locality can be done fairly easily by using data from existing systems to calculate the collapse rate across structures.  Three-to-six months of the two-to-four highest volume transaction types related to the structures in question usually is enough.  Simply sorting the transaction files and calculating the average number of unique cross-structural key combinations by week, month and quarter will give a reasonably good picture of what kinds of collapse rates can be expected.  The result usually is far less dramatic than is assumed in the absence of a detailed analysis.

Frequently, special reporting requirements will emerge that have not been (or cannot be) anticipated and are not supported by summary files.  If the requirements relate to summary reports that will be produced with some frequency, the summary files can be modified and regenerated or new summary files can be created by reprocessing the detail transaction data.  If the requirements relate to special requests with no pattern of usage, or if they are audit trail or archive requirements, or if requirements are that summary files reflect the current account structure hierarchies, there must be a way to process the transaction-level detail, or the requirements must be ignored.  Said differently, unless all summary reporting requirements can be specified precisely at the outset, and audit trail, archive, or structure reorganization reporting is not a requirement, processing the transaction detail will be unavoidable.  Summary files are not the “big” answer.  They are just an important part of the solution.

Number of Reports, Level of Detail, and Content Requirements.  Easily the biggest driver of data basis is the requirement to report at a low levels of hierarchies in cross-structural combinations.  Subsystem reporting requirements typically are limited to summarizing data one structure at a time.  As a result, the number of fully qualified accounts that get created by even large volumes of transactions is limited to the number of accounts in the given structure.  By avoiding the need to report by customer/product/organization, vendor/SKU/location, employee ID/general ledger account, and so on, subsystem architectures limit their reporting data basis exposure — and consequently their usefulness.

Reporting at low levels of cross-structural detail usually is important because that is the level at which day-to-day resource allocation decisions are made.  Inventory replenishment, sales force management and compensation, promotional campaigns, and crew scheduling are examples of large dollar investments managed in atomized quantities of resources, location-by-location, vendor-by-vendor, days at a time.  High-level performance measures that show out of whack inventory turnover rates or sales per employee do not help identify the out-of-stock problems that need to be fixed or the particular sales calls not getting made.  The thousands of reports in legacy systems got created because figuring out what was going wrong required a specific cross-structural slice of the data to illuminate the drivers about which something could be done.

It is unlikely that thousands of reports are needed to run any business at a given point in time.  But, over time, the number of problems that need to be solved certainly accumulate into the thousands.  In the world of subsystem architectures, a new problem meant building a new reporting subsystem (that mostly never get retired).  In the integrated SAP world, the underlying assumption is that whatever slice is needed will be available because the transactions were captured by an integrated process.  But, as we have seen, availability of any slice means access to detail or maintenance of every summary total (a clearly impractical alternative).

Two basic approaches to defining reporting level of detail and content requirements can be taken:

  • Produce a pre-defined set of reports in case they are needed.
  • Provide a framework that can generate almost anything, but produce only what people ask for.

The first approach is easiest to manage from a project standpoint.  But, limiting the set of reports to a practical number is difficult.  And it results in an enormous amount of processing for generating numbers nobody looks at.  Also, it is not what most clients have in mind when they embark on implementing a new integrated information system.  The second approach is preferable, but more difficult to achieve.  Something that takes both into account is required.

An exercise that can be done very early in the project that promotes achieving the right balance is as follows:

  1. For the high-volume aspects of the modules being implemented, define the transaction-level data that constitutes the basis for operational reporting. 
  2. Define the lowest level of cross-structural reporting that will be supported. 
  3. Define the inter-period comparison (e.g., last year/this year, structure re-organizations) features that will be generally supported. 
  4. Define a reasonably inclusive set of cross-transaction accumulators (e.g., promoted/not promoted sales, gross sales, returns, statistic 1, statistic 2, etc.) that will be generally supported. 
  5. Define a basic set (20-40) of potentially high data-basis operational reports to be implemented that represent ranges in level of detail and content.
  6. Design the cross-dimensional summary files that efficiently support producing those reports over a two-year time horizon.
  7. Calculate the data basis for generating and maintaining the summary files and producing the reports.

(An Excel Data Warehouse Size Estimation (DWSE) model, with instructions, is available for download on the IT Knowledge Net to facilitate doing these data basis projections; see topic “Estimating Reporting Process Requirements.”)

The summary of results from the model for the current Keebler implementation are provided as follows:

FrequencyNot NormalizedTotal
DaySum of Report Count                  17
 Sum of Table Joins – Total    124,818,052
 Sum of Recs Read – Total      42,960,000
 Sum of MB Read – Total            34,005
 Sum of Records Extracted      22,774,204
 Sum of MB Extracted              2,004
WeekSum of Report Count                  54
 Sum of Table Joins – Total    873,863,473
 Sum of Recs Read – Total    182,064,000
 Sum of MB Read – Total          173,751
 Sum of Records Extracted      57,880,800
 Sum of MB Extracted              5,708
PeriodSum of Report Count                  75
 Sum of Table Joins – Total    507,524,187
 Sum of Recs Read – Total    506,832,000
 Sum of MB Read – Total          337,157
 Sum of Records Extracted      56,289,420
 Sum of MB Extracted            21,670
QuarterSum of Report Count                    2
 Sum of Table Joins – Total 1,514,661,174
 Sum of Recs Read – Total      16,584,000
 Sum of MB Read – Total              7,500
 Sum of Records Extracted      10,209,000
 Sum of MB Extracted                876
Half yearSum of Report Count                    9
 Sum of Table Joins – Total                   –  
 Sum of Recs Read – Total    105,300,000
 Sum of MB Read – Total            89,771
 Sum of Records Extracted      19,908,000
 Sum of MB Extracted              6,993

Even though specific configuration details are not available early in the project, the drivers of data complexity, level of detail, and report content are sufficiently predictable based on existing business processes to yield very useful results.  This exercise forces the scale issue to surface early so that informed decisions can be made about hardware and software platform.  It illuminates the generalized nature of the reporting problem and promotes implementing a generalized solution rather than a one-program-one-report architecture. 

Demonstrating a generalized solution helps with scope control because it is clear how new requirements can be accommodated as they evolve over time.  It minimizes the tendency to define everything the existing system does plus everything new that can be thought of into the requirements.  And it shows that so many possibilities exist that there is no hope of producing everything just in case somebody wants to use some of it. 

Most importantly, this exercise makes it possible to make meaningful economic decisions about how much and what kind of reporting will be supported.  The big drivers of cost in reporting are data basis and number of reports.  By creating this model early in the project that can be exercised as new ideas are explored, expectations about reporting requirements and technology can be managed on a factual rather than emotional basis.

Components of a High Data-Basis Operational Reporting Solution

Given that SAP facilities alone are not adequate, something else is required.  A basic assumption in the market is that some sort of data warehouse solution probably is the answer.  In addition, the components of the architecture are widely assumed to be based around SMP or MPP hardware, a relational and/or multi-dimensional data base and some OLAP presentation-layer tools.

The problem is that the heavy data manipulation load associated with high data-basis operational reporting is not a process that works well on bus architecture machines using these kinds of access methods.  Only by ignoring the high-volume aspects of reporting requirements can these kinds of solutions be expected to work in a cost-effective way.

Current SAP operational reporting/data warehouse strategies seem to fall into three major categories:

  1. Use SAP summary level reporting features (SIS, LIS, EIS, etc.) and access the data directly from SAP data structures to support very low data-basis operational reporting requirements.  General “slicing and dicing” of granular detail cannot be supported.  Practical limits on summary table size are in the 10,000s of rows.
  2. Extract SAP data and create a series of summary files in relational or multi-dimensional data bases for access by third-party data warehouse tools to support low-data basis operational reporting requirements.  General “slicing and dicing” of granular detail cannot be supported.  However, options for broader-based interactive access to pre-defined summary tables are available compared to native SAP facilities.  Practical limits on summary table size are in the 100,000s of rows.
  3. Extract SAP data and create a separate operational data store containing detail and summary files as required.  Use Geneva or custom COBOL/Assembler programs for extraction and reduction processing on an MVS machine.  Use third-party data warehouse tools or SAP/ABAP facilities for presentation-layer functions on MVS, UNIX, or other platforms as required.

In low data-basis situations, or where reporting requirements can be constrained, or where reporting requirements can be precisely defined at the outset and remain stable over time, a creative implementation of Strategies 1 or 2 may be adequate.  However, for larger clients with even medium transaction volumes and moderately complicated data structures, including an MVS-based operational data store in the architecture (Strategy 3) is a very practical solution for dealing with the inherent data bandwidth problem.  In addition, the MVS environment provides for managing and achieving high throughput for the multiple concurrent batch processes characteristic of operational reporting problems.  Complex batch streams with contingent dependencies are not similarly well supported in the UNIX world.

The Strategy 3 architecture that was implemented to solve the Keebler problem very successfully demonstrates the essential elements of a general solution:

In this solution, the heavy data manipulation processes and archive requirements were off-loaded from the SAP UNIX machine.  This left the SAP R/3 environment free to be tuned for transaction processing, unencumbered by extensive summary file posting requirements and accumulating detail data volumes.  High data-basis report generation takes place on an outsourced MVS machine and answer sets are returned to the SAP/UNIX environment for access through an ABAP report viewer that appears to be a completely integrated part of the SAP application.  It is an extremely cost-effective way to provide broad-based desk-top access to high-volume operational data.

Typical CPU and wall times for end-to-end processing of the reporting load described above using the Geneva engine are as follows:

             Frequency                            CPU Time                   Wall Clock Time

Daily                                                   15 min                             1 hour

Weekly                                                150 min                           3.5 hours

Period Summary File Build       300 min                           8 hours

Period Reports                               350 min                           6 hours

(CPU/wallclock ratios vary due to differences in tape processing requirements, time of day the jobs are run, and whether the Geneva configuration is set up for single or multi-threaded processing — net result:  cumulative processing time over the course of a month is approximately 26 CPU hours)

A more general presentation for high-volume situations that accounts for the various tool sets that could be employed would look as follows:

Event-level data from SAP, legacy, and other operational systems would be extracted and transformed using various tool sets as appropriate under the circumstances.  The transform process would primarily deal with data cleansing issues (i.e., gaps, inconsistencies, and errors).  Data reduction (i.e., summarization level and content) would be left primarily to the  ODS component managed by Geneva or custom programs.  The primary advantage of this separation is that this provides a systematic framework for re-running reductions over time as new requirements emerge.  Reductions can be in the form of end-user report files delivered directly through some presentation layer, as was done in the Keebler case with ABAP on the UNIX machine.  Or they can be data sets for downstream processes, including replacements or updates to relational or multi-dimensional data bases, data mining applications that require pre-processed inputs, or interfaces to other systems.

Absent an ODS component in medium- and high-volume situations, requirements must be defined very precisely at the outset, they must be stable over time, and limited to what the assumed technology platform can practically accommodate.  As early SAP implementation experience has shown, long-range requirements are difficult to define.  And volumes that can be feasibly accommodated are dramatically less than most imagine.

However, by incorporating an ODS on a high-data bandwidth platform for systematically reprocessing transaction-level data and maintaining/re-generating summary file structures, highly cost-effective and practical scale and flexibility can be designed into the architecture from the outset.  SAP high data-basis operational reporting and data warehouse problems can be readily solved by using right tools for the right job in medium- and high-volume situations.

[1] For example, a data basis of all invoices for the last year would be a large data basis when compared to only invoices for yesterday, which would constitute a relatively smaller data basis.  Last year’s invoices or yesterday’s invoices could be summarized at a high organizational level such as division, which would result in a small report size, or either data basis could be summarized by customer, item and week, which would result in a relatively larger report size.

[2] Detail (audit trail) reporting is a class of analytical reports that fits this profile. The general problem is to enumerate the detail transactions that support summary numbers accumulated as part of transaction processing, or summary numbers generated on other operational and analytical reports.  Unless an index is available that permits going directly to the small subset of  transactions required, sequential table scans are required to pick out the transactions that qualify for a given set of selection criteria.  If the criteria require that table joins be executed as part of the processing, this needs to be accounted for accordingly.

While the extract data basis and consequent report sizes are small for the bulk of detail reporting requirements, the number of detail report requests can be substantial and the primary selection criteria can be based on any field in the data base, not just those with indices.  The more robust the summary analytical reporting framework, the more demand there will be for detail reporting to substantiate the summary numbers.  And the more likely it is that the robust summary numbers will be based on dimensions of the data base that do not have indices to facilitate selection processing.

[3] The nature of the processing necessary to produce reports also is an important factor.  To the extent that the source data structures for a report are denormalized, or a report requires only a single table for input, data basis and size provide a good surrogate for describing reporting loads.  To the extent that normalized structures are involved that force numerous table joins to be performed as part of extraction, calculation and formatting processes, projections of overall reporting load must be adjusted accordingly.

[4] The weekly data basis would have been over 20 times the 23 gigabytes if all processing had been done using transaction level data (basic transaction volume is 120,000 order lines per day @ 1,608 bytes per line).  Based on the reporting requirements identified, summary file structures were defined that would reduce total data manipulation load necessary to produce the reports.  As new reporting requirements emerged during development, summary files were repeatedly modified and regenerated from the transaction detail as necessary to support efficient production processing.  It is expected that the summary file structures will continue to be modified and regenerated over time as new reporting requirements evolve.

[5] In some cases, obtaining detail transaction records from SAP is difficult for a variety of reasons.  Since this is the richest source of information, as well as the basis for auditability of numbers generated by SAP, it is important that we resolve the open issues and develop consistent recommendations surrounding how all necessary transaction detail should be obtained from SAP, where it should be stored under varying volume situations, and how it should be accessed.

[6] Fortunately, audit trail reporting tends to involve large table scans but small extract sets, which means that processing subsequent to extraction is minimal.

[7] SAP highly integrates the interrelated aspects of transaction processing that traditionally have been disconnected by subsystem architectures.  As transaction detail is captured in SAP, a very rich set of attributes about each business event is recorded.  On a go-forward basis, business rules can be modified to accommodate new requirements or other changes over time.  And SAP provides for posting a wide variety of summary records at transaction processing time to facilitate inquiry and operational reporting requirements.  But, the determination of what summary totals will be required for each aspect of managing the business remains part of the up front configuration exercise.  The ability to adapt to retrospective information requirements that were not foreseen up front, or new requirements that emerge over time, is very limited in the SAP world.

SAP will post only those summary totals that it is configured to post at the time a transaction is processed.  No systematic facilities are provided for accessing or analyzing the rich detail that gets captured.  There is no framework for mass regeneration of summary totals that are posted in error due to configuration problems.  There is no general method for reorganizing data to reflect retrospective changes in account hierarchy relationships.  SAP effectively addresses problems with transaction processing integration.  But SAP does not effectively address integration problems related to high data-basis operational reporting.

Cookie Maker Builds Sales Reporting System: 1996

By Business, History, Technical

The Problem

In 1996 a major US cookie manufacturer installed SAP to manage all aspects of the manufacturing, HR and financial operations.  The system captured data and provided adequate information for managing most operational aspects of the business.  But SAP could not deliver the detailed sales analysis necessary to support the highly detailed, store-specific sales process.  They chose Geneva ERS to solve the problem in an integrated SAP environment.

The reporting requirements were substantial.  The company required detailed information at the customer/line item level over various time slices from daily to year-to-date, for over 60,000 customer stores and 2,000 items, selecting and summarizing by over 100 fields.  The company sells over a million items per week, and they needed comparative information for each time period for at least the last two years.  They also needed comparative data be reorganized to reflect changes in organizational structure over time!

The Solution

A Geneva ERS application was developed to deliver the full benefit of the SAP implementation, without burdening users with another tool.  SAP’s Profitability Analysis module was configured to capture the required data.  Programs were written to extract the sales items from SAP on a nightly basis.  A series of Geneva ERS processes were developed to manage the data warehouse, including creation of various summary file structures.  The completed data warehouse contains over 250 gigabytes of data in 15 entities, containing over 300 million records.

Geneva ERS was configured to produce executive information files, commonly called “information cubes.”   The company defined Geneva ERS “Views” to create over 150 cubes, and 75 interface files. Geneva’s effective date lookup processing allowed for recasting the historical detail to reflect the current management structures. 

The SAP Executive Report Viewer was also developed.  Constructed within the SAP environment and fully integrated with the Profitability Analysis reporting, this analysis tool gives users a single source of information.   Through this tool, users are able to select a time period, and then the respective cube.  Users can specify selection criteria, drill down into reports to lower and lower levels of detail.  They can export to Excel, view the data in a graphical format, format the report, and save a profile of the report.

The Results

Users have commented that the sales reporting system has been one of the most successful components of the SAP implementation.   It has allowed the company to capitalize on it national sales force that makes personal contact with stores nearly 3 times each week.  The sales representatives are armed with needed information.  They can analyze customer and product information for daily, weekly, monthly, year to date, and spot trends from over 3 years of data.  The system has helped fuel their growth in net income nearly 10 times from 1996 to 1999, and has scaled as the company has purchased and integrated three additional companies.

More about this time can be read in Balancing Act: A Practical Approach to Business Event Based Insights, Chapter 21. ERP Reporting.

State of Alaska Benefits from Data Warehouse High-Performance Solution: 1996

By Business, History, Technical

The following brochure described the Geneva Solution for the State of Alaska. The origins of Geneva started in the early 1980’s with a custom developed financial system for Alaska. In circa 1994 the newer version of the tool (version 3.0) was installed to enhance the reporting processes.

More about this time can be read in Balancing Act: A Practical Approach to Business Event Based Insights, Chapter 18. Input/Output.

Platform Assignment Principles for Decision Support Systems and Data Warehouses: 1996

By Business, History, Technical

by Richard K. Roth, Principal, Price Waterhouse LLP

Eric L. Denna, Ph.D., Warnick/Deloitte & Touche Faculty Fellow, Brigham Young University, Marriott School of Management

This is reprinted from the February 1996 issue of Manufacturing Systems (C) Chilton Publications, All Right Reserved. [More about this time can be read in Balancing Act: A Practical Approach to Business Event Based Insights, Chapter 16. Data Warehousing.]


The concept of Decision Support Systems (DSS) driven by data warehouses is a powerful one in our “data rich but information poor” world.  In any given situation, a wide range of computer hardware and software technologies potentially could be all or part of a solution.  To avoid solutions that are unnecessarily costly, or solutions that simply do not work, guiding principles for allocating functions to computing-platform technology are important. 

The wide range of architectural issues that must be addressed makes it easy to focus on some areas to the exclusion of critical considerations.  This article presents a framework for defining DSS/data-warehouse requirements, with corresponding principles for making computing platform assignments, that help avoid the pitfalls.  Some recommendations for getting started without getting derailed also are provided.

The Framework and the Problem

Decision Support Systems (DSS) can be thought of as having four basic components:

  1. Data Capture
  2. Data Storage and Retrieval Engine (data warehouse)
  3. Data Analysis
  4. Presentation Layer

The presentation layer very often gets the most attention at the outset of a DSS discussion or project.  It is the easiest to touch and feel and provides the easiest visualization of how problems could be posed and end results communicated.  However, the majority of the business purposes for a DSS are addressed by the first three components.  And the majority of the cost (85% or more in Bill Inmon’s experience[1]) is incurred in storing and accessing the detail data that drive the DSS applications — component #2.  An effective presentation layer is important, because DSS accessibility by business analysts without the routine intervention of technical personnel makes a DSS more useful for a broader range of problems.  But, for problems big enough to be worth solving, even the most cumbersome presentation layer is better than no solution at all.

In an extremely simple case, one piece of software (e.g., spreadsheet) running on one hardware platform (e.g., PC) could serve the requirements for all four components.  However, most real-world problems require an integrated solution consisting of several hardware and software technologies.  Choosing the right technologies depends on the nature of the problems being addressed.  And the problems vary along the lines of the four basic DSS components described above.  It is important to consider the requirements for each component separately, because lumping them under a single heading makes it difficult to choose between alternatives.

Data Capture

DSS applications may use data from a variety of internal or external sources.  In most cases, the primary and high volume data will come from operational systems used to run the business on a day-to-day basis.  The detailed data at the bottom of operational systems (captured at great cost then archived — by design) are the primary raw materials for understanding the patterns of interactions between suppliers, employees, business processes, customers, markets, products, services, pricing decisions, and financing alternatives.  

The architecture of operational systems makes them good sources for data to drive DSS applications, but the same architecture makes them terrible models for DSS components 2-4.  Operational systems tend to be functionally oriented and optimized for “getting today’s work off the table.”  Operational data tends to be very detailed dealing with individual or a series of related transactions that have a limited life-cycle, from an operational standpoint.   Management information system components of operational systems tend to support the limited-life-cycle clerical and management aspects of their respective narrow functional areas.  But they do not support a “retrospective” subject-area or enterprise view of transaction processing over time, which is the general goal for DSS. 

Data in operational systems is usually spread over multiple technology platforms (including paper files), and is maintained according to different update cycles and classification structures.  Part of developing a DSS includes rationalizing or converting the operational data so that it is comparable across sources where different aspects of the same or related underlying transactions are being represented.  Regardless of the technical platform or architecture for the DSS, the task of rationalizing data from various sources is the same, and it should not be underestimated.

It usually will be necessary to capture some new data to support the goals of a DSS initiative.  It is important to remember that capturing new information is inherently an operational problem, not a DSS issue per se.  As such, it should be treated as a problem for integration with existing operational systems, if practical or logical, or as a new operational system.  System capabilities to fulfill new data requirements should be designed using an operational application architecture appropriate for the business processes from which the new information will be captured.  A common mistake is to mix new data requirements that surface as part of defining a DSS environment with the requirements for components 2-4, which tends to bias (confuse) hardware and software platform choices.

Data Storage and Retrieval Engine (data warehouse)

Following Bill Inmon’s model for a data warehouse[2], data is maintained in four major classes:

  1. Current detail
  2. Old detail
  3. Lightly summarized
  4. Highly summarized

Current and old detail corresponds most closely to the detailed transaction-level data from the operational systems discussed above.  Lightly summarized data amount to reductions of the detail appropriate for the general requirements of a particular department’s DSS applications.  Highly summarized data amount to further reductions of the lightly summarized data for specific DSS modeling requirements.  The primary difference between each of the four classes of data is VOLUME.  And volume is the primary (although not only) factor that determines the appropriate technology platform for storing and retrieving data in a warehouse.

Classes 1 and 2 represent the highest data volumes and the richest source of information.[3]  Since the volume of detail data is the primary cost driver for data warehouses, the largest impact on the cost (or cost avoidance) of a DSS environment can be achieved by making the right technology platform choices in this area.

Current detail usually is thought of as being roughly two years of history and maintained on magnetic disk devices.  Old detail is roughly assumed to be up to 10 years of history and maintained on tape or other low cost storage devices.  Depending on the scope of the warehouse, nature of the business and size of the enterprise, volumes in the multi-million/multi-billion record range should be anticipated for annual additions to enterprise-wide detail data.  Storage space required can range from hundreds of gigabytes to terabytes.

Lightly summarized and highly summarized data can vary greatly in degree of volume reduction from the corresponding detail.  Some DSS applications require routine reprocessing of large subsets of the detail (customer scoring applications would be an example).  Probably most applications, however, can be satisfied with a very small subset of the data summarized at one or two orders of magnitude of reduction (modeling financial performance for individual products or categories of products over time would be an example).  Because of the potential volume reductions, the platform options open up substantially when dealing with lightly and highly summarized data.

For any large volume of data (large could be as small as a few hundred thousand rows or as large as a few billion rows), the nature of the processing required, in combination with the absolute number of rows/records, is the driver for platform choice.  Three basic hardware architectures and two basic file organizations are available:

1.    Hardware architectures

A.   Mainframe — Offers very high data bandwidth (4.6-17 megabytes per second on each of up to 256 channels per machine) in a shared memory, parallel processing environment.  Excellent architecture for situations where large amounts of data on disk or tape must be evaluated for selection or re-sequencing purposes.  Excellent environment for situations where enterprise-wide access is important.  Drawbacks are:  1) the relatively high cost of disk storage (presently about 4:1 — but coming down quickly — compared with server technology) and processing power (MIPS); and 2) the relatively thin market for “friendly” presentation layer tool choices.  This environment is often referred to as “data rich but MIPS poor”.

B.   Server — Offers cheap processing power (MIPS) and memory, but relatively low data bandwidth (0.1-1.25 megabytes per second with only one channel per machine)[4] in a single processor environment.  Excellent for situations where all data can be loaded into memory before processing (e.g., personal applications like spreadsheets, word processing and graphics) or where relatively small amounts of data must be accessed for selection or re-sequencing purposes.  The cheap processing power and memory make it cost-effective to consume these resources in the long instruction path lengths necessary for implementing “friendly” graphical user interfaces.  As a result, the market is relatively rich in “friendly” presentation layer tool choices.  This environment is often referred to as “MIPS rich but data poor.”

C.   Symmetric Multi-Processors (SMP) — Uses cheap server technology running in a parallel processing environment (currently 4-20 processors per machine) to scale up the power of server technology described above.  Excellent environment for accessing large stores of data on magnetic disks through indices where instruction path lengths are long but the amount of data that must be moved from the disks to the processors for selection or re-sequencing purposes is relatively small.

2.   File organizations

A.   Sequential — Primary method for storing old detail in operational systems and warehouses.  Also, the primary file structure used for transferring data from various platforms to the data warehouse and between subsystems of a DSS environment.  Depending on the configuration, a mainframe can read sequential files at a rate of 1 to 20 or more gigabytes per minute.  Server technology can read sequential files at rates of .002-.1 gigabytes per minute. a difference of 1 to 4 orders of magnitude, depending on specific machine architectures and configurations.  Excellent method for storing data warehouse information where many data elements can serve as the primary selection criteria and no practical amount of indexing will materially reduce the need for sequential table scans if stored in a relational file organization

B.   Relational — Primary method for storing operational data to support on-line transaction processing.  Also, the preferred method for storing lightly, and especially, highly summarized data in a warehouse application where usage tends to result in one or a few data elements being the primary selection criteria.  The primary advantage of this file organization is that relational data can be accessed through SQL, which is the primary method through which most user “friendly” tools communicate with data warehouses.  Disadvantages are that random access to data is very slow — on the order of 6-12 megabytes per minute on mainframe or server technology.  In addition, the instruction path length is very long for resolving SQL instructions, which consumes substantial processing power (MIPS).  As a result of the long SQL instruction path lengths, data warehouse applications where relational organizations are appropriate tend to be implemented on server or SMP technology where cheap MIPS are available.

In summary, platform choice for the storage and retrieval engine depends on how much data there is and what will be done with it.  Current and old detail tends to be high in volume and used to feed specialized departmental and individual warehouses.  Detail data often will be accessed sequentially, because many elements can be used as the primary selection criteria. Therefore, mainframes tend to be good platforms for the detail portions of data warehouses.  Departmental and individual warehouses tend to be lower-volume reduced answer sets of the detail data.  These warehouses often are created to support a particular focus of analysis and, consequently, tend to be good candidates for relational storage and access through indices.  Consequently, server/SMP technologies can be good platforms for the departmental and individual portions of data warehouses.

Data Analysis

Once a particular DSS problem is formulated, one or more analytical models may be appropriate.  The most common DSS problem requires selecting an answer set from the warehouse that meets specified criteria.  Then, according to other criteria, the preliminary answer set is augmented, transformed, re-sequenced, summarized/counted, formatted and displayed as the result.  The result is usually in the form of a cross tabulation or often a single line.  Basically, this kind of analytical framework is known as data reduction or, more simply, reporting.  The storage and retrieval component of the data warehouse always has a reporting framework of some competency as an integral part of its functionality.  However, the requirements of any particular DSS problem may require data reduction/reporting functions not available directly from the storage and retrieval engine. 

Separate reporting and data analysis facilities may be necessary to accomplish particular objectives.  More advanced data analysis requirements would include statistical and econometric models that employ regression, factor analysis or other trend, pattern recognition and exception analysis techniques.  For small amounts of numerical data, predictive models can be implemented using work-station-based neural network technology.  For large amounts of data with complicated business rules, the mathematics of fractal geometry can be employed to deal with data compression issues to facilitate storage and processing, usually on a mainframe.  Using these kinds of techniques almost always requires transferring an answer set provided by the storage and retrieval engine’s reporting mechanism to a separate software package (which may be on a different platform).

It is tempting to bundle the storage and retrieval engine with the data analysis component if DSS is viewed as primarily a reporting problem.  However, thinking about the data analysis problem in a broader context clarifies the difference between the two components and demonstrates the importance of separately defining DSS requirements for each area.

Presentation Layer

The presentation layer is the mechanism through which a DSS user supplies the criteria, parameters and instructions that the DSS components use in their processing. Also, the presentation layer is the delivery mechanism for results produced by the DSS programs. 

The image that comes to mind of a DSS presentation layer includes a powerful workstation configured with a high resolution color monitor running a graphical user interface (GUI) displaying charts and numbers that can be traversed using point-and-click drill-down features.  Ideally, the presentation layer would make use easy for non-technical people and would not require that they have a detailed understanding of the underlying data in the warehouse.  The user interface would be intuitive so that learning and use without instruction manuals is facilitated.  Further, it would be good if it provided fully integrated and seamlessly automated control over all DSS components, from access to data captured by operational systems through delivery of query results.  Complete user control over DSS facilities from the presentation layer would mean that technical MIS personnel would not need to participate in operation of the system. 

Notions of the “ideal” notwithstanding, it is important to remember that the result of a DSS query may be one or a few numbers.  It may be an automated file to be used in some other application.  Frequently, the output will be a lot of numbers that are more useful if printed in rows and columns on paper.  Where non-technical users are concerned, they are more resourceful than they generally are given credit for.  People with real problems to solve can deal with an interface more complicated than just anyone could use. 

Reality is that very few things are totally integrated and automated — the current state of the art in DSS presentation layers in particular and information technology in general certainly are good examples.  One reason is that increasing levels of integration and automation lead to increasing complexity and cost that simply is not worth it in many cases.  Another reason is that DSS presentation layer technology is relatively new.  We are a long way from the day when users who do not know what to ask for can “just get what they want” through a holistic interface.  And MIS people will be in the middle of non-trivial DSS applications because the hardware and software components that make up the systems will remain a complicated specialty for the foreseeable future.

The primary reason for building a DSS is to provide a mechanism for understanding the patterns buried in otherwise inaccessible data.  The objective is not to use client/server technology, give analysts a GUI, use SQL everywhere or automate and integrate everything in sight.  Operational systems are capturing huge amounts of information at great cost that is potentially valuable but not being used.  DSS applications that exploit that information can be developed relatively quickly as long as tools available are applied with common sense in proportion to the nature of the problems being addressed. 

Presentation layer characteristics are probably the most diverting technical distractions that stop DSS solutions from getting implemented, or even started.  Presentation layer desires often lead to platform decisions that shoehorn DSS functionality into technology not fit for the purpose — the computer equivalent of building skyscrapers out of tinker toys. 

Requirements for the presentation layer should be defined after addressing the first three DSS components described above.  This minimizes the tendency to get distracted from the primary objective and the temptation to allocate functionality to sub-optimal platform technologies.

Getting Started

Most executives are inherently comfortable with the notion of a data warehouse and DSS. It seems logical that all the money spent on capturing detail transaction-level data ought to pay off for something more than filling orders, collecting receivables and getting the bills paid.  The idea that there is something to be learned from a thoughtful analysis of all that history is a natural one.

The first hurdle usually comes when trying to cost justify the first DSS application.  The history of MIS in general is one great in expectations, high in cost, fraught with problems, late on delivery and mediocre in results.  The comfortable idea of a DSS sometimes becomes an anxious thought once a specific proposal gets formulated.

Anxiety notwithstanding, many companies have found that this is an idea that actually works.  It is almost certain that some fundamental assumptions about customers, suppliers, markets, etc. will not be supported by the historical record.  DSS applications can be the catalyst for identifying and justifying fundamental changes in strategy or tactics — Product services organizations have fundamentally changed their maintenance agreement packages and pricing; manufacturers have identified subtle causes for out of bounds product failure rates; direct mail operations have quit sending promotions to unresponsive customer segments; financial institutions have developed radically more lenient credit exception policies; and retailers have been able to validate (or not) their view of fashion trends.  It usually proves worth doing if the development pitfalls are avoided.

To get over the first hurdle and avoid the pitfalls, we recommend the following:

1.   Pick a problem that is worth solving for which virtually all of the data necessary to drive the data analysis is available from one or a small number of existing operational systems.  Do not burden the DSS project with a non-trivial operational system re-engineering project.

2.   Prove the value of doing retrospective data analysis for the problem on the table.  The data analysis component is where the value is delivered.  The other three DSS components represent infrastructure that makes it possible to roll out DSS on a larger scale.  Do not worry about metadata for source systems or the warehouse.  Do not worry about the long run architecture for the enterprise-wide detail warehouse.  Do not worry about presentation-layer standards.  View this problem as an individual or small departmental data warehouse object case.  Get a hard dollar cost/benefit success for DSS that will relieve some anxiety in downstream cost justification proposals.

3.   Define the requirements and allocate functionality to hardware and software platforms according to the principles outlined above for the four major DSS components.  It probably will not be possible to satisfy the requirements on one PC and a spreadsheet, but keep the number of platform technologies to a minimum consistent with the principles. 

4.   Constrain the requirements by the existing hardware and software portfolio in place where possible.  Do not take on new learning curves unnecessarily.  Avoid getting distracted by technology at all costs.

[1]The Computer Conference Analysis Newsletter, May 18, 1993 n319 p12(1)

[2]Inmon, W.H., Building the Data Warehouse, New York:  John Wiley & Sons: 1993, p. 33.

[3]Roth, R.K., and Denna, E.L. (1994) “A Summary of Event-Driven System Concepts,” Price Waterhouse LLP, White Paper

[4]Theoretically, disk I/O bandwidth could be 8-20 megabytes per second, depending upon bus architecture, however, achievable throughput is dramatically lower as indicated.

Original Images

These are the original images of the report.

Practical s/390 Parallel Processing Option for Data Warehouses: 1995

By History, Technical

by Richard K. Roth, Principal Price Waterhouse LLP, Sacramento, California

November 8, 1995

[More about this time can be read in Balancing Act: A Practical Approach to Business Event Based Insights, Chapter 20. Parallelism and Platform.]

Why Parallel Processing?

Data warehouse applications differ in a basic way from traditional transaction processing and reporting applications:

  • Transaction processing applications usually process a day’s or a month’s activity, then archive the detail and clear out the files to prepare for the next day’s or month’s activity.  Reporting functions for these applications are primarily limited to supporting the operational and clerical aspects of the narrow functional area addressed by the application for just the days or months that are open.  Even for “high” volume applications (e.g., call detail billing) the volume never really gets “large” because of the archiving process.
  • Data warehouse applications tend to deal with “large” volumes of data since their primary purpose is to perform a retrospective analysis of the cumulative effects of transaction processing over time.  Some uses for data warehouses have similar characteristics to transaction processing applications in that specific requests may only require access to small amounts of data
    (e.g., an individual customer’s profile and particular month of call detail activity).  However, most analytical uses of data warehouse information require scanning large partitions of the warehouse to identify events, or customers, or market segments that meet particular criteria.   It is in the scanning kinds of uses where the basic architecture for building transaction processing systems breaks down when applied to data warehouse applications.

The basic architecture for most transaction processing applications relies on indexed access to a few records in large files for purposes of validation or update (DB2 primarily is an indexed access method).  The advantage is that if  you only need to access or update a few records, and you can find the records needed through an index, a lot of time can be saved by touching only the records required rather than searching through all the records just to find the few that you want.  In general, indexed access works at about 200,000 bytes per second (about an hour and a half per billion).  So, as long as the transaction processing or reporting involved only requires that single billions of bytes be touched in the process once a processing cycle, indexed access is sufficiently fast for the purpose.

Data warehouses, however, can have tens, hundreds, or thousands of billions of bytes (gigabytes).  In addition, the more detailed the data warehouse, the more difficult (impractical) it is to keep indexes on all the fields that might be needed to find the records.  Consequently, using indexed access methods to process data warehouses could take hours or days, depending on size and arrival rate of processing requests.  And the practical effect in many cases is that, under the covers, the indexed access method ends up doing a sequential scan of the files.

Sequential access methods are the basic alternative to indexed access methods.  The advantage of sequential access is that it is very fast (1-20 million bytes/second depending on hardware and software configuration) compared to indexed access.  Since much of the processing that goes on with a data warehouse is sequential anyway, it makes sense to take advantage of this speed where possible.  Further, if the processing can be organized so that multiple sequential table scans can be initiated in parallel[1] rather than serially, enormous amounts of data can be scanned in a relatively short period of time (at 4 million bytes/second, a gigabyte can be read in about 4 minutes; if 40 of these sequential threads were started in parallel, half a trillion bytes could be read in an hour).  The mathematics of parallelism really add up for data warehouses.

What are the practical alternatives for implementing parallel processing applications?

There are two basic approaches to achieving parallelism in processing:

  • Specialized hardware architectures (i.e., massively parallel processing “MPP” computers) can be used that initiate parallel scans on separately paired CPU and disk assemblies (known as “shared nothing environments”).  Teradata and Tandem are examples of two companies that manufacture computers in this class.
  • Software can be used to initiate multiple parallel scans on computers that share a common pool of disk and tape devices (known in two groups as mainframes and symmetric multi-processor “SMP” machines).  IBM, AMDAHL, and Hitachi are companies that manufacture mainframes.  Hewlett Packard, Silicon Graphics, and Sequent are examples of companies that manufacture SMP machines.

Until recently, the MPP hardware approach was almost always the most cost-efficient way to achieve parallelism for data warehouse processing.  MPP machines are manufactured using CMOS processor technology and work with disk drives that historically have been less costly than a corresponding mainframe configuration.  In addition, the only software available for the mainframe that really exploited its parallel processing, multi-tasking, high data bandwidth capabilities were CICS, DB2 and JES, none of which are particularly useful for initiating multiple parallel sequential table scans.  Implementing the software alternative on a mainframe platform required developing custom systems-level software for the purpose, which some large organizations facing this problem have done.  SMP machines continue to be designed primarily for on-line processing and characteristically demonstrate very low data bandwidth in practice, which makes them generally unsuitable for any large application of this type.

MPP hardware is a workable solution, but it also presents some technical and economic challenges.  In many cases, the high volume data that will populate the warehouse already is resident on the mainframes where it was captured in the first place.  Duplicate copies of all event, reference, and code-set files must be maintained in both environ-ments.  Interactive applications often have to be maintained in both environments to access the same basic data.  Businesses with the fixed cost of a large mainframe infrastructure behind them are reluctant to take on the burden of a second environment, including new skills require-ments.  And it costs millions of dollars to buy capacity sufficient for making batch processing windows, which for the most part sits idle because of the limited purpose nature of the technology.

Over the last several years, however, mainframe CPU, disk, tape, and software prices have come down dramatically.  In the early 1990s, mainframe CPU power cost over $100,000 per MIPS (million instructions per second).  Today, mainframes are being made using the cheaper CMOS technology, and mainframe CPUs can be purchased in the $14,000 per MIP range.  High performance tape and disk devices are available at comparable prices regardless of platform.

Price Waterhouse also has developed parallel processing software that permits parallel sequential table scans to be initiated in mainframe data warehouse applications without requiring systems programmer intervention.  This software is known as Geneva V/T  and generally offers 10-50 fold CPU consumption and throughput gains over baseline COBOL and DB2 techniques.  The combination of a more than five-fold reduction in price of the basic hardware and a ten-plus factor increase in throughput and compute efficiency from the Geneva V/T  parallel software has turned around the basic cost/performance calculations in favor of the mainframe for these types of data warehouse applications.

How can you tell which approach is best for a given situation?

Evaluating a mainframe software approach first for exploiting parallel processing offers some substantial advantages.  Geneva V/T  can support processing from tape as well as disk, which means that historical event data can be accessed directly from existing datasets without the need to buy additional hardware just to get started.  This permits conducting a mainframe proof-of-concept demonstration to verify that the data being considered for warehousing truly is valuable for resolving pressing problems and to prove the processing metrics for the case at hand in the existing production environment.

Assuming that access to the data proves to be truly valuable, use in an operational context may necessitate the purchase or allocation of additional compute and disk/tape resources.  But, because new resources simply will be added to the existing SYSPLEX configuration in most cases, peak load requirements probably can be accommodated through scheduling rather than outright purchase of excess capacity necessary to get through the humps.  This most likely will result in a substantially lower overall net hardware purchase compared with what would be required if an MPP hardware approach were adopted.  In addition, during periods when high volume data warehouse files are not being processed, whatever additional hardware is purchased is available for other applications experiencing peak-load demands.

What does a top-down picture of this mainframe environment look like?

An graphic view of what we refer to as the MVS Operational Data Store Architecture is provided in the diagram below:

DB2 provides a complete relational database environment for transaction processing applications and intermediate volume on-line and batch extract/reporting processes.  Where high volume parallel processing techniques are appropriate, Geneva V/T  provides the mechanism for accessing on-line DB2 and VSAM files in place and sequential files on either disk or tape.  Where interactive queries are required, DB2 provides the framework for interactive execution.  Where volume problems dictate minimizing the number of times datasets are accessed, Geneva V/T  provides for a generalized single pass architecture where multiple queries are resolved simultaneously in one pass of the data.

The Geneva V/T  disk or tape option is especially important since new tape technology is now available with 30 gigabyte (compressed) capacity per volume and I/O transfer rates in the 20 megabytes per second range (most high performance disk transfer rates are in the 6 megabytes per second range).  This means that for high volume event files where sequential processing is the preferred method, dramatically cheaper tape becomes the preferred media for storage, not just a compromise forced by economics.  This can offer significant potential for dramatic savings over an MPP hardware approach that would dictate DASD-based secondary storage.

Recently released new features of MVS also make the mainframe environment the most attractive choice as the backbone server for a client/server network.  Especially important facilities are the LANRES and LANSERVER MVS functions.  Files generated by selection functions, either directly from DB2 or Geneva V/T, can be stored in a common workstation-oriented file repository on mainframe attached DASD or automatically downloaded to designated servers on the WAN/LAN.  EBCDIC/ASCII translations are performed automatically and NetWare users can access mainframe storage devices just like any other NetWare disk volume.  Options for accessing mainframe files through the Internet are even available.

*          *          *

In summary, the mainframe world has changed.  Instead of being something to run away from because of cost, absence of software, or lack of connectivity and GUI options, it now is in principle the server platform of choice.  No doubt there will be many situations where other server or specialty processor platform choices will be appropriate.  But, instead of “getting off the mainframe” being the going in position, “exploiting the already existing mainframe infrastructure” should be the common sense point of departure when evaluating technical platform alternatives for parallel processing problems.

[1]DB2 does take advantage of some parallel processing techniques under certain circumstances.

Original Image

The following are images of the original paper.