Thursday, February 16, 2023

Stop Before Transforming Your Data Yet Again

Assessing the performance of the SQL and queries running on your IBM i systems is the most popular service that our Db2 for i team in Systems Expert Labs performs. One thing I frequently notice when doing this work is that many of a client's queries and reports perform data transformations. These transformations are required to get the data into a form that end users can understand and digest. They range from converting legacy data values into SQL dates, to eliminating leading blanks in character columns, to aggregating data.
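As a sketch of what I mean (the table and column names here are hypothetical, but the transformation patterns are ones I see all the time), a single report query often repeats several of these transformations at once:

```sql
-- Hypothetical report query: three transformations done inline, on every run
SELECT TRIM(LEADING FROM cust_name) AS customer,      -- strip leading blanks
       DATE(TIMESTAMP_FORMAT(CHAR(order_ymd),
                             'YYYYMMDD')) AS order_date, -- legacy numeric date to SQL DATE
       SUM(sale_amt) AS total_sales                   -- aggregation
FROM   sales
GROUP  BY TRIM(LEADING FROM cust_name),
          DATE(TIMESTAMP_FORMAT(CHAR(order_ymd), 'YYYYMMDD'));
```

When dozens of reports each contain a variation of this query, the system pays for the same trimming, date conversion, and aggregation work over and over.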

The transformations themselves are not an issue. The issue is the number of times that they are done and the number of queries that contain the same transformation code. As a result, these repeated transformations leave clients with both performance and code maintenance headaches.

This transformation overhead reared its head on a recent client engagement where the client was experiencing system performance issues due to reporting workloads straining their system. Our analysis showed that a significant number of reports referenced the same SQL view. One transformation in this view was the aggregation of millions of rows of sales data. This same expensive data transformation was occurring hundreds of times a day on their system. As it turned out, the analysts consuming these reports did not need the latest sales data being aggregated for their work. The client addressed their performance woes by scheduling the needed aggregation of the sales data and changing the SQL view to reference the pre-aggregated data set. System performance improved dramatically after this repeated data transformation was reduced to a single daily event. After the change, their Analytics SME remarked that "the performance change of our reports was quite literally astounding."

This client avoided the code maintenance burden of having the same data transformation code duplicated across many queries through the use of an SQL view, but I see many clients with the same transformation code repeated in different queries. While SQL views enable code reuse, they don't address the performance overhead of multiple queries performing the same data transformations over and over again on your system.

Again, you could free up a lot of your system resources by reducing the number of times that these data transformations have to be performed. Schedule them to run once (or a couple of times) a day to get a copy of your data into a format that is more easily consumed by your reporting and analytic workloads. This data copy that enables easier and faster analytics is often referred to as a data mart or data warehouse.
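A minimal sketch of this approach, using hypothetical table and view names, would be a scheduled job that rebuilds a summary table once a day, with the reporting view pointed at the pre-aggregated copy instead of the raw detail:

```sql
-- Hypothetical nightly job: rebuild the pre-aggregated copy once
DELETE FROM sales_summary;

INSERT INTO sales_summary (sale_date, region, total_sales)
  SELECT sale_date, region, SUM(sale_amt)
  FROM   sales
  GROUP  BY sale_date, region;

-- Reports reference a view over the pre-aggregated data,
-- so the expensive GROUP BY runs once a day instead of per report
CREATE OR REPLACE VIEW sales_report_v AS
  SELECT sale_date, region, total_sales
  FROM   sales_summary;
```

Because the reports already go through a view, repointing that view at the summary table means the report queries themselves don't have to change.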

There’s a good chance you have heard about data marts and data warehouses, but I’m guessing you haven’t heard of DataMigrator. This tool, which is part of the Db2 Web Query product suite, can help generate the SQL for your transformations, such as the aggregations commonly used in analytical applications, and schedule those transformations to run. It even allows you to copy just the changed data in your tables with its support for journals and remote journals as a source. You can learn more about DataMigrator, moving your data, and analytics in general from a new blog started by my teammate John Westcott. Check it out and learn more about how you can avoid the costs of those repeated transformations on your system.

2 comments:

  1. Dear Kent

    When you mentioned a pre-aggregated data set, I was thinking of using a Materialized Query Table (MQT). Is an MQT the most efficient tool to use in this case?

    1. MQTs are definitely an option for pre-aggregating data sets. How much you can benefit from the automatic rewrite of queries to use the MQT by the optimizer depends on your data currency requirements. There's a white paper on the MQT support located at this site: ibm.biz/db2iPapers
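      For readers unfamiliar with the syntax, a user-maintained MQT for the earlier sales example might look like this (table and column names are illustrative, not from the engagement):

      ```sql
      -- Hypothetical user-maintained MQT over the sales detail
      CREATE TABLE sales_summary_mqt AS
        (SELECT sale_date, region, SUM(sale_amt) AS total_sales
         FROM   sales
         GROUP  BY sale_date, region)
        DATA INITIALLY DEFERRED
        REFRESH DEFERRED
        MAINTAINED BY USER
        ENABLE QUERY OPTIMIZATION;

      -- Populate/refresh on your own schedule (e.g., nightly)
      REFRESH TABLE sales_summary_mqt;
      ```

      With ENABLE QUERY OPTIMIZATION, the optimizer can consider rewriting eligible queries to use the MQT, subject to your data currency settings.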
