Monday, August 15, 2022

The Right Way to Compare Apples & Oranges with SQL

 Conventional wisdom says that you shouldn’t compare apples and oranges and that’s also a best practice you should be following when programming with SQL. Avoiding apples and oranges comparisons with SQL comes down to coding search predicates (CompanyNameCol = 'IBM') where the two values being compared are the same in terms of data type and precision. 

In the cases where the search value differs from the column definition, SQL provides a rich set of built-in functions to make it easy to turn an apples and oranges comparison into an apples-to-apples comparison. When using these built-in functions on comparisons, you need to be aware that there is a right way and a wrong way to using these functions. Both ways will functionally work, but the wrong way will result in subpar performance and more system resources being consumed. 

A couple of recent SQL performance reviews with clients revealed that there are way too many developers using built-in functions the wrong way. The following example contains a WHERE clause found in a recent Technology Services engagements where SQL built-in functions are used the wrong way on search predicates to make sure that like values are being compared (i.e., apples to apples)

SELECT * FROM SomeTable    
WHERE TO_NUMBER(TO_CHAR(Last_Change_TS, 'YYYYMMDDHH24MI'))
                                > 202205101702.000000000000000
          AND TO_NUMBER(TO_CHAR(Last_Change_TS, 'YYYYMMDDHH24MI'))
                               <= 202205102102.000000000000000

The Last_Change_TS column was defined in the client’s table with the timestamp data type, so the SQL request was coded to utilize a combination of SQL built-in functions to convert the timestamp value into a numeric value that is directly comparable with the numeric search value. 

While these search predicates generated the correct result set, performance was not acceptable due to how the functions were utilized. Placing the functions on the left-hand side of the comparison means that Db2 must convert every timestamp value in the Last_Change_TS column into a numeric value before the comparison can be made with the search value. The client’s table contained over 200 million rows, so that means this numeric to timestamp conversion was being performed over 200 million times each time the query was run. All these conversions consumed a significant amount of CPU resources on their system.

Not only does this “wrong way” SQL coding slow performance with CPU intensive data conversions - to make matters worse, it prevents the Db2 query optimizer from using a “normal” index to quickly filter and return the rows that meet the specified search criteria. The client’s queries were often returning less than 100 rows, but the poor SQL coding practice was forcing all of the rows to be searched.  An index with a leading key of LAST_CHANGE_TS would limit the query processing to only the rows meeting the search criteria.

Now, you might be thinking that I could create a derived key index to address the performance issues of this “wrong way” SQL coding. While that is a viable consideration, there are a couple of items to think about. First of all, a derived key index cannot be created for this example query. Not all SQL built-in functions are created equal - the TO_CHAR function is implemented as a Db2 user-defined function which means that TO_CHAR cannot be referenced on an index key definition. Second, even if the functions were allowed on the key definition – you’re creating a derived key index that may only provide performance benefits to a small number of queries on the system. Third, you’re paying the cost of that data conversion each time that a derived key index is maintained on the system. 

It’s okay to create derived key indexes to improve query performance, but I’d recommend saving those for situations where the SQL statement cannot be changed or it’s very difficult to change.

So, the easiest way to remember to use SQL functions the “right way” on search predicates is to always code the SQL functions on the right-hand side of the predicate. The following shows how the client query was rewritten to utilize the functions to make the right-hand side of the predicate directly comparable:

SELECT * FROM SomeTable    
WHERE Last_Change_TS > TIMESTAMP(CHAR(202205101702))
          AND Last_Change_TS <= TIMESTAMP(CHAR(202205102102))

Query performance improved dramatically because the numeric to timestamp conversion was performed twice instead of millions of times and a “normal” index over the Last_Change_TS column was used to quickly find the rows meeting the search criteria.  For best practices on indexing, check out our indexing strategy white paper.

Hopefully, you now have a good understanding of the right way and wrong way to use SQL functions to convert an apples to oranges comparison to an apples-to-apples comparison.  And by the way, if you’ve never played the Apples to Apples game, it is a fun game to play at parties and family gatherings.