Conventional wisdom says that you shouldn’t compare apples and oranges and that’s also a best practice you should be following when programming with SQL. Avoiding apples and oranges comparisons with SQL comes down to coding search predicates (CompanyNameCol = 'IBM') where the two values being compared are the same in terms of data type and precision.
In the cases
where the search value differs from the column definition, SQL provides a rich
set of built-in functions to make it easy to turn an apples and oranges comparison
into an apples-to-apples comparison. When using these built-in functions on
comparisons, you need to be aware that there is a right way and a wrong way to
using these functions. Both ways will functionally work, but the wrong way will
result in subpar performance and more system resources being consumed.
A couple of
recent SQL performance reviews with clients revealed that there are way too
many developers using built-in functions the wrong way. The following example
contains a WHERE clause found in a recent Technology Services engagements where
SQL built-in functions are used the wrong way on search predicates to make sure
that like values are being compared (i.e., apples to apples)
SELECT * FROM SomeTable WHERE TO_NUMBER(TO_CHAR(Last_Change_TS, 'YYYYMMDDHH24MI')) > 202205101702.000000000000000 AND TO_NUMBER(TO_CHAR(Last_Change_TS, 'YYYYMMDDHH24MI')) <= 202205102102.000000000000000
The Last_Change_TS
column was defined in the client’s table with the timestamp data type, so the
SQL request was coded to utilize a combination of SQL built-in functions to
convert the timestamp value into a numeric value that is directly comparable
with the numeric search value.
While these
search predicates generated the correct result set, performance was not
acceptable due to how the functions were utilized. Placing the functions on the
left-hand side of the comparison means that Db2 must convert every timestamp
value in the Last_Change_TS column into a numeric value before the comparison
can be made with the search value. The client’s table contained over 200
million rows, so that means this numeric to timestamp conversion was being
performed over 200 million times each time the query was run. All these conversions
consumed a significant amount of CPU resources on their system.
Not only does
this “wrong way” SQL coding slow performance with CPU intensive data
conversions - to make matters worse, it prevents the Db2 query optimizer from
using a “normal” index to quickly filter and return the rows that meet the
specified search criteria. The client’s queries were often returning less than
100 rows, but the poor SQL coding practice was forcing all of the rows to be
searched. An index with a leading key of
LAST_CHANGE_TS would limit the query processing to only the rows meeting the
search criteria.
Now, you might
be thinking that I could create a derived key index to address the performance issues of
this “wrong way” SQL coding. While that is a viable consideration, there are a
couple of items to think about. First of all, a derived key index cannot be
created for this example query. Not all SQL built-in functions are created
equal - the TO_CHAR function is implemented as a Db2 user-defined function
which means that TO_CHAR cannot be referenced on an index key definition.
Second, even if the functions were allowed on the key definition – you’re
creating a derived key index that may only provide performance benefits to a
small number of queries on the system. Third, you’re paying the cost of that
data conversion each time that a derived key index is maintained on the
system.
It’s okay to
create derived key indexes to improve query performance, but I’d recommend
saving those for situations where the SQL statement cannot be changed or it’s
very difficult to change.
So, the easiest way to remember to use SQL functions the “right way” on search predicates is to always code the SQL functions on the right-hand side of the predicate. The following shows how the client query was rewritten to utilize the functions to make the right-hand side of the predicate directly comparable:
SELECT * FROM SomeTable WHERE Last_Change_TS > TIMESTAMP(CHAR(202205101702)) AND Last_Change_TS <= TIMESTAMP(CHAR(202205102102))
Query
performance improved dramatically because the numeric to timestamp conversion
was performed twice instead of millions of times and a “normal” index over the
Last_Change_TS column was used to quickly find the rows meeting the search
criteria. For best practices on
indexing, check out our indexing strategy white paper.
Hopefully, you
now have a good understanding of the right way and wrong way to use SQL
functions to convert an apples to oranges comparison to an apples-to-apples
comparison. And by the way, if you’ve never
played the Apples to Apples game, it is a fun game to play at parties and family
gatherings.
Nice, easy to use tip! Code review coming, thanks!
ReplyDelete