Friday, December 9, 2022

Confounding and Constricting Concatenations

 This month’s title is a bit of a tongue twister, but the title does capture a problematic SQL coding practice with concatenations that has been popping up in recent SQL Performance Assessments performed by our IBM Technology Services team. Speaking of tongue twisters, did you know that MIT researchers developed the toughest tongue twister in the world of “Pad kid poured curd pulled pod”? I have no idea what this phrase means, but I do know about the confusion and poor performance that can be caused by suboptimal usage of the SQL concatenation support. 

Let’s start by looking at the concatenation coding practices that we’ve been finding when analyzing customer’s SQL requests. The following two SELECT statements contain the subpar coding pattern of comparing the concatenation of two column values with the concatenated result of two other columns. This concatenated comparison shows up on the join condition of the first query and the WHERE clause of the second query.

SELECT * 
  FROM tab1 a INNER JOIN tab2 b
   ON a.c1 || a.c2 = b.c1 || b.c2


SELECT * FROM tab1 a
   WHERE CONCAT(a.c1, a.c2) IN
           (SELECT CONCAT(b.c1, b.c2) FROM tab2 b)

 

The alternative to the concatenated column comparisons is comparing the individual columns and then logically ANDing the results of those individual comparisons (e.g., a.c1 = b.c1 AND a.c2 = b.c2). When comparing the two approaches, I believe that the individual column comparisons are easier to read and understand. While this opinion can be debated as programmer preference, there’s no debating that the concatenated column comparison will result in slower query performance.

 

There are a couple of reasons that this concatenation coding practice is a poor performer. First of all, comparisons that include an expression or derivation severely limit the query optimizer’s ability to estimate how many rows in the table will be selected or processed. When this number of rows estimate generated by the query optimizer is inaccurate, there is a greater chance for poor query performance. Second, the concatenated columns comparison prevents the use of normal indexes to speed up the query execution. Indexes are often the fastest way to perform the specified column comparison, but normal indexes are not an option due to the concatenated expressions in these two examples. Yes, a derived key index could be created that includes the concatenation in the column example. However, there’s a good chance that you already have a normal index created over the columns being concatenated. 

 

Now that you understand the drawbacks of the concatenated column comparisons, let’s look at the optimal way of coding these two example queries. As you can see, the join query can be rewritten in one of two ways to improve performance and readability.

SELECT * 
  FROM tab1 a INNER JOIN tab2 b
   ON a.c1 = b.c1 AND a.c2 = b.c2

SELECT * 
  FROM tab1 a INNER JOIN tab2 b
   ON (a.c1, a.c2) = (b.c1, b.c2)


The second join query utilizes a row value expression to simplify the join condition by removing the need to include the logical AND operator. These two queries will perform exactly the same, so this is strictly a programming style preference when deciding which syntax to use.

 

The rewrite of our second example SELECT statement also uses the row value expression syntax to eliminate the concatenated column comparison. 

SELECT * FROM tab1 a
  WHERE (a.c1, a.c2) IN 
   (SELECT b.c1, b.c2 FROM tab2 b)

Like many programming languages, SQL offers more than one way to get your work done. While this flexibility can be a good thing, you have also now learned that not all SQL solutions are created equal in terms of performance.

This is my last entry for 2022 - have a Merry Christmas & wonderful holiday season and we’ll talk in 2023!