Monday, September 16, 2024

Cache Identity If You Can

 This entry’s title is derived from the Catch Me If You Can movie – a good one if you haven’t seen it yet. This movie  is based on the true story of a con man who takes on the identity of a pilot, doctor and lawyer as he steals millions of dollars by cashing fraudulent checks.  You may be wondering how a story like this relates to Db2 for i since none of the con’s fake identities were IT professions like programmer or database engineer. However, Db2 for i database tables can take on an identity in the form of an identity column.

 An identity column allows you to have Db2 automatically generate sequential numeric values for the column as new rows are inserted into the table. Thus, identity columns are ideal to use as a surrogate primary key or natural key columns like order number. Here’s a simple example an employee table using an identity column to generate the employee number as new employees are inserted into the table.


CREATE TABLE employee (
     empno INTEGER GENERATED ALWAYS AS IDENTITY
           (START WITH 10 INCREMENT BY 10),
     name CHAR(30), deptno CHAR(4))

 

The prior example didn’t include the CACHE setting for the identity column which defaults to 20 when the CACHE option is not specified. Here’s the updated version of our simple example to include the default setting for the CACHE option.


CREATE TABLE employee (
     empno INTEGER GENERATED ALWAYS AS IDENTITY
           (START WITH 10 INCREMENT BY 10 CACHE 20),
     name CHAR(30), deptno CHAR(4))


The default cache setting of 20 results in Db2 generating the next 20 values for that column and storing those generated values in memory. As new rows are written to this employee table, the generated values for the identity column are taken from the cached values in memory. This means that after 20 rows are inserted into the table, Db2 will have to generate and cache another 20 values in memory.

 

If the table with the identity column has a low velocity rate of inserts, then the default cache of 20 generated values will perform pretty well.  However, what if the table has a high insert velocity?  That means that in the case where the table is the target of a high rate of inserts, Db2 will have to pause after 20 rows are inserted to generate values for the next 20 rows. If you have a batch process that inserting hundreds of thousands of rows like one of our recent IBM Technology Expert Labs clients, pausing after every 20 rows is not very efficient.  The client that our IBM Technology Expert Labs team assisted was able to speed up the writes of all those rows to the table by increasing the identity column cache size to 1024. 

 

A larger cache size can be specified using the CACHE clause when a table is initially being created or after the fact with an ALTER TABLE statement like the following:


ALTER TABLE employee   
  ALTER COLUMN empno
    SET GENERATED ALWAYS AS IDENTITY
    (START WITH 10 INCREMENT BY 10 CACHE 100)

 

The obvious question to ask at this point is what’s the downside of using a larger cache size. One tradeoff is memory. However, most IBM i servers have plenty of memory to use these days … so you shouldn’t worry too much about the memory needed to store 20 generated numeric values versus 100 numeric values. Another consideration is the potential for cached generated values being lost with a system crash. System failures are a pretty rare event in the IBM i world, but if a crash did occur all the cached generated values for an identity column will be lost.

 

Let’s say that this simple employee table was created with an identity column cache size of 100 and that a system failure occurs after first 50 employees have been added to the table. After the system IPL, the next employee number value will be 1010 and not 510.  This gap in occurs because the unused generated values (i.e., 510-1000) cached in memory were lost during the system crash. 

 

This gap potential may seem concerning, but don’t forget your identity columns will often have gaps in their values without a system crash. For example, if a row is deleted, then that generated identity value is not reused. If an Insert statement running under commitment control gets rolled back, the generated identity value for the rolled back Insert is also not reused. 

 

Caching in the IT world can often provide benefits, so I hope that you’ve learned how caching can be beneficial with Db2 for i identity columns.  If you’re interested in learning about other SQL performance tuning techniques like this one, IBM is hosting a Db2 for i SQL Performance enablement workshop December 10-12 at its facility in Montpellier, France that I will be helping lead. Contact me if you're interested in additional details.

4 comments:

  1. Good article, ... but I think you should also add, because of caching there is no guarantee of uniqueness. For example when restarting manually or the maximum number is reached and a cycle is started.

    For guaranteeing uniqueness a primary or unique key should be implemented on Identity Columns.

    ReplyDelete
  2. Thanks. Yes, unique key constraint or unique index are always needed to guarantee the uniqueness of an identity column. However, that's true whether or not the Cache setting is used. The next set of sequential values for the column are stored in the cache - it contains no duplicate values.

    ReplyDelete
  3. Thanks. From a design standpoint, I think that directly coupling a table with an internal identity id is ok and usable for surrogate/technical keys but for natural/business keys can present problems in normal operations and pitfalls. Sequence objects provide surely more visibile control and design tradeoffs. For business objects with legal requirements requiring number continuity like i.e. invoices probabily a number table managed with custom program is best tradeoff (i.e. a crashing invoice program doesn't leave "holes" behind).
    Maybe enlighten us on IDENTITY vs SEQUENCE in term of performance, implementation etc. :D ?

    ReplyDelete
    Replies
    1. Sequence objects do offer more flexibility in that you can use them to generate alphanumeric key values. I'll have to run some tests to compare the performance of Identity columns and Sequence objects.

      Delete