This entry’s title is derived from the Catch Me If You Can movie – a good one if you haven’t seen it yet. This movie is based on the true story of a con man who takes on the identity of a pilot, doctor and lawyer as he steals millions of dollars by cashing fraudulent checks. You may be wondering how a story like this relates to Db2 for i since none of the con’s fake identities were IT professions like programmer or database engineer. However, Db2 for i database tables can take on an identity in the form of an identity column.
CREATE TABLE employee ( empno INTEGER GENERATED ALWAYS AS IDENTITY (START WITH 10 INCREMENT BY 10), name CHAR(30), deptno CHAR(4))
The
prior example didn’t include the CACHE setting for the identity column which
defaults to 20 when the CACHE option is not specified. Here’s the updated
version of our simple example to include the default setting for the CACHE
option.
CREATE TABLE employee ( empno INTEGER GENERATED ALWAYS AS IDENTITY (START WITH 10 INCREMENT BY 10 CACHE 20), name CHAR(30), deptno CHAR(4))
The default cache setting of 20 results in Db2 generating the next 20 values for that column and storing those generated values in memory. As new rows are written to this employee table, the generated values for the identity column are taken from the cached values in memory. This means that after 20 rows are inserted into the table, Db2 will have to generate and cache another 20 values in memory.
If the
table with the identity column has a low velocity rate of inserts, then the
default cache of 20 generated values will perform pretty well. However, what if the table has a high insert
velocity? That means that in the case
where the table is the target of a high rate of inserts, Db2 will have to pause
after 20 rows are inserted to generate values for the next 20 rows. If you have
a batch process that inserting hundreds of thousands of rows like one of our
recent IBM Technology Expert Labs clients, pausing after every 20 rows is not
very efficient. The client that our IBM
Technology Expert Labs team assisted was able to speed up the writes of all
those rows to the table by increasing the identity column cache size to 1024.
A larger cache size can be specified using the CACHE clause when a table is initially being created or after the fact with an ALTER TABLE statement like the following:
ALTER TABLE employee ALTER COLUMN empno SET GENERATED ALWAYS AS IDENTITY (START WITH 10 INCREMENT BY 10 CACHE 100)
The
obvious question to ask at this point is what’s the downside of using a larger
cache size. One tradeoff is memory. However, most IBM i servers have plenty of
memory to use these days … so you shouldn’t worry too much about the memory
needed to store 20 generated numeric values versus 100 numeric values. Another
consideration is the potential for cached generated values being lost with a
system crash. System failures are a
pretty rare event in the IBM i world, but if a crash did occur all the cached
generated values for an identity column will be lost.
Let’s
say that this simple employee table was created with an identity column cache
size of 100 and that a system failure occurs after first 50 employees have been
added to the table. After the system
IPL, the next employee number value will be 1010 and not 510. This gap in occurs because the unused
generated values (i.e., 510-1000) cached in memory were lost during the system
crash.
This gap
potential may seem concerning, but don’t forget your identity columns will
often have gaps in their values without a system crash. For example, if a row
is deleted, then that generated identity value is not reused. If an Insert
statement running under commitment control gets rolled back, the generated
identity value for the rolled back Insert is also not reused.
Caching in the IT world can often provide benefits, so I hope that you’ve learned how caching can be beneficial with Db2 for i identity columns. If you’re interested in learning about other SQL performance tuning techniques like this one, IBM is hosting a Db2 for i SQL Performance enablement workshop December 10-12 at its facility in Montpellier, France that I will be helping lead. Contact me if you're interested in additional details.
Good article, ... but I think you should also add, because of caching there is no guarantee of uniqueness. For example when restarting manually or the maximum number is reached and a cycle is started.
ReplyDeleteFor guaranteeing uniqueness a primary or unique key should be implemented on Identity Columns.
Thanks. Yes, unique key constraint or unique index are always needed to guarantee the uniqueness of an identity column. However, that's true whether or not the Cache setting is used. The next set of sequential values for the column are stored in the cache - it contains no duplicate values.
ReplyDeleteThanks. From a design standpoint, I think that directly coupling a table with an internal identity id is ok and usable for surrogate/technical keys but for natural/business keys can present problems in normal operations and pitfalls. Sequence objects provide surely more visibile control and design tradeoffs. For business objects with legal requirements requiring number continuity like i.e. invoices probabily a number table managed with custom program is best tradeoff (i.e. a crashing invoice program doesn't leave "holes" behind).
ReplyDeleteMaybe enlighten us on IDENTITY vs SEQUENCE in term of performance, implementation etc. :D ?
Sequence objects do offer more flexibility in that you can use them to generate alphanumeric key values. I'll have to run some tests to compare the performance of Identity columns and Sequence objects.
Delete