Uh-oh… Semantic Alert. The following post expresses my opinion on a thorny issue. But, I thought I’d put my take on it.
The Key Choices
When you construct a database, you often have to make decisions based on incomplete information. Oftentimes you’re given a small subset of data and you have to decide what kind of datatype goes where, how entities naturally link to each other and when to denormalize.
There’s one debate when it comes to building out tables for databases and that’s what do you use as the key. Essentially, you have two choices.
- A unique code that represents the item ( called a natural key )
- An incrementing integer that has no connection to the item. It’s just a number that goes up and up and up and up…
The great thing about using numbers is you’ll never run out of them. Yay for future proofing your database. A 64-bit number can go up to 9,223,372,036,854,775,807 ( that’s 9.2 pentillioin, by the way ) so you’re pretty good. If you added one record to your table every microsecond, you wouldn’t run out for 292 million years.
This is one reason experienced database admins say just use an incrementing integer. The problem with numbers is that the numbers chosen have no intrinsic meaning other than the order in which they were created. So, when you get to record number 125,341 it doesn’t mean anything. It can be harder to deal with. After all, that number REPRESENTS a piece of data. Wouldn’t it be better if that was more than just data, if the key itself conveyed useful information?
To be fair, sometimes there are really good natural keys. Like, how about a UPC? Those are unique, right? How about a Stock Ticker? There can be only one of those, right? The VIN on my car! That’s unique right? These keys convey meaning and they’re unique. Why not use one of these instead of an incrementing integer that has no intrinsic meaning?
The problem with Natural Keys
Natural Keys are rarely ACTUALLY unique. We may think something like a UPC code or a VIN is unique, but either through human error, accidental duplication or other reasons that we never thought about it’s not. We may think a Stock ticker is unique, until we realize that in other markets, those stock exchanges have used the same three letters to represent a stock. Maybe we say that’s not a problem, but it is if somehow something like the internet comes along. Now, how do you look up a stock like
XEC on the TSX. On the NYSE, it’s an energy company’s stock ticker, on the TSX, it’s an ETF that represents Emerging markets. These stock tickers aren’t that much of a big deal as these markets are separate, but if you’re building a Stock checker to record the history of an equity, and want to put that in your database you probably don’t want to to use ONLY the ticker, as you can buy from various markets with the touch of a button.
When a key LOOKS unique, it may not be.
The UPC-A standard is regulated by a 3rd party and stands for “Universal Product Code.” You apply to use a UPC and you’re not supposed to make up one of your own. At times, however, a UPC does NOT represent a unique entity. Take, for example, a can of baked beans:
|0-12345-00001-0||Granny’s Baked Beans||Maple|
The UPC that represents a single can of Granny’s Maple-flavored baked beans is supposed be unique. No other product should use that code. If Granny wants to sell one with bacon-bits alongside it, it would need its own UPC, If Granny wants to sell a large-size version with the exact same product, she’s gonna need another UPC, if Granny wants to sell the same product with different language labelling… well you get the point.
It seems like the UPC is a good candidate for a Unique Key. So, you start a database to track all your production of Granny’s Baked Beans and things are going really well. You build a manufacturing plant and maybe even add a few flavors.
In your database, you see that 12-digit number and say to yourself… yes an easily readable and understandable natural key.
Change is tough to anticipate
One day, you get a report that the FDA is going to be disallowing a particular chemical in foods. All foods that have solvabuttfacinoligan must stop using it. Uh-oh, solvabuttfacinoligan is the exact ingredient that gives Granny’s baked beans their addictive quality. You have to recall and change the formulation right away!
The Conundrum… the choices
You have two choices here. Create a new product without solvabuttfacinoligan or remove solvabuttfacinoligan from your existing product. Creating a new product is easy, reformulate, give it a new UPC and resell it to the stores. You already have a formula that you’ve been working on that doesn’t have solvabuttfacinoligan and early tests have shown people like them even better! That’s a clean way to deal with the problem, and you’ve got 99,999 UPCS available to you.
|0-12345-00001-0||Granny’s Baked Beans||Maple||Discontinued|
|0-12345-00001-0||Granny’s Even Better Baked Beans||Maple||NEW|
Here’s the problem, though. The FDA hasn’t made this a public issue. They’ve simply decided to send a letter to companies asking them to remove the solvabuttfacinoligan from their products ASAP. In addition, a new product means you have to renegotiate with larger big-box stores like Costco to list this new SKU. In addition, a new product means all these stores will have to change all their internal systems. Finally, a new product may alienate consumers who loved the original product.
So, you decide to reformulate the existing product and keep the same UPC. Uh-oh… you probably know where this is going. You want to accept returns of any of the old products and you want to update the labelling on the product, but you really have two products now, sharing one UPC. You have to keep data for old and new product separate so you can show the FDA if they come knocking, but you used the UPC as a key. You want to store the new manufacturing technique in your database as well as the different nutritional data in your database but you used the UPC as a key. Since you shouldn’t create a duplicate Primary keys in a database you’re stuck.
Unique keys aren’t tied to anything other than the database row they represent
The solution is not to use a natural key. Although there are some advantages when it comes to readability, If you’d just used a Unique number not connected to the product, then it wouldn’t be a problem. Just duplicate the product with a new KEY using the same UPC and change the ingredient list and instructions on how to manufacture and any other linked data.
Sometimes we go against conventional wisdom thinking we know better. However, what we often realize is that certain things are done a certain way because people before you have had to learn the hard way. Since it’s impossible to know exactly how your database will be used in the future, it’s good to allow as much ambiguity in how the data is stored as possible.
Avoiding natural keys when possible is just one way you can future proof your data.