Jul 29, 2024 2 min read data engineering

Keys, Keys Everywhere: Understanding Primary Keys

Recently, I have been writing an internal jargon buster for data engineering terms and concepts. Many industries have their own language, and one of the first speedbumps when learning anything is getting through that language barrier (which, depending on insularity/snottiness at times, maybe easy or hard to do). I was writing a definition for Normal Forms, when I realised I would also need a definition for keys. But how best to explain all the different types of database keys?

Firstly, a key is a concept more than it is a nameable object - the point of keys are to allow us to understand how data in different tables (relations) fit together, so we can use them to represent the truth. Keys ultimately help you identify specific rows in the data, ensuring integrity (such as avoiding duplicates) and informing you how it might connect to other tables. When we eyeball a table, we also do this mentally - a formally enforced key system is a way of making sure everyone using the data does this the same way, and agrees on what the truth is.

This is particularly important when trying to remove data redundancy from the design of a database, because as humans, keeping things readable for us often means we store the same information more than once.

Primary Key

The first one we often encounter, a Primary Key (PK) uniquely identifies the observations (aka "row" or "tuple") within a table (aka "relation), within a database schema. A PK can be a single field, or made up of more than one field, but it/they cannot be null and it cannot be duplicated - there can also only be one PK per table. As such, PKs are often used to guarantee database integrity, particularly when adding new data or updating an existing row.

A PK is often assumed to exist within a fact or fact-like data source (one which seems to be recording events and observations, as opposed to a dictionary like 'dimensions' data source) but this assumption is well worth checking. A PK is best enforced as a constraint on the table, to ensure those integrity benefits are realised, but that means the PK has to be set already (ideally when the table is created).

...why wouldn't there be a Primary Key?

Although it feels like the "only" way of doing things, Primary Keys, Normal Forms and all that are part of a database design framework (a Star Schema) - which is a choice to follow. There are other frameworks, like One Big Table (OBT) which pretty much throws everything into... one big table and then works out smaller, aggregated views and queries afterwards - they can still use key systems, but often in quite a different manner. Fivetran has an interesting blog where they compare performance of cloud data services with different schemas and they found fairly significant performance differences between the schema designs on different platforms. Primary Keys do require computational power/energy to enforce, and if using a SQL flavour which cannot generate its own key references (or surface them for someone to use), it might feel like a faff if you just need to get something done. Of course, if you do everything quick and dirty you end up with a mess, and that equally applies to database modelling and management, but that's humans for you.

FragileData

A data oik in chaos central.

London