[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [ProgSoc] NULL values in primary keys in MySQL



Myles Byrne wrote:
If we’d used the ISBN as the primary key in a table of books, we’d have to
go through and update each row to reflect this change. But then we’d have
another problem. There’ll be other tables in the database that reference
rows in the books table via the primary key. We can’t change the key in the
books table unless we first go through and update all of these references.
And that will involve dropping foreign key constraints, updating tables,
updating the books table, and finally reestablishing the constraints. All in all, something of a pain.

This is the red-herring I referred to RE: Jedd's question about immutable data. The above glosses over real (as opposed to imagined) complexity, possibly lulling you into a false sense of security. (Although I use surrogate keys for much the same reason, that is, so that I can have a 'generic' keying infrastructure that is not at the mercy of domain specific data-types/formats. Changing values of natural keys is a separate matter, though it is also related to surrogate vs. natural keys.)


Adding the indirection of surrogate keys does make it easier to modify the natural key data, but that only makes it easier for you to completely bollocks up your database, because if you're mutating the data in candidate keys (i.e. the 'natural keys', likely being externalised and used as keys in the problem domain, e.g. "Hello, welcome to Acme Inc. Can I please have your account number sir?") then it might take you a little while before you start to begin feeling the world of pain you've unleashed upon yourself, but you will eventually feel it (someone will come to your office and ask you to regenerate last months report, or they'll come and enquire why having just re-run the end-of-year report the one they see now differs from the one that the CEO released to share-holders last week, etc.).

Also, if you let surrogate keys 'leak' out of the system into domain usage then you will create a problem. E.g. "Can I please have your Account Number sir? ABC. No, I'm sorry sir, that looks like an Account Code, you should have your Account Number on the top of the TPS report. Oh, 456. Um, sir, I think you've given me your Account Id, I need your Account Number." etc. (On a related note, I recently signed up with Melbourne IT for a simple SMTP hosting package, and by the time I was finished I'd been given at least 4 usernames.)

The book I referred to provided definitions for these orthogonal 'types' of of things [1]. If you think surrogate keys make changing natural keys easy, and that's important to you, then presumably you care because you're mutating natural keys, in which case you'd better understand the difference between all of these concepts, or else it's time to start reading [2]. :P

Some databases can afford to re-write history, but most serious apps need to know things like "this person's last name used to be Smith, but as of the 12th it's Bloggs, according to our call-centre operator Mary who logged a Change Of Details form on behalf of the customer last week", or "account number 123 was active on the 23rd when this transaction occurred, but de-activated on the 30th by Joe in Operations because he found that the person whose last name used to be Smith fraudulently claimed they'd been married and changed address", etc.

Accountants learnt the virtues of double-entry book-keeping and immutable data a long time ago. Time-series data and 'version control systems' can be difficult, but if you need that functionality, then you need to tackle the complexity.

The real problem with mutations to natural keys is that usually changing the value of the natural key changes the semantics of that entity in the problem domain, meaning, often, that entity no longer represents what it used to represent, meaning that any historical facts referring to the 'old' meaning now pick up the 'new' meaning, which might turn fact into fiction. For example, say I have a schema like this:

 CATEGORY( CAT_ID, NAME )
 QUESTION( QUE_ID, CAT_ID, TEXT )
 ANSWER( ANS_ID, USER_ID, QUE_ID, VALUE )

Say I have some data like this:

 CAT_ID   NAME
 ------   ----
 1        Category A

 QUE_ID   CAT_ID  TEXT
 ------   ------  ----
 1        1       What is your favourite colour?

 ANS_ID   USER_ID  QUE_ID  VALUE
 ------   -------  ------  -----
 1        123      1       Blue
 2        456      1       Red

We're using a surrogate key QUE_ID on the QUESTION table, though { CAT_ID, TEXT } is a natural key, thus a candidate key.

So, seeing as though it's now *so easy* for me to do so, why wouldn't I want to do something like this..?

 UPDATE QUESTION
 SET ( TEXT = 'Is blue your favourite colour?' )
 WHERE QUE_ID = 1


[1]

Temporal data-types:

 - Instant
 - Interval
 - Period

Temporal domains:

 - User-defined time
 - Valid time
 - Transaction time

Temporal statements:

 - Current
 - Sequenced
 - Non-sequenced

[2] http://www.cs.arizona.edu/people/rts/tdbbook.pdf





-
You are subscribed to the progsoc mailing list. To unsubscribe, send a
message containing "unsubscribe" to progsoc-request@xxxxxxxxxxxxxxxxxxx
If you are having trouble, ask owner-progsoc@xxxxxxxxxxxxxxxxxx for help.