
Re: [ProgSoc] NULL values in primary keys in MySQL



(I know I should proofread this before I hit send. ...but, I'm not going to. Have fun. :)

jedd wrote:
] The problem with
] this is that you have to configure all of your entity relationships
] while you are holding a database transaction open. This creates two
] problems: 1) resource contention in the database, 2) data-access code
] and business-logic code are necessarily conflated under this scheme.

 Ah, yes.  I'd pondered the implications of this stuff, but had been
 happy to mostly ignore it because of the low performance requirements
 of doing this, and the belief that the practical aspects of doing the
 data preparation work would be done via a single form (.php page if
 you prefer).  When that page is submitted, I'm happy to write-lock
 the table(s) involved, have some catches there for errors or timeouts,
 do the inserts, then unlock.  This stuff works very fast, and the
 performance costs involved with locking the whole table just aren't
 worth worrying about - they're that small.

In your situation, if I had to weight 1) and 2), I'd give 1) a relative worth of < 1%. All the value in GUIDs would be in a simplified programming model. Oddly enough, there is a minimum sophistication required before separating data-access and business-logic code becomes useful as a means of 'simplification'. I'm not sure that your application is sufficiently complex to worry too much about this. That is, you have a bunch of code in one location (the .php page) and it does the whole thing: retrieve data submitted via HTTP, run some sanity checks, verify authorisation, connect to the database and munge it in there. (I.e. you're not likely to care. It's fair to say that in a simple environment like this, GUIDs could introduce as much complexity as they remove.)
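(As for the write-lock-then-insert approach you describe: for what it's worth, the MySQL incantation would look something like the sketch below. Table and column names are invented.)

    -- Lock, insert, unlock (error/timeout handling omitted):
    LOCK TABLES accounts WRITE, transactions WRITE;

    INSERT INTO accounts (acc_name) VALUES ('shared pool');
    INSERT INTO transactions (trans_from, trans_to, amount)
        VALUES (1, 2, 10.00);

    UNLOCK TABLES;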


 OTOH, I do realise that this isn't the way you'd do it in the real
 world, because it simply doesn't scale.  Pragmatism .v. theory,
 though, rears its head here again.  Make the app work just as
 well as it needs to - but no more so - as I think you or Myles were
 advocating, in a different thread, a while back.

OK, I don't want to flog the pragmatism vs. theory horse. I'm in my own camp on this issue; I don't have either religion. I believe that a firm understanding of the theory will help you to be pragmatic. I believe there can be false economies in what some people call 'pragmatic' (i.e. "it's horrible, I know, but it doesn't need to be better"), and intangible, hard-to-quantify value in what people might like to write off as "purely academic" or "architectural purity".


(On a related note, I wouldn't flog 'simplicity' as my mantra, either. Certainly I love simple and elegant solutions, but often problems don't lend themselves to 'simplification', and in trying to 'compress' a solution more apparent complexity is introduced. (In this case 'compressing' would be something like using auto-incremented keys rather than GUIDs because it 'feels' simpler, even though in practice it creates complexities that otherwise wouldn't be there.) Wasn't it David Wheeler who said that "any problem in computer science can be solved by adding another layer of indirection"? (Except performance.. ;) It is this sin of 'over-compression' (pleading 'simplicity' and/or 'power') that I feel Ruby pundits are most guilty of. That's not to say, though, that Ruby doesn't have its place.)

So, in that context I'd say that the way I make design trade-offs is with development time vs. knowledge. Witness our recent discussion about imaging disks, for example. There was a 'practical' aspect to that centring around "how much time" there was. Then there was a 'theory' aspect (i.e. knowledge) with respect to what I already knew. So, my decision centred around making the right choice *in the circumstances*. I.e. you need to accept that you can't 'know' everything (not that bold assertion will stop me conducting a life-long empirical experiment to disprove this hypothesis :) but you need to take 'action' in a time-frame informed by what it is that you 'do' know. You're applying the same tactic, btw. (See your forthcoming admission that "[otherwise], it'll never get done.")

It is true that GUIDs provide scalability options that didn't use to exist. Because keys are allocated in a 'global' space, you don't necessarily need to rely on a 'centralised' authority to allocate them. If you had a massive number of transactions (e.g. amazon.com) then you wouldn't really want to be relying on 'one' database with 'one' table which you needed to get a lock on in order to insert a new record and be allocated your new key. Instead you could just allocate yourself a key and tell (any one of) the de-centralised data-stores that it needs to record this new transaction, and then later you aggregate this stuff, or split it off to where it needs to go, etc. (I should have been a LISP programmer.)

But that's not really the primary motivation for their use (in my view). They also introduce their own overhead, as you point out below (i.e. they 'waste' keyspace). So, they're not there for scalability or performance so much as to 'simplify' and 'generalise' the concern of 'identification'. That this generalisation incurs some overhead and offers some advantages doesn't bear on the fact that, of itself, the generalisation is useful as a means to simplification. (Simplification being 'managing complexity', not 'ignoring it' or 'pretending it doesn't exist'. Ahem. :)
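(Back to the de-centralised allocation point for a second, and just as a sketch with invented table/column names: MySQL has a UUID() function, so the 'allocate yourself a key' route needs no central counter, whereas the auto-increment route has to ask the database which key it was given.)

    -- De-centralised: the client (or UUID()) mints the key itself.
    INSERT INTO transactions (id, trans_from, trans_to, amount)
        VALUES (UUID(), @from_acc, @to_acc, 10.00);

    -- Centralised: the database allocates the key, and you ask for it afterwards.
    INSERT INTO transactions (trans_from, trans_to, amount)
        VALUES (@from_acc, @to_acc, 10.00);
    SELECT LAST_INSERT_ID();

(Two alternative schemas, obviously; the point is who gets to mint the identifier.)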

 If I was intending to start programming for dollars, or even more
 often than a few late nights every couple of months, then I'd also
 find it more compelling to worry more about the scalability of my
 designs.  Normally I'm a perfectionist with things I'm learning, but
 I know if I adopt that attitude with this stuff, it'll never get done.

Right. (It's not as if 'programming for dollars' changes this dynamic either, mind you.)


] ... allocatable (is that a word?), but they are a single column, they
] are the same data type, they are the same data format, and they are
] unique in a domain-wide space (i.e. if you stumble across a random guid,
] then the guid will *imply* its type if you look in the database across
] all PKs; whereas if you stumble across a random integer, it might occur

 Hmm .. this concerns me.  How often do you [guys] come across a
 piece of data and not know where it came from, what it means, or
 what you should do with it?  Or rather, how often does a programme
 do this -- and do you really code in handlers for random bits of data?

Please don't make me write this essay. :)

Just think about the 'polymorphism' which is possible when the means of identification is homogeneous. Then think about the keyword 'is' in Java or C#. That's what I was really talking about.

GUIDs are like pointers. (In fact, they are *exactly* pointers, but with associated meta-data). Ever passed a void pointer? (Or, more pertinently, ever been debugging and stumbled across 32 bits which sort of looked like a pointer and wondered to what it referred?)

Auto-incremented identity columns, on the other hand, are more like 'handles'.

GUIDs are pointers in a *universal* virtual address space. (Unlike 'pointers', which point to values in a process's virtual address space.)
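(To make the 'universal address space' point concrete, and everything below is invented, including the GUID: given a mystery GUID you can ask each table whether it owns it, precisely because every primary key lives in the same space.)

    -- Whose GUID is this?
    SET @mystery = '6ccd780c-baba-1026-9564-0040f4311e29';

    SELECT 'account'     AS entity_type FROM accounts     WHERE id = @mystery
    UNION ALL
    SELECT 'member'                     FROM members      WHERE id = @mystery
    UNION ALL
    SELECT 'transaction'                FROM transactions WHERE id = @mystery;

(Try doing that with the integer 42: it could be a perfectly valid key in all three tables at once.)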

] .. [entity] table. You can then encapsulate your database with
] stored-procedures which encapsulate this implementation detail.
] Though I doubt you're likely to care about this in your scenario.)

 I've made a policy decision to avoid stored procedures.  In no small
 part because the complexity simply isn't there, and of course the
 client and db are both on the same machine.

It seems it can be a contentious issue, with hard lines drawn by various camps. I'm somewhere in the middle. I like sprocs because they help mitigate the possibility of SQL injection attacks. If you're building dynamic SQL by concatenating strings then you can really fuck up here. (For example, I've seen applications where it was possible to delete the contents of the database from the login screen.)
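(A concrete illustration, with invented table/column names; and to be fair, parameterised queries / prepared statements give you the same protection as sprocs here.)

    -- If the login page glues user input straight into a string, then the
    -- input   ' OR '1'='1   makes the server see:
    --   SELECT * FROM members WHERE login = '' OR '1'='1'   -- matches every row
    -- A stored procedure (MySQL 5.0+) treats the input as data, not as SQL:
    DELIMITER //
    CREATE PROCEDURE get_member (IN p_login VARCHAR(64))
    BEGIN
        SELECT * FROM members WHERE login = p_login;
    END //
    DELIMITER ;

    CALL get_member(''' OR ''1''=''1');  -- now it's just a very odd login name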


That said, you don't necessarily want to be limiting your 'flexibility' by requiring that everything be a pre-fabricated sproc.

To solve this problem, I make a clear distinction between OLAP and OLTP systems... (Here we go again... *shrugs*. :)

Transaction processing systems should use sprocs and can cause side-effects in the data-store. Reporting systems use dynamic SQL and do not cause side-effects in the data-store. The relational schemas suited to transaction processing and reporting applications vary wildly (i.e. OLTP = entity-relationship model, BCNF; whereas OLAP = star/snowflake schemas, multi-dimensional cubes, etc.). There is some 'grey area' on the edges of these extremes. My acid test is whether the performance is acceptable. If you need aggregates from your transactional system, and can't get them in real-time, then tough luck: time for an OLAP system.

Of course, you are (naively, pragmatically, or both?) conflating your OLTP and OLAP systems. If you can't get the performance you need, then you need to begin working on your OLAP strategy. (This will involve a data warehouse comprising fact tables, partitioned data, and pre-aggregated data.)

Don't denormalise your schema until after you *know* that you have a performance problem. (Btw, you won't have a performance problem over this issue in your application. Promise.) (Seriously, how many 'transactions' would a single member have over 5 years? Um, 'thousands'? Oh... oh, no! *Thousands*. Oh, what ever shall we do!? Look at *all that data*. Oh, my! It's so much! Oh, there's so many! Oooooooh. Ooooooh. (Sorry. :))

(Do put an index on the foreign key to account on the transactions table, though.)
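(I.e. something like the following, with the account-reference column names invented:)

    CREATE INDEX idx_trans_from ON transactions (trans_from);
    CREATE INDEX idx_trans_to   ON transactions (trans_to);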

 But yeah - the idea
 of a bit table mapping GUID's to INT's in an attempt to avoid putting
 GUID's in every table .. makes my nose bleed.

Settle down now, it's not *that* exciting. :)

It's actually pretty much a classic space vs. time trade-off.

] I don't quite understand your agenda here. Why are you fighting it? It
] strikes me that you're trying to be pragmatic. The thing is,
] auto-incremented integer keys are *not* pragmatic. Natural keys are
] *not* pragmatic. GUIDs epitomise pragmatism.

 Yeah yeah .. I think I'm getting that now.  I find the idea of
 'extra' data within my tables to be inelegant.

Whereas I find it to be an elegant generalisation. (There's no accounting for taste!)


 Data that I don't
 need, I mean, and that appears to duplicate the functionality of
 extant data that I do need.  Certainly I'm not worried about the
 disk that'll be taken up by the extra column(s) - it's likely
 immeasurably small.  And performance wouldn't be hit at all .. ?

Depends. :)

Obviously you'll have more data. You'll be 'wasting' keyspace.

Let me guess, you'd argue the virtues of ASCII over Unicode..? ;)

(Don't mention UTF-8, that'd be like cheating. ;)

 Actually - applying INDEXes to columns provides comparable
 performance to having them as primary keys, doesn't it?  Or have
 I misunderstood this bit too?

(I actually laughed out loud. (Not a cruel laugh, mind. More like "OMG, how am I going to answer *this* in two paragraphs or less."))


Dude! Stop asking these questions! :)

So... there are *different types* of indexes. There are *different ways* of storing indexes. There are *different constraints* which might be applied to indexed values.

A 'primary key', conceptually, has nothing to do with indexes. It's a 'constraint', it's a 'semantic', and it's 'meta-data'. The constraint is that no duplicate (or NULL!) values can exist in a primary key column. The semantic is that one table can have only one 'primary key', i.e. only one primary means of identification (used by 'foreign keys'). The 'meta-data' is that this column (or 'these columns', if that's your thing) has been nominated to fulfil this semantic. This all exists within the 'conceptual framework' that is 'relational theory' (relational theory according to me, that is. ;).

So... because a DBMS has this 'meta-data', which implies this 'semantic', it can understand some of the resultant implications. Further, because of the 'constraint' (assuming it's to be enforced), the DBMS must assume the responsibility of enforcing it. All of this conspires to make it *reasonable* (in the general case) to create some sort of index on a primary key field. ...and that's all I'm going to say. :)

(Indexing capabilities are very much 'implementation' dependent, and while relational theory and various standards (such as SQL-92) impose the 'conceptual framework', it's implementations which provide 'actual frameworks', and they all deviate somewhat from "academic purity" and bow, often, to "pragmatism". :)

(OK, fine.. one last thing. Typically a primary key will have a clustered index. A clustered index is stored with the data, meaning that the data needs to be stored 'in order'. Meaning that if your primary keys are GUIDs (not naturally 'incrementing', but turning up wildly 'out of order') and you're using a clustered index, then inserts are going to be *awfully* slow. A table can have only one clustered index (because the clustered index determines where the actual records themselves are stored). So... don't use your table's clustered index on its GUID primary key. (Caveat: this applies to SQL Server; I don't know anything about MySQL's indexing capabilities or implementation.))
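(For MySQL, as far as I can tell, InnoDB always clusters on the primary key and MyISAM doesn't cluster at all, so the same warning would apply to GUID primary keys on InnoDB tables. Structurally, the primary-key-versus-plain-index distinction looks something like this, names invented as usual:)

    CREATE TABLE transactions (
        id         CHAR(36)      NOT NULL,  -- a GUID, stored as text
        trans_from CHAR(36)      NOT NULL,
        trans_to   CHAR(36)      NOT NULL,
        amount     DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (id),                   -- constraint + semantic + (usually) an index
        INDEX idx_trans_from (trans_from)   -- 'just' an index: no uniqueness implied
    ) ENGINE=InnoDB;                        -- InnoDB clusters the rows on the PK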

] Are you aware that storing the balance in the account is denormalising
] your schema? Typically a summation of transactions would render the
] balance. I doubt that's likely to create a performance problem in your
] environment.

 Ha!  I hadn't thought of that .. but yeah, you're right of course.
 I was pretty chuffed a while back when I used normalisation to resolve
 my design problem of multiple users sharing the same pool of dosh,
 and was entertaining the idea of keeping a 'number of transactions'
 against each account.  But obviously that number could be determined
 from a count function where trans-from/trans-to matched the acc#.

Obviously. :)

 Two things I don't grok, here -- the performance .v. normalised schema
 tradeoff.

See my flagrant dismissal above. :)

 Dynamically pulling out everyone's balance by cycling thru
 all the transactions  -- this would be a major hit.

Did you know that you can buy 4 litres of wine in a box for $10? Now, that's progress for you! (Or is it? Hmm... hey, that's half tax. Wait a minute, this isn't progress! Cheap wine and taxes have existed throughout recorded history!)


Oh, I'm sorry. I was thinking about something else. I didn't mean to go off track. Where were we? Ah.. yes. You were making unsubstantiated claims about a summation over a handful of records being a "major hit". Right. See, about that... well, it's unsubstantiated. :)
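(For the record, here is the 'major hit' in full, with column names invented to match your trans-from/trans-to scheme:)

    -- One account's balance, computed on demand:
    SELECT COALESCE(SUM(CASE WHEN trans_to = @acc THEN amount
                             ELSE -amount END), 0) AS balance
      FROM transactions
     WHERE trans_to = @acc OR trans_from = @acc;

    -- ...and the transaction count you were tempted to cache:
    SELECT COUNT(*) AS txn_count
      FROM transactions
     WHERE trans_to = @acc OR trans_from = @acc;

(With the foreign-key indexes mentioned earlier, both of those are effectively instantaneous over a few thousand rows.)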

 Secondly, was
 thinking that dynamically doing exactly that, whilst keeping a balance
 (or #-of-transactions count) and comparing the two results to make
 sure there'd been no corruption of the underlying data -- seemed like
 a sensible idea.  What's the rules of thumb for those two types of
 design compromises?

OLAP vs. OLTP.

OLAP = Online Analytical Processing. Characteristics: denormalised schema (pre-aggregated or duplicated data), read-only (typically maintained periodically (every minute/hour/day/week, whatever) by off-line 'batch updates').

OLTP = Online Transaction Processing. Characteristics: normalised schema, ACID transactions, data mutations, (close to) real-time.

You can conflate these 'aspects' in a single database. The delineation between these two types of systems can become grey at the boundaries. For small systems you might like to treat your transaction processing data-store as a reporting data-source.

"Hi Joe, you conducted 15 transactions last month, and your current balance is $37." is a report. "Please transfer $12 to my wife's account" is a transaction.

 Absolute worst case, they simply generate a new account & member
 pairing, and handle them as separate entities -- which is what they
 would, in that hypothetical scenario, want to be doing anyway.

Yep. That's what I said would happen. If you don't care, you don't care. (Tautology is fun).


 But tell me -- what is the *effect* of changing the contents of
 a PK column in a record?  Given a single client talks to the DB,
 there's nothing sophisticated like temp tables with this design,
 what's actually the worst thing that can happen?  And would it
 be ameliorated if the record was deleted and then new ones (that
 just happen to have similar PK data) created immediately after?

The trouble is that you externalise those keys before you mutate them. So, say there is an account 'A'. Then I update 'A' to 'B'. Then a customer enquires about their account, 'A'. ...but, sir, there's no such thing as account 'A'!


(That's the effect. It can be more subtle, and create much larger problems.)
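(Concretely, with invented names:)

    -- The account number has already been printed on statements and quoted
    -- in emails by the time you do this:
    UPDATE accounts SET acc_no = 'B' WHERE acc_no = 'A';

    -- The customer's next enquiry:
    SELECT * FROM accounts WHERE acc_no = 'A';   -- zero rows: "no such account, sir"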

 Ahhh .. the closest I've got to a job offer for two months! ;)

I'd pay you to fix it, except I haven't got a business case for it at the moment. (I.e. the gear I was referring to was just stuff I have lying about for my own (eventual) personal gratification (and study).)


The offer stands, so long as you'll work for beer. ;)

 Is this the same gear that recently got W2K3'd, or other stuff?

Nope. I was referring to different gear. (In fact, I didn't even have that new gear at the time I made my original comment.) I have 24 computers, in various states of 'used' and/or 'administered'.

-
You are subscribed to the progsoc mailing list. To unsubscribe, send a
message containing "unsubscribe" to progsoc-request@xxxxxxxxxxxxxxxxxxx
If you are having trouble, ask owner-progsoc@xxxxxxxxxxxxxxxxxx for help.