Keyspace = {
    Column Family : {
        Row Key : {
            Column Name : 'Column Value'
            Column Name : 'Column Value'
            Column Name : 'Column Value'
            ...
        }
    }
}
Think you understand it? Pretend that you don’t know the meaning of Row, Key or Column. I wish someone told me that before I got into Cassandra.
Pop your stack: the relational model (mostly) does not apply
A paradigmatic dichotomy of data modeling is the notion of serial vs. parallel. Whether in databases or programming languages etc., a lot of effort has been made to express data in series, parallel, or some composition of both.
Series
a number of things, events, or people of a similar kind or related nature coming one after another …
In programming languages you have arrays, lists, vectors, etc.; in databases you have tables of rows. In a database you might find a table named user, which probably stores one or more user records. Any record in the user table does not necessarily relate to any other, and any limit on the number of records in the user table is implementation dependent and can, for the most part, be considered arbitrary.
Parallel
occurring or existing at the same time or in a similar …
or
The opposite of series.
Databases represent these as rows of columns; programming languages have tuples, maps, hashes, structs, classes, etc. Continuing the database example, a record in the user table probably stores exactly one fixed-size set of facts about a user as fields; all fields are assumed to be true for that user in some greater disposition; altering the set of fields fundamentally changes the model.
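In Python terms, the distinction might look like the sketch below. The User fields (name, email) and the sample values are made up for illustration; the text above only says that a user table holds user records.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class User:
    """Parallel: a fixed set of facts that hold together for one user."""
    name: str   # hypothetical field
    email: str  # hypothetical field


# Series: an open-ended run of similar things, one after another.
users = [
    User("alice", "alice@example.com"),
    User("bob", "bob@example.com"),
]

# Growing the series does not change the model ...
users.append(User("carol", "carol@example.com"))

# ... whereas adding a field to User would change the shape of every record.
field_names = [f.name for f in fields(User)]
```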
Does Cassandra do this? Yes & No, at the same time.
In Cassandra you can have many columns in a row: at the time of writing, about 2 billion. Major relational databases have limits on the order of thousands. If that seems unintuitive, take a moment to grok the difference between thousands and billions; we are not comparing apples to apples.
In this example, rows model entities and columns model keys. Depending on the replication strategy, bic and pilot might not exist on the same node. This is something to consider when defining the Keyspace.
Manufacturers = {
    bic : {
        origin : 'france'
        year : 1945
    }
    pilot : {
        origin : 'japan'
        year : 1918
    }
}
In this example, each row is a series of model numbers and their respective descriptions.
ProductDescriptionsByMfgr = {
    bic : {
        FRM41 : 'BIC 4-Color Ballpoint Pen Refill, Fine Point'
        MRM41 : 'BIC 4-Color Ballpoint Pen Refill, Medium Point'
    }
    pilot : {
        77227 : 'Dr. Glip, Better & EasyTouch Retractable Ballpoint Pen Refill, Medium, Black'
        77228 : 'Dr. Glip, Better & EasyTouch Retractable Ballpoint Pen Refill, Medium, Blue'
        77210 : 'Dr. Glip, Better & EasyTouch Retractable Ballpoint Pen Refill, Fine Point, Black'
        77221 : 'Ballpoint Pen Refill, Medium Point, Black'
        77222 : 'Ballpoint Pen Refill, Medium Point, Blue'
        77215 : 'Ballpoint Pen Refill, Fine Point, Black'
    }
}
You should be experiencing some cognitive dissonance. It’s OK, Cassandra conflates series and parallel within a row. So add columns all day long, but still add rows all day long too.
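One way to see the conflation is to model both column families as the same nested-map shape. This is only a toy sketch with plain Python dicts, not a Cassandra client; the get_row helper is a hypothetical name introduced for illustration.

```python
# Toy model: keyspace -> column family -> row key -> {column name: value}.
keyspace = {
    "Manufacturers": {
        "bic": {"origin": "france", "year": 1945},
        "pilot": {"origin": "japan", "year": 1918},
    },
    "ProductDescriptionsByMfgr": {
        "bic": {
            "FRM41": "BIC 4-Color Ballpoint Pen Refill, Fine Point",
            "MRM41": "BIC 4-Color Ballpoint Pen Refill, Medium Point",
        },
    },
}


def get_row(column_family, row_key):
    """Hypothetical helper: fetch every column of one row."""
    return keyspace[column_family][row_key]


# Parallel use: the columns are known fields of one entity.
origin = get_row("Manufacturers", "bic")["origin"]

# Series use: the columns are an open-ended run of similar items.
models = list(get_row("ProductDescriptionsByMfgr", "bic"))
```

Both reads go through exactly the same access path; only the meaning you assign to the column names differs.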
It is important to know that all columns in a row are sorted. If I add another column to a row, it will be inserted at the correct position. Even in the Manufacturers example, the column origin will precede year. How columns are sorted within a row is configurable per Column Family.
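A minimal sketch of that behaviour, assuming plain string comparison as a stand-in for whatever ordering the Column Family is configured with. The Row class is a toy written for this post, not part of any Cassandra API.

```python
import bisect


class Row:
    """Toy row that keeps its column names in sorted order."""

    def __init__(self):
        self._names = []   # column names, always sorted
        self._values = {}  # column name -> value

    def insert(self, name, value):
        # A new column is placed at its sorted position;
        # an existing column is simply overwritten.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def columns(self):
        return [(name, self._values[name]) for name in self._names]


bic = Row()
bic.insert("year", 1945)        # written first ...
bic.insert("origin", "france")  # ... yet it sorts ahead of "year"

pilot = Row()
for model in ("77227", "77210", "77221"):
    pilot.insert(model, "...")  # descriptions elided
```

Whatever order the columns are written in, columns() returns them sorted by name.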
Also, no distribution or replication takes place within a row: a row is never split up, and each copy of a row will contain all of its columns. Conversely, if your replication doesn’t take this into consideration, integrity or performance may suffer.
If you want something more complex than a single datum as a column’s value, check out Super Columns.
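The sketch below illustrates the idea that placement is decided by the row key alone. It is a deliberately simplified toy, not Cassandra's actual partitioner or replication strategy: the node names, the modulo-based placement, and the replication factor of 2 are all assumptions made for illustration.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster

manufacturers = {
    "bic": {"origin": "france", "year": 1945},
    "pilot": {"origin": "japan", "year": 1918},
}


def replicas_for(row_key, replication_factor=2):
    """Pick nodes from the row key only; columns play no part."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    start = int.from_bytes(digest, "big") % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replication_factor)]


def place(column_family):
    """Copy each row, whole, onto every node chosen for its key."""
    placement = {node: {} for node in NODES}
    for row_key, columns in column_family.items():
        for node in replicas_for(row_key):
            placement[node][row_key] = dict(columns)
    return placement


placement = place(manufacturers)
holders = [node for node in NODES if "bic" in placement[node]]
```

Every node that holds bic holds all of its columns, which is why a row with a very large number of columns lands entirely on the same few nodes.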
The norm is to Denorm: model for queries.
If the intent of a query is to retrieve a single item then model that single item:
Products = {
    77227 : {
        description : 'Dr. Glip, Better & EasyTouch Retractable Ballpoint Pen Refill, Medium, Black'
        manufacturer : 'pilot'
        price : '$1.59'
        quantity : '2'
    }
}
If you need a listing of products with descriptions per manufacturer, store exactly that (as in the ProductDescriptionsByMfgr example). Don’t feel the need to normalize.
Is this a waste of disk? Disk is cheap; querying a big dataset is not.
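In practice, modeling for queries means the write path does the extra work. Here is a rough sketch using in-memory dicts in place of column families; the add_product function is a hypothetical helper, not a Cassandra API.

```python
from collections import defaultdict

# Two "column families", each shaped for exactly one query.
products = {}                                    # lookup by model number
product_descriptions_by_mfgr = defaultdict(dict)  # listing per manufacturer


def add_product(model, description, manufacturer, price, quantity):
    """Hypothetical helper: write the same fact everywhere it will be read."""
    products[model] = {
        "description": description,
        "manufacturer": manufacturer,
        "price": price,
        "quantity": quantity,
    }
    # Duplicate the description on purpose, so that listing a
    # manufacturer's products never has to touch `products`.
    product_descriptions_by_mfgr[manufacturer][model] = description


add_product(
    "77227",
    "Dr. Glip, Better & EasyTouch Retractable Ballpoint Pen Refill, Medium, Black",
    "pilot",
    "$1.59",
    "2",
)
```

Each query is then a single row read. The cost is that every write has to keep the copies in step.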
Mixing implementation and domain modeling …
It may seem as if there is no clean separation of domain modeling and actual implementation: the structures available in Cassandra come with many critical implementation-specific strings attached. Very true, but before holding this against Cassandra, consider whether other database systems are free from this. Things like replication, sharding, data warehousing, denormalized data, etc. are common in decent-sized implementations and will definitely leak back into the domain modeling.
Terminology killed the cat
Not only is Cassandra’s terminology confusing, it’s downright misleading. Row, Column & Key all have existing semantics in the land of databases. To make matters worse, Cassandra’s definitions are not even orthogonal to the existing ones; they exist in a difficult state of quasi-synonymity.
Despite this disservice, the set of small unused words apropos of databases is probably depleting as fast as four-letter English profanity. I’ll take key over distributed ordered set descriptor any day.
I was thrown into the Cassandra pool without knowing how to swim; I hope this helps anyone in the same situation. Expert swimmers out there, please correct me where I’m wrong.
