flazz.me

  • Grok Cassandra's Datamodel

    • 17 Jul 2011
    • cassandra database nosql
    Keyspace = {
    
      Column Family : {
    
        Row Key : {
          Column Name : 'Column Value'
          Column Name : 'Column Value'
          Column Name : 'Column Value'
          ...
        }
    
      }
    
    }

    Think you understand it? Pretend that you don’t know the meaning of Row, Key or Column. I wish someone had told me that before I got into Cassandra.

    Pop your stack: the relational model (mostly) does not apply

    A paradigmatic dichotomy of data modeling is the notion of serial vs. parallel. Whether in databases or programming languages etc., a lot of effort has been made to express data in series, parallel, or some composition of both.

    Series

    a number of things, events, or people of a similar kind or related nature coming one after another …

    In programming languages you have arrays, lists, vectors, etc.; in databases you have tables of rows. In a database you might find a table named user, which probably stores one or more user records. Any record in the user table does not necessarily relate to any other, and any limit on the number of records in the table is implementation dependent and can, for the most part, be considered arbitrary.

    Parallel

    occurring or existing at the same time or in a similar …

    or

    The opposite of series.

    Databases represent these as rows of columns; programming languages have tuples, maps, hashes, structs, classes, etc. Continuing the database example, a record in the user table probably stores exactly one fixed-size set of facts about a user as fields; all fields are assumed to be true for that user in some greater disposition; altering the set of fields fundamentally changes the model.

    Does Cassandra do this? Yes & No, at the same time.

    In Cassandra you can have many columns in a row: at the time of writing, about 2 billion. Major relational databases have limits on the order of thousands. If that seems unintuitive, take a moment to grok the difference between thousands and billions. We are not comparing apples to apples.

    In this example, rows model entities and columns model keys. Depending on the replication strategy, bic and pilot might not exist on the same node. This is something to consider when defining the Keyspace.

    Manufacturers = {
    
      bic : {
        origin : 'france'
        year : 1945
      }
    
      pilot : {
        origin : 'japan'
        year : 1918
      }
    
    }

    In this example, each row is a series of model numbers and their respective descriptions.

    ProductDescriptionsByMfgr = {
    
      bic : {
        FRM41 : 'BIC 4-Color Ballpoint Pen Refill, Fine Point'
        MRM41 : 'BIC 4-Color Ballpoint Pen Refill, Medium Point'
      }
    
      pilot : {
        77227 : 'Dr. Glip, Better & EasyTouch Retractable Ballpoint Pen Refill, Medium, Black'
        77228 : 'Dr. Glip, Better & EasyTouch Retractable Ballpoint Pen Refill, Medium, Blue'
        77210 : 'Dr. Glip, Better & EasyTouch Retractable Ballpoint Pen Refill, Fine Point, Black'
        77221 : 'Ballpoint Pen Refill, Medium Point, Black'
        77222 : 'Ballpoint Pen Refill, Medium Point, Blue'
        77215 : 'Ballpoint Pen Refill, Fine Point, Black'
      }
    
    }

    You should be experiencing some cognitive dissonance. It’s OK: Cassandra conflates series and parallel within a row. So add columns all day long, and keep adding rows all day long too.

    It is important to know that all columns in a row are sorted. If I add another column to this row it will be inserted at the correct position. Even in the previous example, the column origin will precede year. How columns are sorted within a row is configurable per Column Family.
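
    To make the sorted-columns idea concrete, here is a minimal Python sketch. It is a toy model and not Cassandra code: the Row class and its method names are made up for illustration, and plain string ordering stands in for whatever comparator the Column Family is configured with.

```python
import bisect


class Row:
    """Toy model of a row whose columns stay sorted by name (hypothetical)."""

    def __init__(self):
        self._names = []    # column names, always kept in sorted order
        self._values = {}   # column name -> column value

    def insert(self, name, value):
        # A new column lands at its sorted position, not at the end.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def columns(self):
        return [(n, self._values[n]) for n in self._names]

    def slice(self, start, finish):
        # Columns whose names fall in [start, finish]; cheap because of the ordering.
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


bic = Row()
bic.insert("year", 1945)        # inserted first...
bic.insert("origin", "france")  # ...but origin still sorts ahead of year
print(bic.columns())
```

    Because the names are ordered, asking for a contiguous range of columns (the slice method above) is a natural operation, which is part of why rows that hold a series of things work well.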

    Also, no distribution or replication takes place within a row: each copy of a row will contain all of its columns. So if your replication strategy doesn’t take this into consideration, integrity or performance may suffer.

    If you want something more complex than a single datum as a column’s value, check out Super Columns.

    The norm is to Denorm: model for queries.

    If the intent of a query is to retrieve a single item then model that single item:

    Products = {
    
      77227 : {
        description : 'Dr. Glip, Better & EasyTouch Retractable Ballpoint Pen Refill, Medium, Black'
        manufacturer : 'pilot'
        price : '$1.59'
        quantity : '2'
      }
    
    }

    If you need a listing of products with descriptions per manufacturer, store exactly that (as in the ProductDescriptionsByMfgr example). Don’t feel the need to normalize.
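
    A rough Python sketch of what "model for queries" means in practice: one write fans out into every structure a query will read from. Nested dicts stand in for Column Families here, and the add_product helper is hypothetical, not part of any Cassandra client.

```python
from collections import defaultdict

# Two stand-in "column families": row key -> {column name: column value}.
products = defaultdict(dict)
product_descriptions_by_mfgr = defaultdict(dict)


def add_product(model, manufacturer, description, **extra):
    """Store the same facts once per query that needs them (hypothetical helper)."""
    # Shape 1: everything about a single product, keyed by model number.
    products[model].update(description=description, manufacturer=manufacturer, **extra)
    # Shape 2: a listing of descriptions per manufacturer.
    product_descriptions_by_mfgr[manufacturer][model] = description


add_product(
    "77227",
    "pilot",
    "Dr. Glip, Better & EasyTouch Retractable Ballpoint Pen Refill, Medium, Black",
    price="$1.59",
    quantity="2",
)

# Each query is now a single row lookup, no join required.
print(products["77227"]["price"])
print(product_descriptions_by_mfgr["pilot"])
```

    The description is deliberately stored twice. That duplication is the price of answering each query with one read.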

    Is this a waste of disk? Disk is cheap; querying a big dataset is not.

    Mixing implementation and domain modeling …

    It may seem as if there is no clean separation of domain modeling and actual implementation. The structures available in Cassandra come with many critical implementation-specific strings attached. Very true, but before holding this against Cassandra, consider if other database systems are free from this. Things like replication, sharding, data warehousing, denormalized data, etc. are common in decent sized implementations and will definitely leak back into the domain modeling.

    Terminology killed the cat

    Not only is Cassandra’s terminology confusing, it’s downright misleading. Row, Column & Key all have existing semantics in the land of databases. To make matters worse, Cassandra’s definitions are not even orthogonal to the existing ones: they exist in a difficult state of quasi-synonymity.

    Despite this disservice, the set of small unused words apropos of databases is probably depleting as fast as four-letter English profanity. I’ll take key over distributed ordered set descriptor any day.

    I was thrown into the Cassandra pool without knowing how to swim; I hope this helps anyone in the same situation. Expert swimmers out there, please correct me where I’m wrong.

  • redis: the AK-47 of databases

    • 27 Jan 2011
    • ak-47 database nosql redis reliable replicate scale simple

    TL;DR: Redis is the AK-47 of databases

    AK-47 image from wikipedia

    Easy

    Installation was super easy, on a mac:

    brew install redis

    Starting it up was just as easy:

    redis-server /usr/local/etc/redis.conf

    Using it?

    % redis-cli
    redis> set this.is.a.key "this is a value"
    OK
    redis> get this.is.a.key
    "this is a value"

    Redis also supports more complex data structures, but the idea is the same: you set some state to a name. No tables, no schema, no JSON, no map nor reduce. If you don’t get the gist of this, then please take out your safety pencil and a circle of paper.

    Simple

    The protocol is machine & human readable. Don’t believe me?

    % nc localhost 6379
    get this.is.a.key # i typed this
    $15               # redis says 15 characters are coming
    this is a value   # redis sent 15 characters
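
    The exchange above is simple enough to handle by hand. Here is a small Python sketch of both directions, covering only what the nc session shows: a space-separated command line going out and a length-prefixed bulk reply coming back. The function names are made up, and this is nowhere near a complete client.

```python
def encode_inline(*words):
    # What was typed into nc: words joined by spaces, terminated by a newline
    # (CRLF is used here).
    return (" ".join(words) + "\r\n").encode()


def decode_bulk(reply):
    # A bulk reply looks like: $<length> CRLF <payload> CRLF
    header, _, rest = reply.partition(b"\r\n")
    if not header.startswith(b"$"):
        raise ValueError("not a bulk reply")
    length = int(header[1:])
    if length == -1:
        # "$-1" is the reply for a key that does not exist.
        return None
    return rest[:length].decode()


print(encode_inline("get", "this.is.a.key"))
print(decode_bulk(b"$15\r\nthis is a value\r\n"))
```

    The length prefix is what keeps it machine readable: the reader knows exactly how many bytes to consume, so values can contain spaces or newlines without any escaping.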

    The stock config file has on the order of 30 options, 25 uncommented active ones:

    % grep -v '^#\|^$' /usr/local/etc/redis.conf
    daemonize no
    pidfile /usr/local/var/run/redis.pid
    port 6379
    timeout 300
    loglevel verbose
    logfile stdout
    databases 16
    save 900 1
    save 300 10
    save 60 10000
    rdbcompression yes
    dbfilename dump.rdb
    dir /usr/local/var/db/redis/
    appendonly no
    appendfsync everysec
    vm-enabled no
    vm-swap-file /tmp/redis.swap
    vm-max-memory 0
    vm-page-size 32
    vm-pages 134217728
    vm-max-threads 4
    glueoutputbuf yes
    hash-max-zipmap-entries 64
    hash-max-zipmap-value 512
    activerehashing yes

    It’s not that hard to guess what most of these are, but why guess when the stock config is so well documented? It seems like the useful 80% of what you can know about the server side is documented there.

    There is one authentication method: a password, off by default. If you specify a password in the config file then it’s simple:

    redis> AUTH thepasswordintheconfigfile
    OK

    Now you can use the database, just use it, seriously. But be warned:

    Note: because of the high performance nature of Redis, it is possible to try a lot of passwords in parallel in very short time, so make sure to generate a strong and very long password so that this attack is infeasible.

    Backup? Use cp, as in /bin/cp …

    % cp /usr/local/var/db/redis/dump.rdb /some/place/to/backup.rdb

    Need some more dials, knobs, domain specific configuration languages, obfuscated enterprise grade protocols, ldap authentication, Photoshop integration or some sort of bean? Call up Larry Ellison, he’ll be glad to provide you with an aspect certified enterprise-oriented solution.

    Powerful

    Does it do transactions? Yes

    redis> MULTI
    redis> ...
    redis> ...
    redis> ...
    redis> EXEC

    The above code will get exclusive rights to the dataset. All other clients' commands will queue up. EDIT: this explanation is wrong. The commands issued between MULTI and EXEC are queued and then run together when EXEC is called; other clients are not locked out. Please see the docs.

    But Redis also has optimistic locks that don’t exclusively lock the entire dataset.

    redis> WATCH mykey
    redis> MULTI
    redis> ...
    redis> ...
    redis> ...
    redis> EXEC

    If mykey is modified by another connection any time between the WATCH and EXEC commands, the transaction is aborted: EXEC returns a null reply and none of the queued commands run. The important part is that if no one is stepping on anyone else’s toes (data) then nobody waits on anybody. If by chance you do step on someone’s toe, just back off for a second and try again.
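
    The check-then-retry pattern can be simulated in a few lines of Python. This is an in-memory stand-in, not a Redis client: ToyStore, its version counters and the increment helper are all invented for illustration, and real Redis tracks watched keys per connection rather than handing out version numbers.

```python
class ToyStore:
    """In-memory stand-in for the server; every key carries a version counter."""

    def __init__(self):
        self.data = {}
        self.versions = {}

    def set(self, key, value):
        self.data[key] = value
        self.versions[key] = self.versions.get(key, 0) + 1

    def watch(self, key):
        # Remember which version of the key we saw.
        return self.versions.get(key, 0)

    def exec_if_unchanged(self, key, seen_version, queued):
        # Abort (return None) if anyone touched the key since watch().
        if self.versions.get(key, 0) != seen_version:
            return None
        for k, v in queued:
            self.set(k, v)
        return ["OK"] * len(queued)


def increment(store, key, interfere=None, max_tries=5):
    for attempt in range(max_tries):
        seen = store.watch(key)
        new_value = store.data.get(key, 0) + 1
        if interfere is not None and attempt == 0:
            interfere(store)  # simulate another client writing in between
        if store.exec_if_unchanged(key, seen, [(key, new_value)]) is not None:
            return new_value
        # Somebody stepped on our toes: loop around and try again.
    raise RuntimeError("too much contention")


store = ToyStore()
store.set("mykey", 10)
print(increment(store, "mykey", interfere=lambda s: s.set("mykey", 20)))
```

    The first attempt is thrown away because the other writer got in first; the second attempt sees the new value and succeeds.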

    Pub/Sub

    Pub/Sub is a unique and useful feature. A client publishes messages on a channel and zero or more subscribers receive that message. It’s kinda like IRC.

    # client 1
    redis> SUBSCRIBE the.chan
    Reading messages... (press Ctrl-c to quit)
    1. "subscribe"
    2. "the.chan"
    3. (integer) 1
    
    # client 2
    redis> PUBLISH the.chan "hey guise whats cookin"
    (integer) 1
    
    # client 1 (same session, still subscribed)
    1. "message"
    2. "the.chan"
    3. "hey guise whats cookin"
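
    The moving parts of the session above fit in a short Python sketch. ToyBroker is a hypothetical in-process model and not how Redis is implemented; it only shows the fan-out and the fact that nothing is stored for later.

```python
from collections import defaultdict


class ToyBroker:
    """In-process model of channels and subscribers (hypothetical)."""

    def __init__(self):
        self.channels = defaultdict(list)  # channel name -> subscriber callbacks

    def subscribe(self, channel, callback):
        self.channels[channel].append(callback)

    def publish(self, channel, message):
        subscribers = self.channels.get(channel, [])
        for deliver in subscribers:
            deliver(channel, message)
        # Like PUBLISH, report how many subscribers received the message.
        return len(subscribers)


broker = ToyBroker()
inbox = []

# Nobody is listening yet, so this message simply vanishes.
print(broker.publish("the.chan", "anyone here?"))

broker.subscribe("the.chan", lambda chan, msg: inbox.append((chan, msg)))
print(broker.publish("the.chan", "hey guise whats cookin"))
print(inbox)
```

    Messages are fire-and-forget: a subscriber that joins late, like on IRC, never sees what was said before it arrived.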

    Not enough XML for you? Go fsync yourself.

    No Surprises

    Does it scale? Yep, via replication. One master to many slaves. The slaves ask the master for new data, then everyone has the same data.

    The entire dataset is in RAM. This makes things as fast as, well, RAM. Always. What about when the power goes out? It’s periodically saved to disk.

    No virtual memory by default. Everything really is in RAM. If you don’t have enough RAM then turn virtual memory support on.

    What about between saves? A log (since the last save) is kept on disk. It can be replayed to recover unsaved data.

    Is that log fsynced? Every second by default, but you can configure it.

    There are potential windows of time when data can be lost. The important thing is that those windows are known and can be considered instead of surprising you.
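
    As a sketch of what "periodically saved" means, the three save lines in the stock config can be read as (seconds, changes) pairs: snapshot once at least that many seconds have passed and at least that many keys have changed. The Python below encodes that reading; the function name is made up and the real server's bookkeeping is more involved.

```python
# (seconds, changes) pairs mirroring the "save 900 1", "save 300 10" and
# "save 60 10000" lines from the stock config.
SAVE_RULES = [(900, 1), (300, 10), (60, 10000)]


def should_snapshot(seconds_since_save, changes_since_save, rules=SAVE_RULES):
    """True if any rule is satisfied: enough time has passed AND enough keys changed."""
    return any(
        seconds_since_save >= seconds and changes_since_save >= changes
        for seconds, changes in rules
    )


print(should_snapshot(60, 10000))  # busy server: saved after a minute
print(should_snapshot(899, 9))     # a few changes, not quite 15 minutes: not yet
print(should_snapshot(900, 1))     # a single change is saved once 15 minutes pass
```

    Read this way, the windows are easy to reason about: with the append-only log off, a handful of writes can sit unsaved for up to 15 minutes, while a heavy write load is saved within about a minute.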

    AK-47 Appeal

    Why does Redis appeal to me? Pareto’s principle could be applied: when using software, one could argue that 80% of the time is spent using 20% of the features. Redis seems to implement the vital few features very well.

    According to AK-47 legend, assault rifles were not popular because of their tendency to consume large amounts of ammo. The Soviets embraced the idea and simply supplied their troops with more ammo. Kinda like Redis and RAM.

    There are more precise guns out there, but you can’t pack them with wet dirt and expect them to fire unconditionally. There are more sophisticated designs, but factories cannot crank them out anywhere in the world, 2nd world or 3rd world. This is because the design of the AK-47 is, for lack of a better term, simple. There is inherent robustness in simplicity. You can shoot an AK-47 if you have two things: an AK-47 and ammo. You can host or replicate a Redis dataset on any machine that has two things: Redis and enough RAM.

    In the spirit of full disclosure, I’m a newb to Redis. My knowledge is basically the contents of this post (at the time of writing). I don’t use it in production (yet). Likewise I’m no expert in AK-47s (I’ve never even fired one) or guns in general. I’m aware via notoriety alone.

    See also: Unix – The Hole Hawg, by Neal Stephenson

  • About

    programmer in austin
