Data is becoming ever more important, and databases are incredibly powerful. They underlie most cutting-edge technologies: they are prevalent behind a huge fraction of websites and are crucial components of many computer systems and electronic devices that maintain some amount of persistent information. Besides data persistence, database systems provide a number of other properties that make them exceptionally useful and convenient: reliability, efficiency, scalability, concurrency control, data abstraction, and high-level query languages.

It is important to understand database design and the use of database management systems for applications. Extensive knowledge of the relational model, relational algebra, and SQL is therefore recommended for a programmer, as these are widely used paradigms. There are also semi-structured and unstructured data formats: XML data, including DTDs and XML Schema for validation, and the query and transformation languages XPath, XQuery, and XSLT may be of interest. Database design in UML and relational design principles based on dependencies and normal forms are also worth mentioning in this respect. From a design and application-building perspective, several key topics are relevant, such as indexes, views, transactions, authorization, integrity constraints, triggers, on-line analytical processing (OLAP), JSON, cloud databases, and emerging NoSQL systems.

Kinds of Databases

There are many different types of databases; the most common ones are relational, document, and key-value databases.

  • Relational
    • PostgreSQL
    • MySQL
    • SQLite
  • Document
    • MongoDB
    • CouchDB
  • Key-Value
    • Redis
    • Cassandra
    • Riak
  • Graph
  • Object/Hierarchical
  • Spatial
  • Time-series / RRD
  • Search



The main theory behind most databases is summed up by the ACID properties:

  • Atomicity
  • Consistency
  • Isolation
  • Durability
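Atomicity, for instance, can be demonstrated with Python's built-in sqlite3 module; the table and values below are made up for illustration:

```python
import sqlite3

# In-memory database; account names and amounts are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # one atomic transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 100 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated crash before crediting bob")
except RuntimeError:
    pass

# Atomicity: the half-finished transfer was rolled back entirely.
balance = conn.execute("SELECT balance FROM accounts "
                       "WHERE name = 'alice'").fetchone()[0]
print(balance)  # 100
```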


According to the CAP theorem, a distributed database can only guarantee two of the following three properties:

  • Consistency
  • Availability
  • Partition tolerance


A relational database dictates a schema: columns, rows, relations, and normalization. Queries need to be well formulated, as it is easy to write sub-optimal ones.
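The schema idea can be sketched with a small normalized design in SQLite; the tables and data below are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized: author data lives in one table, and books reference it by key
# instead of repeating the author's name in every book row.
conn.executescript("""
    CREATE TABLE authors (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE books (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        year      INTEGER,
        author_id INTEGER REFERENCES authors(id)
    );
""")
conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Hevok')")
conn.execute("INSERT INTO books (title, year, author_id) VALUES ('Chaos', 2012, 1)")

# A well-formulated query joins the tables in a single round trip.
rows = conn.execute("""
    SELECT authors.name, books.title
    FROM books JOIN authors ON books.author_id = authors.id
    WHERE books.year = 2012
""").fetchall()
print(rows)  # [('Hevok', 'Chaos')]
```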

Looping and issuing a separate query for every individual item is inefficient:

names = set()
for book in Book.objects.filter(year=2012):
    names.add(book.author.name)  # one extra query per book to fetch the author

Instead reformulate the statement to perform a single query:

names = Author.objects.filter(books__year=2012)\
                      .values_list('name', flat=True)


MySQL has many issues to be aware of, but it can also be very fast:

  • No transactional DDL
  • Poor query optimizer
  • MyISAM: full-table locking, no transactions
  • Owned by Oracle
  • Very fast for some operations

Many features there do not work at the same time (often either one feature or another can be active). For instance, MyISAM offers full-table locking but no transactions (InnoDB should be preferred). The query optimizer is also poor, and the owner is Oracle (which might not be a good thing). However, for some tasks it is fast.
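Transactional DDL means schema changes participate in transactions and can be rolled back; SQLite (and Postgres) support this, while classic MySQL implicitly commits around each DDL statement. A sketch of the behaviour MySQL lacks, using sqlite3 with a made-up table name:

```python
import sqlite3

# Autocommit mode so transactions can be controlled explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)

conn.execute("BEGIN")
conn.execute("CREATE TABLE half_migrated (id INTEGER)")  # DDL inside a transaction
conn.execute("ROLLBACK")  # with transactional DDL the table vanishes again

tables = [row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # []
```

Without transactional DDL, a migration that fails halfway leaves the schema in a mixed state that must be cleaned up by hand.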


SQLite is great for development but not for production, because of the following points:

  • Little integrity checking (slowly being fixed)
  • Impossible to do some table alterations
  • No concurrent access
  • Very low overhead
  • Very portable

SQLite has little integrity checking, and some table alterations are impossible. There is no concurrent access, because it tries very hard to be a small, lightweight, portable database. On the other hand, this is also a strength: it has very low overhead, is very portable, and is easy to install.
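The weak integrity checking can be seen with foreign keys, which SQLite does not enforce unless explicitly told to; a small sketch with made-up tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY,
                        author_id INTEGER REFERENCES authors(id));
""")

# By default the foreign key is not enforced: this dangling reference succeeds.
conn.execute("INSERT INTO books (author_id) VALUES (999)")
conn.commit()  # close the implicit transaction so the PRAGMA takes effect

# Enforcement has to be switched on explicitly, per connection.
conn.execute("PRAGMA foreign_keys = ON")
try:
    conn.execute("INSERT INTO books (author_id) VALUES (999)")
    enforced = False
except sqlite3.IntegrityError:
    enforced = True
print(enforced)  # True
```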


It is hard to find many issues with Postgres, apart from a few:

  • Slow default configuration
  • Can be little harder to learn / less familiar
  • Almost too many features
  • Incredibly reliable

Postgres has a slow default configuration, and it can be a little harder to learn. But despite these obstacles it has almost too many features and is incredibly reliable. Furthermore, it has all the SQL features, transactions for everything, and even great geospatial support.


In document databases there is no fixed schema. They have a low barrier to entry and are closer to Python datatypes.

Their generic problem is that they are immature. Migrations in the database are not needed, but must still be performed at the level of the code. Schemas have their use, as they are there for benefit and protection, so a document database is more risky. There are neither transactions nor integrity checks. There are lots of other issues; for example, MongoDB does not free up disk space unless the entire database is dropped and re-created, and data could be lost upon a crash (this has since improved).

  • Advantages:
    • No fixed schema
    • Low barrier to entry
    • Closer to python datatypes
  • Problems
    • Immature (but improving)
    • No transactions
    • No integrity checking
A document is a nested structure of fields, lists, and sub-documents:

   {'name': 'Hevok',
    'tags': ['chaos', 'order', 'simplicity', 'complexity'],
    'addresses': [
       {'type': 'email', 'at': 'denigma.de'},
       {'type': 'phone', 'dial': '012345'}]}

Queries match on fields, including nested ones via dotted paths:

   {'tags': 'chaos',
    'addresses.type': 'phone'}
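How a document store resolves such dotted-path queries can be sketched in plain Python; `match` is a made-up helper for illustration, not a real driver API:

```python
def match(doc, query):
    """Check a document against a Mongo-style query with dotted paths."""
    for path, wanted in query.items():
        values = [doc]
        for key in path.split('.'):
            next_values = []
            for value in values:
                # Lists are searched element-wise, like nested sub-documents.
                items = value if isinstance(value, list) else [value]
                for item in items:
                    if isinstance(item, dict) and key in item:
                        next_values.append(item[key])
            values = next_values
        # A field matches if the wanted value equals it or is contained in it.
        if not any(v == wanted or (isinstance(v, list) and wanted in v)
                   for v in values):
            return False
    return True

doc = {
    'name': 'Hevok',
    'tags': ['chaos', 'order', 'simplicity', 'complexity'],
    'addresses': [
        {'type': 'email', 'at': 'denigma.de'},
        {'type': 'phone', 'dial': '012345'},
    ],
}
print(match(doc, {'tags': 'chaos', 'addresses.type': 'phone'}))  # True
print(match(doc, {'addresses.type': 'fax'}))  # False
```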


Key-value databases are fast, but they can only be queried by key (which is why they are fast).

  • Horizontal scaling (but with drawbacks)
  • Extremely fast
  • Can only fetch objects by key
  • Batch/map-reduce queries
  • Transactions not possible

They allow horizontal scaling (just throw in another box), but the drawback is that consistency is not perfect. They can only be queried by key, which means they are mostly only useful for things like caching and timestamps. They often support batch/map-reduce queries.

db.set('EVA', 'online')          # store a string value under a key
state = db.get('EVA')            # fetch it back by key

db.sadd('aspects', 'research')   # add members to a set
db.sadd('aspects', 'programming')
db.sadd('aspects', 'design')
aspects = db.scard('aspects')    # count the set's members (3)
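The `db` object above uses Redis-style command names (`set`, `get`, `sadd`, `scard`). A minimal in-memory stand-in makes the snippet runnable without a server:

```python
class FakeKV:
    """Tiny in-memory stand-in for a key-value store's string and set commands."""
    def __init__(self):
        self.data = {}
    def set(self, key, value):
        self.data[key] = value          # store a value under a key
    def get(self, key):
        return self.data.get(key)       # fetch it back by key
    def sadd(self, key, member):
        self.data.setdefault(key, set()).add(member)  # add a member to a set
    def scard(self, key):
        return len(self.data.get(key, set()))         # set cardinality

db = FakeKV()
db.set('EVA', 'online')
state = db.get('EVA')            # 'online'
db.sadd('aspects', 'research')
db.sadd('aspects', 'programming')
db.sadd('aspects', 'design')
aspects = db.scard('aspects')    # 3
```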


Spatial databases are strange beasts.

  • Knowledge of projections = useful
  • Spatial indexes really speed up some problems
  • Generally add-on to existing DB

Knowledge of projections is useful, as every single region has its own projection, creating conversion confusion. Use spatial indexes, because they really speed up some problems. In general they are add-ons to existing databases (for instance PostGIS for Postgres).


A filesystem can also act as a database. It is a hierarchical key-value store which supports large files, and it is very stable and well known.

  • Hierarchical key-value store
  • Allows multiple writers for appends
  • Supports very large files
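Treating paths as keys takes only a few lines with `pathlib`; the directory layout below is made up:

```python
import tempfile
from pathlib import Path

# The directory tree is the hierarchy, file paths are the keys.
root = Path(tempfile.mkdtemp())

key = root / 'units' / 'EVA' / 'state'
key.parent.mkdir(parents=True, exist_ok=True)  # intermediate "namespaces"
key.write_text('online')

value = key.read_text()
print(value)  # online

# Appends from multiple writers go through the OS's append mode.
log = root / 'events.log'
with log.open('a') as f:
    f.write('EVA went online\n')
```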


Graph databases are very efficient for neighbour queries, but not for anything else.

  • Allow efficient neighbour queries
  • Generally not much use for anything else
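A neighbour query is essentially a single lookup in an adjacency structure; a minimal in-memory sketch with made-up nodes:

```python
# Adjacency mapping: node -> set of directly connected nodes.
graph = {
    'EVA':         {'research', 'design'},
    'research':    {'EVA', 'programming'},
    'design':      {'EVA'},
    'programming': {'research'},
}

# Neighbour query: one dictionary lookup, independent of graph size.
neighbours = graph['EVA']

# Two-hop neighbourhood: follow the edges one level further.
two_hops = set().union(*(graph[n] for n in neighbours)) - {'EVA'}
print(sorted(two_hops))  # ['programming']
```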


As RRD deliberately discards old data, it is actually useful for logging or statistics, where old data is not needed anyway; discarding it prevents running out of disk space.

  • Deliberately loses old data
  • Useful for logging or statistics
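The round-robin idea can be sketched with a fixed-size buffer that silently drops the oldest entries, e.g. Python's `collections.deque`:

```python
from collections import deque

# A fixed-size ring buffer: once full, each append drops the oldest
# entry, so the log can never outgrow its capacity.
log = deque(maxlen=3)
for event in ['boot', 'login', 'query', 'logout']:
    log.append(event)

print(list(log))  # ['login', 'query', 'logout'] -- 'boot' was discarded
```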


It is unlikely that all data fits one paradigm. Big files are often already stored on the filesystem instead of in another database type.

Just getting a bigger server goes a long way, i.e. a single big box instead of horizontal scaling. This prevents unnecessary premature optimization and work, as well as headaches.

A database needs to be explored, and different things need to be tried out with it. Different types of databases need to be tested too.
