Databases

Databases

If we're building dynamic pages, we're going to need to assemble the data for them quickly: retrieve the specific record/records needed. (Or write them when saving data.)

Good thing somebody invented databases.

Databases

In particular, we generally need OLTP (Online Transaction Processing) workloads, not OLAP (Online Analytic Processing). Many small queries/​updates, not big batch analysis.

Relational Databases

Of course, relational databases are good at fetching and storing data quickly.

Don't forget that stuff you learned (or somebody tried to teach you) about them. In particular…

Relational Databases

Data normalization.

If you have the same fact stored in multiple places, you're using extra storage, and it's going to be a pain to update later. You're almost certainly going to update n-1 of the copies and have consistency problems.

Keep your data normalized and join when you need to (until proven otherwise).

Relational Databases

Keys and indexes.

Having the right key/index is the difference between O(n) and ≤ O(log n) time for lookup.

You won't notice when testing, but will as your data grows.

Relational Databases

Queries have some overhead, and joins can prevent them.

It's easy to end up with logic in your application like this:

orders = query('SELECT … FROM orders WHERE …')
for o in orders:
    client = query('''SELECT … FROM clients
            WHERE id={}''', o.client_id)
    # do something with the order and client…

This requires n+1 queries, which is expensive.

… but it's usually not that obvious: the loop and query can be hidden by an ORM and in an HTML template.

Relational Databases

This requires one query, and will be much faster.

orders = query('''SELECT o.…, c.…
        FROM orders o JOIN clients c
        ON o.client_id=c.id
        WHERE …''')
for o in orders:
    # do something with the order and client…

Again, you won't notice in testing with 3 records.

Relational Databases

You should know how to work with relational databases: SQL + a programming language.

But maybe you shouldn't actually do it. You probably won't write the SQL and will convince an ORM to behave well. (More later.)

Relational DB Servers

There are many relational database options.

There are several you can pay a lot for: IBM DB2, Oracle, Teradata, MemSQL. Let's leave those aside for now and not pay for something if the free ones will do the job.

Relational DB Servers

For doing basic storage for a web system, one of the open source relational databases is probably the default choice. The choices are likely:

  • MySQL/MariaDB
  • PostgreSQL

Relational DB Servers

Choosing between MySQL and PostgreSQL seems to be largely a matter of taste.

Both are good SQL databases. Both are well tested and currently storing trillions of rows of data out there without losing them.

Relational DB Servers

PostgreSQL has some extra features: JSON data type, GIS data, alter tables atomically.

MySQL is more commonly used and possibly faster.

Your choice.

Relational Cloud DB

The other option to store your data: let somebody else do it.

There are several cloud services that do relational database things, and the cloud provider takes care of the rest.

Notable examples: Amazon RDS, Amazon Aurora, SQL Azure, Google Cloud SQL.

Relational Cloud DB

The good: they worry about the administration of the servers, backups, etc.

But you pay for the convenience.

(More about cloud services later.)

Development DB

You need a “real” database server for a production web site.

But for development, it can be a pain. Setting up PostgreSQL just to manage some data for a developer to experiment probably isn't worth it.

Development DB

SQLite is an embedded database: it's a library that manages a file on disk so it looks to you like a full relational database.

Gives you all of the functionality you need to work with relational data, with none of the administration.

Development DB

It's not going to replace MySQL/​Postgres for workloads with high volume or concurrency, but that's not its job.

Also useful for storing data that's highly structured to you, but just looks like a file to users.

See also Derby: a similar tool in the Java ecosystem.

Object-Relational Mapping

Actually writing SQL is boring and repetitive. Most of what you end up doing:

  • Read and write data
  • … linked by primary keys.
  • … artificial primary keys.
  • … that you're going to treat as objects in your code.

You don't end up writing left semi joins with subqueries very often. It's mostly SELECT… JOIN… WHERE.

Object-Relational Mapping

You're going to have tables like:

idfnamelnamedepartment_id
1GregBaker7

And your program does the equivalent of:

p = Person(1)
p.fname = 'Greg'
p.lname = 'Baker'
p.dept = Department(…)
p.save()

… and the opposite to save to the DB.

Object-Relational Mapping

And you'd have to do it a lot: for every table/​class. Maybe for every type of query. Yawn.

An object-relational mapper automatically connects your tables and objects in the obvious way.

Object-Relational Mapping

You end up writing code something like this:

p = Person()
p.fname = 'Greg'
p.lname = 'Baker'
p.dept = Department.get(id=7)
p.save()

people = Person.findall(fname='Greg')

All of your DB manipulation with no SQL.

Object-Relational Mapping

It's not completely free. If you're stupid with the ORM, it will produce stupid queries. Try to keep in mind how what you say in the ORM will turn into SQL.

You still need indexes. You still need to keep the number of queries under control.

Object-Relational Mapping

The n+1 queries problem is easy to cause with an ORM.

<% @people.each do |person| %>
  <li><%= person.lname %>: <%= person.department.name %></li>
<% end %>

Is every iteration of the loop going to have to query for the person.department object? It depends how you fetched the people collection with the ORM. e.g. in Rails:

people = People.all          # n+1 queries about to happen
people = People.includes(:department).all        # 1 query

MVC

Many web development frameworks are organized as model–view–controller architectures.

Let's get it out of the way before the technology evaluation.

MVC

The model is about managing the data, and the logic that goes with manipulating it.

That's probably going to mean classes/​objects that represent the data. That's probably going to mean an ORM.

The model should capture good class design (for the code) and good relational data design (for the DB).

MVC

The view layer generates output for the user that represents the information.

For us, that's usually going to mean generating HTML. The HTML will probably be generated by an HTML template

MVC

You write templates like: (Handlebars template)

<h1>{{title}}</h1>
<ul>
  {{#each direction}}
    <li>{{this.label}}</li>
  {{/each}}
</ul>

… give it some (ORM) data and generate HTML like:

<h1>All Directions</h1>
<ul>
    <li>Up</li>
    <li>Left</li>
    <li>Turnwise</li>
</ul>

MVC

The controller responds to user actions. For us, usually HTTP requests.

Requests must be mapped to logic: URL routing or dispatching. e.g. request for a URL /schools/sfu/majors gets a response from a call list_majors('sfu').

MVC

Then the logic must be written that does the work. Probably involves using the model to get/​modify data, and the view to present it.

e.g. list_majors('sfu') has to look up some stuff with the ORM, build HTML from a template, and return something that specifies the HTTP response.

The controller is responsible for putting the other pieces together.

MVC

Different frameworks use these concepts and terms differently, but the concepts are still recognizable.

Some frameworks call themselves MVC, MVT, MVP, MVVM.

Relational Limitations

Imagine you're creating a large scale site with users around the world.

Requirements for database: High data “velocity” (lots of reads/​writes). High data “volume” (lots of data). Maybe too much of either for any single server.

Relational Limitations

Having servers in a single place probably isn't a good idea.

global request/response *

Making a request from the other side of the planet takes a round-trip of at least 134 ms (plus router delay, plus actually building the response, plus then downloading CSS/​JS/​images).

We're going to want to get servers (including DB servers) as close to the user as possible.

Relational Limitations

A traditional relational + ACID (Atomicity, Consistency, Isolation, Durability) starts to sound unrealistic.

SQL databases generally rely on locking to ensure ACID writes. Maintaining a lock across the planet is a ≥134 ms problem.

What if there's a network failure in between servers? Then what?

Relational Limitations

Joining tables also gets hard at high-volume.

How do you do an inner-join on terabyte-sized tables partitioned across many servers?

Relational Limitations

Maybe we have to give up on ACID + relations at this scale.

NoSQL

Out of this thinking comes new non-relational databases, called NoSQL. These are all very different, and it's hard to say much in general, but…

NoSQL

General features of NoSQL data stores:

  • simpler data model: no foreign keys or joins;
  • … so you end up denormalizing data;
  • weakened ACID guarantees;
  • … allowing distributed/​redundant servers;
  • querying that isn't (exactly) SQL.

NoSQL

Example: Cassandra. Querying with CQL, which is a lot like SQL:

SELECT name, occupation
   FROM users
   WHERE userid IN (199, 200, 207);
INSERT INTO users (name, occupation)
   VALUES ('Greg', 'Lecturer');

Missing: JOIN functionality; BEGIN TRANSACTION.

NoSQL

Added: options for how much replication you'd like in your data, and how much consistency you want to wait for.

CREATE KEYSPACE user_data
    WITH replication = {'class': 'SimpleStrategy',
        'replication_factor': 3};
CREATE KEYSPACE other_stuff
    WITH replication = {'class': 'NetworkTopologyStrategy', …};

CONSISTENCY quorum; -- at least half write before returning
INSERT INTO users …;
CONSISTENCY any; -- one writes before returning
INSERT INTO users …;

NoSQL

Result: you can read and write from many distributed servers. You can scale really big.

But, you can no longer join, and have to rearrange your data accordingly. Your ORM won't work as you expect.

NoSQL

Worse: the ACID guarantees you rely on for data coherence are partially gone.

Maybe you don't care much: if you lose 0.0001% of comment up/down votes, does it matter? If you lose 0.0001% of bank transactions, it does.

Maybe you have no choice: Facebook Messenger isn't going to fit in any traditional relational DB.

NoSQL

You may need multiple data storage technologies.

Some data may fit well in Cassandra, but PostgreSQL is best for other stuff you store. The choice isn't always easy, and you'll have to rewrite a lot of code to migrate.

Except Mongo. Never Mongo.

Yes SQL

But you're not writing Facebook Messenger. When you have zero customers, guessing what data manipulations you'll need with a thousand/​million/​billion isn't possible.

Twitter used MySQL until at least 2012. Amazon Aurora can scale a lot. NewSQL databases are appearing, like CockroachDB.

Yes SQL

My advice: Use a relational database. Normalize your data. Use the ORM. Get a product working.

If one day it won't scale anymore, then worry about optimizing what you actually do.

Yes SQL

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. Donald Knuth, 1974