If we're building dynamic pages, we're going to need to assemble the data for them quickly: retrieve the specific record/records needed. (Or write them when saving data.)
Good thing somebody invented databases.
In particular, we generally need OLTP (Online Transaction Processing) workloads, not OLAP (Online Analytical Processing): many small queries/updates, not big batch analysis.
Of course, relational databases are good at fetching and storing data quickly.
Don't forget that stuff you learned (or somebody tried to teach you) about them. In particular…
If you have the same fact stored in multiple places, you're using extra storage, and it's going to be a pain to update later. You're almost certainly going to update n-1 of the copies and have consistency problems.
Keep your data normalized and join when you need to (until proven otherwise).
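As a minimal sketch of what that looks like in practice (using SQLite, with table and column names invented for illustration): store each department's name exactly once, and reference it by key.

import sqlite3

conn = sqlite3.connect(':memory:')
# Normalized: each department name is stored once and referenced by key,
# so renaming a department is a single-row update.
conn.executescript('''
    CREATE TABLE departments (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        fname TEXT,
        lname TEXT,
        department_id INTEGER REFERENCES departments(id)
    );
''')
# When you need the department name with each person, join:
rows = conn.execute('''
    SELECT p.fname, p.lname, d.name
    FROM people p JOIN departments d ON p.department_id = d.id
''').fetchall()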
Keys and indexes.
Having the right key/index is the difference between O(n) and ≤ O(log n) time for lookup.
You won't notice when testing, but will as your data grows.
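A quick way to see the difference (a sketch using SQLite; the table and index names are invented): ask the database how it plans to execute a lookup, before and after adding an index.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)')

# No index on email: the plan reports a full table scan ("SCAN users"), O(n).
print(conn.execute(
    'EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?',
    ('greg@example.com',)).fetchall())

conn.execute('CREATE INDEX users_email ON users (email)')

# With the index: a B-tree search ("SEARCH users USING INDEX users_email"),
# O(log n).
print(conn.execute(
    'EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?',
    ('greg@example.com',)).fetchall())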
Each query has some overhead, and a join can often replace many queries with one.
It's easy to end up with logic in your application like this:
orders = query('SELECT … FROM orders WHERE …')
for o in orders:
    client = query('''SELECT … FROM clients
                      WHERE id={}''', o.client_id)
    # do something with the order and client…
This requires n+1 queries (one for the orders, then one per order), which is expensive: 1000 orders means 1001 round-trips to the database.
… but it's usually not that obvious: the loop and query can be hidden by an ORM and in an HTML template.
orders = query('''SELECT o.…, c.…
    FROM orders o JOIN clients c ON o.client_id = c.id
    WHERE …''')
for o in orders:
    # do something with the order and client…

This requires one query, and will be much faster.
Again, you won't notice in testing with 3 records.
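To make the difference concrete, here's a minimal runnable sketch of both patterns (using SQLite and an invented two-table schema; your production DB and schema will differ):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         client_id INTEGER REFERENCES clients(id));
    INSERT INTO clients VALUES (1, 'ACME');
    INSERT INTO orders VALUES (100, 1);
''')

# n+1 pattern: one query for the orders, then one more per order.
# (The ? placeholder also avoids the injection risk of building SQL
# with string formatting.)
orders = conn.execute('SELECT id, client_id FROM orders').fetchall()
for order_id, client_id in orders:
    client = conn.execute('SELECT name FROM clients WHERE id = ?',
                          (client_id,)).fetchone()
    print(order_id, client[0])

# Join pattern: one query total, no matter how many orders there are.
for order_id, client_name in conn.execute('''
        SELECT o.id, c.name
        FROM orders o JOIN clients c ON o.client_id = c.id'''):
    print(order_id, client_name)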
You should know how to work with relational databases: SQL + a programming language.
But maybe you shouldn't actually do it yourself: you probably won't write the SQL by hand, and will instead convince an ORM to behave well. (More later.)
There are many relational database options.
There are several you can pay a lot for: IBM DB2, Oracle, Teradata, MemSQL. Let's leave those aside for now and not pay for something if the free ones will do the job.
For doing basic storage for a web system, one of the open source relational databases is probably the default choice. The choices are likely MySQL or PostgreSQL.
Choosing between MySQL and PostgreSQL seems to be largely a matter of taste.
Both are good SQL databases. Both are well tested and currently storing trillions of rows of data out there without losing them.
PostgreSQL has some extra features: JSON data type, GIS data, alter tables atomically.
MySQL is more commonly used and possibly faster.
Your choice.
The other option to store your data: let somebody else do it.
There are several cloud services that do relational database things, and the cloud provider takes care of the rest.
Notable examples: Amazon RDS, Amazon Aurora, SQL Azure, Google Cloud SQL.
The good: they worry about the administration of the servers, backups, etc.
But you pay for the convenience.
(More about cloud services later.)
You need a “real” database server for a production web site.
But for development, it can be a pain: setting up PostgreSQL just so a developer can experiment with some data probably isn't worth it.
SQLite is an embedded database: it's a library that manages a file on disk so it looks to you like a full relational database.
Gives you all of the functionality you need to work with relational data, with none of the administration.
It's not going to replace MySQL/Postgres for workloads with high volume or concurrency, but that's not its job.
Also useful for storing data that's highly structured to you, but just looks like a file to users.
See also Derby: a similar tool in the Java ecosystem.
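Python ships an sqlite3 module in its standard library, so using it is one import away. A minimal sketch (the filename is just an example):

import sqlite3

# The whole "database server" is just this one file on disk.
conn = sqlite3.connect('dev_data.db')
conn.execute(
    'CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)')
conn.execute('INSERT INTO notes (body) VALUES (?)', ('hello',))
conn.commit()
print(conn.execute('SELECT id, body FROM notes').fetchall())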
Actually writing SQL is boring and repetitive. You don't end up writing left semi joins with subqueries very often: most of what you end up doing is SELECT … JOIN … WHERE.
You're going to have tables like:
id | fname | lname | department_id
---|-------|-------|--------------
1  | Greg  | Baker | 7
And your program does the equivalent of:
p = Person(1)
p.fname = 'Greg'
p.lname = 'Baker'
p.dept = Department(…)
p.save()
… and the opposite mapping when loading from the DB.
And you'd have to do it a lot: for every table/class. Maybe for every type of query. Yawn.
An object-relational mapper automatically connects your tables and objects in the obvious way.
You end up writing code something like this:
p = Person()
p.fname = 'Greg'
p.lname = 'Baker'
p.dept = Department.get(id=7)
p.save()

people = Person.findall(fname='Greg')
All of your DB manipulation with no SQL.
It's not completely free. If you're stupid with the ORM, it will produce stupid queries. Try to keep in mind how what you write in the ORM will turn into SQL.
You still need indexes. You still need to keep the number of queries under control.
The n+1 queries problem is easy to cause with an ORM.
<% @people.each do |person| %>
  <li><%= person.lname %>: <%= person.department.name %></li>
<% end %>
Is every iteration of the loop going to have to query for the person.department object? It depends how you fetched the people collection with the ORM. e.g. in Rails:
people = People.all # n+1 queries about to happen
people = People.includes(:department).all # 1 query
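The same idea in Django's ORM, as a sketch (assuming Person and Department models with the obvious foreign key):

people = Person.objects.all()
# n+1 queries: each person.department access hits the DB again.

people = Person.objects.select_related('department')
# 1 query: the Department rows are fetched up front with a JOIN.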
Many web development frameworks are organized as model–view–controller architectures.
Let's get it out of the way before the technology evaluation.
The model is about managing the data, and the logic that goes with manipulating it.
That's probably going to mean classes/objects that represent the data. That's probably going to mean an ORM.
The model should capture good class design (for the code) and good relational data design (for the DB).
The view layer generates output for the user that represents the information.
For us, that's usually going to mean generating HTML. The HTML will probably be generated by an HTML template.
You write templates like this (a Handlebars template):
<h1>{{title}}</h1>
<ul>
  {{#each directions}}
  <li>{{this.label}}</li>
  {{/each}}
</ul>
… give it some (ORM) data and generate HTML like:
<h1>All Directions</h1>
<ul>
  <li>Up</li>
  <li>Left</li>
  <li>Turnwise</li>
</ul>
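In Python-land, a Jinja2 template works the same way. A minimal sketch (the data values are invented; in a real app they'd come from the ORM):

from jinja2 import Template

template = Template('''
<h1>{{ title }}</h1>
<ul>
  {% for d in directions %}
  <li>{{ d.label }}</li>
  {% endfor %}
</ul>
''')

class Direction:
    # Stand-in for an ORM object with a .label attribute.
    def __init__(self, label):
        self.label = label

html = template.render(
    title='All Directions',
    directions=[Direction('Up'), Direction('Left'), Direction('Turnwise')])
print(html)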
The controller responds to user actions. For us, usually HTTP requests.
Requests must be mapped to logic: URL routing or dispatching. e.g. a request for the URL /schools/sfu/majors gets a response from a call like list_majors('sfu').
Then the logic must be written that does the work. Probably involves using the model to get/modify data, and the view to present it.
e.g. list_majors('sfu') has to look up some stuff with the ORM, build HTML from a template, and return something that specifies the HTTP response.
The controller is responsible for putting the other pieces together.
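As a sketch of what that looks like (using Flask; the majors.html template name and the hardcoded data are invented for illustration, where a real app would use the ORM):

from flask import Flask, render_template

app = Flask(__name__)

# Stand-in for the model layer; normally an ORM lookup.
MAJORS_BY_SCHOOL = {'sfu': ['Computing Science', 'Software Systems']}

# Routing: map the URL pattern to the logic.
@app.route('/schools/<school>/majors')
def list_majors(school):
    majors = MAJORS_BY_SCHOOL.get(school, [])
    # View: render a template (majors.html is assumed to exist);
    # Flask turns the returned HTML into the HTTP response.
    return render_template('majors.html', school=school, majors=majors)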
Different frameworks use these concepts and terms differently, but the concepts are still recognizable.
Some frameworks call themselves MVC, MVT, MVP, MVVM.
Imagine you're creating a large scale site with users around the world.
Requirements for the database: high data “velocity” (lots of reads/writes), high data “volume” (lots of data), and maybe too much of either for any single server.
Having servers in a single place probably isn't a good idea.
Making a request from the other side of the planet takes a round-trip of at least 134 ms even at the speed of light (plus router delay, plus actually building the response, plus then downloading CSS/JS/images).
We're going to want to get servers (including DB servers) as close to the user as possible.
A traditional relational database with ACID (Atomicity, Consistency, Isolation, Durability) guarantees starts to sound unrealistic.
SQL databases generally rely on locking to ensure ACID writes. Maintaining a lock across the planet is a ≥134 ms problem.
What if there's a network failure in between servers? Then what?
Joining tables also gets hard at high volume.
How do you do an inner-join on terabyte-sized tables partitioned across many servers?
Maybe we have to give up on ACID + relations at this scale.
Out of this thinking comes new non-relational databases, called NoSQL. These are all very different, and it's hard to say much in general, but…
General features of NoSQL data stores: no (or very limited) joins, no full ACID transactions, data replicated and partitioned across many servers, and tunable consistency.
Example: Cassandra. Querying with CQL, which is a lot like SQL:
SELECT name, occupation FROM users WHERE userid IN (199, 200, 207);
INSERT INTO users (name, occupation) VALUES ('Greg', 'Lecturer');
Missing: JOIN functionality; BEGIN TRANSACTION.
Added: options for how much replication you'd like in your data, and how much consistency you want to wait for.
CREATE KEYSPACE user_data
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CREATE KEYSPACE other_stuff
    WITH replication = {'class': 'NetworkTopologyStrategy', …};

CONSISTENCY quorum; -- at least half write before returning
INSERT INTO users …;
CONSISTENCY any; -- one writes before returning
INSERT INTO users …;
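From application code, the consistency level can be set per statement. A sketch using the DataStax Python driver (the cassandra-driver package), assuming a node on localhost and the user_data keyspace above:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])        # assumes a Cassandra node on localhost
session = cluster.connect('user_data')  # the keyspace created above

# Wait for a majority of replicas to acknowledge the write…
insert = SimpleStatement(
    "INSERT INTO users (userid, name, occupation) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(insert, (208, 'Greg', 'Lecturer'))

# …or return as soon as any one replica has accepted it.
fast_insert = SimpleStatement(
    "INSERT INTO users (userid, name, occupation) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ANY)
session.execute(fast_insert, (209, 'Greg', 'Lecturer'))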
Result: you can read and write from many distributed servers. You can scale really big.
But, you can no longer join, and have to rearrange your data accordingly. Your ORM won't work as you expect.
Worse: the ACID guarantees you rely on for data coherence are partially gone.
Maybe you don't care much: if you lose 0.0001% of comment up/down votes, does it matter? If you lose 0.0001% of bank transactions, it does.
Maybe you have no choice: Facebook Messenger isn't going to fit in any traditional relational DB.
You may need multiple data storage technologies.
Some data may fit well in Cassandra, but PostgreSQL is best for other stuff you store. The choice isn't always easy, and you'll have to rewrite a lot of code to migrate.
Except Mongo. Never Mongo.
But you're not writing Facebook Messenger. When you have zero customers, guessing what data manipulations you'll need with a thousand/million/billion customers isn't possible.
Twitter used MySQL until at least 2012. Amazon Aurora can scale a lot. NewSQL databases are appearing, like CockroachDB.
My advice: Use a relational database. Normalize your data. Use the ORM. Get a product working.
If one day it won't scale anymore, then worry about optimizing what you actually do.
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
(Donald Knuth, 1974)