Roughly, “framework” = really big library that you use for just about everything in your project.
Possible lessons from the technology evaluation:
Some of the problems we have seen “solved”:
Solving these problems (and thousands of others) again is likely a waste of time.
Being a productive programmer in the modern world requires you to be good at understanding existing solutions: knowing when to look for one, being able to read and understand the docs, being able to read the source when the docs are incomplete.
You don't get enough practice at that.
How do you choose a framework out of all of the options? Some questions to ask:
I wish I had a definitive answer.
A good framework should make development fast, and make the site fast. For us, fast means both throughput (many requests handled by one server) and latency (user doesn't wait long for a response).
Most obvious way to make things fast: don't do any more work than necessary to build a page.
First step: get our code running.
We need to start with some way for a server to run our code to generate a response to a request. We want as little overhead in it as possible.
There are standard APIs that let your code be called when a request comes in, and let it return a response.
Each of these gives some structure that your code must implement. The promise is that you can then respond to HTTP requests.
Generally have a form like “create a method Controller.respond(request) that takes a Request object and returns a Response object.”
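A hypothetical sketch of that shape in Python (the class names and the /hello route here are made up for illustration, not any real framework's API):

    # Hypothetical request/response API: just the shape that the standard
    # backend APIs share, not a real framework.
    class Request:
        def __init__(self, method, path, headers, body=b''):
            self.method = method
            self.path = path
            self.headers = headers
            self.body = body

    class Response:
        def __init__(self, status=200, headers=None, body=b''):
            self.status = status
            self.headers = headers or {}
            self.body = body

    class Controller:
        def respond(self, request):
            # examine the Request, build and return a Response
            if request.path == '/hello':
                return Response(200, {'Content-Type': 'text/plain'}, b'Hello!')
            return Response(404, {'Content-Type': 'text/plain'}, b'not found')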
Each of these has implementations that run your code, so a frontend web server (Nginx/Apache) can talk to them.
The architecture in these situations:
Communication between parts 2 and 3 (the frontend server and the backend application server) is by HTTP or a custom protocol/socket.
We have a bunch of parts working together:
It's also possible for your logic to run in an interpreter integrated into the HTTP server. Most common with PHP in Apache, but can be done with others.
Apache may take the role of a backend server here. There may be another frontend server (maybe also Apache) doing static files, etc.
Remember: whatever frontend framework/JavaScript stuff you have is just talking to your server by HTTP (or maybe WebSockets).
Many frontend frameworks obscure this by talking about two-way data binding and Distributed Data Protocol and MVVM architecture.
It's all just a web server, HTTP(/WebSockets), browser, and JavaScript.
We want to do as little work as possible to respond to a request. Less work = faster.
We keep the language interpreter/runtime environment alive so it doesn't have to be started when a request comes in.
There might be other work necessary to create a response that you don't want to do with every request. e.g. connecting to the database server; inspecting the database structure; reading (and parsing and validating) config files; …
The architecture of the backend server should let you separate this.
For example, Java Servlets have a very clear distinction:
    public class ServletExample extends HttpServlet {
        public void init() {
            ⋮  // connect to database; parse config file
        }
        protected void doGet(HttpServletRequest request, HttpServletResponse response) {
            ⋮  // use config; make DB queries; produce response
        }
        public void destroy() {
            ⋮  // shutdown DB connection
        }
    }
doGet may be called many times on one instance.
Python's WSGI has a similar but less explicit structure:
    from wsgiref.simple_server import make_server  # standard library reference WSGI server

    db_connection = …
    config = …

    def application(environ, start_response):
        ⋮
        start_response('200 OK', response_headers)
        return iterable_of_content_chunks

    httpd = make_server('localhost', 8080, application)
    httpd.serve_forever()
There may be other work we can avoid when producing a response.
All of it if we can cache content in the user's browser and not get the request.
Most of it if we can send a 304 Not Modified.
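A minimal sketch of the 304 logic in WSGI terms (the ETag value and expensive_page_build are made up for illustration):

    def application(environ, start_response):
        etag = '"v42"'  # hypothetical: some cheap fingerprint of the current content

        # If the browser's cached copy matches, skip the expensive work entirely.
        if environ.get('HTTP_IF_NONE_MATCH') == etag:
            start_response('304 Not Modified', [('ETag', etag)])
            return [b'']

        body = expensive_page_build()  # hypothetical: the real work
        start_response('200 OK', [('Content-Type', 'text/html'), ('ETag', etag)])
        return [body]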
Let's look for other places to avoid work…
Also known as web accelerators, but that feels ambiguous.
The goal: use the concept of a caching proxy server, but make it much more effective…
Even with a lot of optimization work, web apps can be slow.
Remember the picture of a caching proxy?
A reverse proxy has a similar role, but lives in your network:
The goal here is to do some HTTP caching, but to have all clients request through the cache.
This is achieved by naming the proxy cache example.com and making it behave like any other HTTP server to the outside world. The user has no idea what is happening behind the proxy.
The reverse proxy can cache content for all users. That's a huge difference in load.
e.g. a Wikipedia page. Could set an expiry time 60 seconds in the future. Reverse proxy can cache it: backend only has to (do the expensive work to) generate the page once per minute.
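Setting that expiry from the application side is just a response header. e.g. a sketch in Django (render_article is a made-up helper):

    from django.http import HttpResponse
    from django.views.decorators.cache import cache_control

    @cache_control(public=True, max_age=60)   # sends "Cache-Control: public, max-age=60"
    def article(request, title):
        # the expensive work happens here, at most about once per minute
        return HttpResponse(render_article(title))  # hypothetical helper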
There are several tools that implement this kind of reverse-proxy caching: Varnish; Squid; Nginx + memcached.
Likely will be merged with the frontend server: most sites would have two things doing reverse proxy + external HTTP + static/dynamic decisions + application server calls + backend page generation.
Since the cache is now within your network, stale cached data can be fixed. The proxy can allow a request from internal servers to prune cached data.
e.g. for Varnish:
    PURGE /updated_page HTTP/1.0
    Host: example.com
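From the application's side, sending that purge can be a plain HTTP request. A sketch with Python's standard library (the internal proxy hostname is made up, and Varnish must be configured to accept PURGE from internal addresses):

    import http.client

    def purge(path):
        # Talk to the reverse proxy directly (hypothetical internal hostname),
        # but send the public Host header so it knows which cache entry to drop.
        conn = http.client.HTTPConnection('varnish.internal', 80)
        conn.request('PURGE', path, headers={'Host': 'example.com'})
        status = conn.getresponse().status
        conn.close()
        return status

    purge('/updated_page')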
Keeping consistent data in your cache isn't easy.
Option 1: Keep the cache times short and don't worry about it.
Option 2: Longer cache times, but purge from the cache whenever something would change a page.
“There are only two hard things in Computer Science: cache invalidation and naming things.” (Phil Karlton)
There now will be many users hitting the same cache.
Good for performance, but what if users should see different content?
e.g. different Accept-Language headers; different logged-in users.
HTTP gives us the Vary header, which the proxies will honour. e.g. if we have different translations:

    Vary: accept-language
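Your framework can probably add that header for you. e.g. a sketch in Django:

    from django.http import HttpResponse
    from django.utils.translation import gettext as _
    from django.views.decorators.vary import vary_on_headers

    @vary_on_headers('Accept-Language')   # adds "Vary: Accept-Language" to the response
    def greeting(request):
        # caches will now keep one copy of this response per language
        return HttpResponse(_('Hello'))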
But varying by the user's cookie removes most of the benefits of the reverse proxy:

    Vary: accept-language, cookie
But again, we have more control over the cache when it's on our side.
e.g. Varnish has a configuration language to express how headers should be interpreted/cached.
For truly dynamic content (i.e. every person sees a different page), a reverse proxy won't help much.
You can still use caching as part of your page-building logic: use a fast in-memory cache to store data so you can retrieve it quickly later.
Likely candidates: Memcached, Redis.
Each of these gives you somewhere that you can quickly store and retrieve data by key.
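A sketch with the redis-py client, assuming a Redis server on localhost (the key and value are made up):

    import redis

    r = redis.Redis(host='localhost', port=6379)

    # store a value under a key, expiring after 120 seconds
    r.set('front-page-stats', b'{"users": 1234}', ex=120)

    # ...later, possibly from another process or server...
    cached = r.get('front-page-stats')   # None if it expired or was never set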
As usual, your framework can probably help.
Some common use-cases for caching in a web app…
Can cache a fragment of a rendered template, so it doesn't have to be built often. e.g. in Rails:

    <% @products.each do |product| %>
      <% cache product do %>
        <%= render product %>
      <% end %>
    <% end %>
Can cache a controller/view response by arguments. e.g. in Express:

    var cache = require('apicache').middleware;

    app.get('/api/collection/:id?', cache('5 minutes'), function(req, res) {
      …
    });
Or in Django:
    from django.views.decorators.cache import cache_page

    @cache_page(120)
    def view_collection(request, id):
        …
Or you will probably find a lower-level API that you can use as a programmer. e.g. in Django:

    from django.core.cache import cache

    def complicated_result(arg):
        key = 'complicated_result-' + str(arg)
        cached_result = cache.get(key)
        if cached_result is not None:
            return cached_result
        else:
            result = _build_complicated_result(arg)
            cache.set(key, result, 120)   # cache for 120 seconds
            return result

    def _build_complicated_result(arg):
        ⋮  # do the expensive calculations
Again, having consistent data can be tricky.
All caches are effectively storing denormalized data: you need to give some thought to how to update every copy of a fact when necessary.
… or simply accept that the cache will sometimes be a little behind.
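Continuing the Django cache example above: one option is to explicitly drop the affected keys when the authoritative data changes (a sketch; save_profile and the key scheme are made up):

    from django.core.cache import cache

    def update_profile(user, new_data):
        # write the new data to the database: the authoritative copy
        save_profile(user, new_data)          # hypothetical helper

        # then invalidate every cached copy we know is derived from it
        cache.delete('complicated_result-' + str(user.id))
        cache.delete('profile-page-' + str(user.id))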
Look for an opportunity to cache in your logic if you have:
After all that, we imagine a production site has a setup something like this:
This is a three-tier architecture.
We imagine each would be a separate computer with different specs. In our project, we're doing it all in one computer (VM or Docker host) for simplicity, but all of the ideas are there.
All of the caching, optimization, and clever programming we imagine will eventually run out: the servers will eventually be unable to keep up with the load.
The solution: replicate the roles.
We add a load balancer that can do the (easy) job of passing requests back and forth. Could also be the reverse proxy.
The frontend and backend servers aren't storing anything, so it's trivial to have many identical copies. State is stored only in the DB.
The replicated roles also give us redundancy.
If one web server fails (or needs to be rebooted or…) then the site won't go down, as long as the load balancer adapts.
For stateless workers, multiple servers in one tier is easy.
We don't have a way to replicate the load balancer (yet).
For database servers, things are trickier.
Multiple relational database servers coordinating and ensuring ACID is hard.
Can have load balancing across read-only replicas (if you have a high read-to-write ratio).
Can have failover to a replica (for high availability).
For a truly distributed database, you probably need to look for a NoSQL database, and choose which of the ACID/CAP things you're willing to give up.
Or to handle extra load, shard the data manually across many servers.
Honest advice: don't do this until you really need to, and understand the bottlenecks.
It's easy to complicate the three-tier picture further and have servers for many things like…
The lesson: you can have as many servers as you can imagine. Complexity increases accordingly.
These could all be separate physical servers or VMs, or run on the same VM. Depends on your needs.
As complexity of your infrastructure increases, it can become more difficult to coordinate the necessary work.
e.g. need to invalidate n cached pages because a user changed their profile; need to log many things to log server; need to update full-text index but that's slow; …
Some of these should be synchronous (i.e. completed before response is sent) but some can be asynchronous (i.e. done soon, but don't make the user wait).
Need to send messages between processes/servers? Consider a message broker.
The idea: some server has a message that must be processed: message broker receives it and ensures delivery. Another server listens for messages and handles them.
Notable examples: RabbitMQ, ZeroMQ, Kafka, Amazon SQS.
These let you easily (and robustly) queue up work, and be sure it will happen eventually.
Useful for not-quite-synchronous tasks: logging user actions, updating full-text search, sending email.
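A sketch with RabbitMQ and the pika client (the queue name, message, and send_the_email are made up):

    import pika

    # Producer (in the web request): queue up some work and return immediately.
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='emails', durable=True)
    channel.basic_publish(exchange='', routing_key='emails',
                          body=b'{"to": "user@example.com", "template": "welcome"}')
    connection.close()

    # Consumer (a separate worker process): handle messages as they arrive.
    def handle(ch, method, properties, body):
        send_the_email(body)                  # hypothetical helper
        ch.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='emails', durable=True)
    channel.basic_consume(queue='emails', on_message_callback=handle)
    channel.start_consuming()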
You can also use a task queue. Often built on a message broker, but gives you higher-level semantics.
Basically: you make a (modified) function call. A worker does it and sends the result back. You can wait for the result, or not.
e.g. Celery (Python), beanstalkd (Ruby), Kue (JavaScript).
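e.g. a sketch with Celery (the broker URL and the task are made up):

    from celery import Celery

    app = Celery('tasks', broker='redis://localhost:6379/0')

    @app.task
    def update_search_index(page_id):
        # runs in a worker process, not in the web request
        ...

    # In the web request: enqueue the work and respond without waiting,
    # or keep the returned AsyncResult and call .get() on it (with a result
    # backend configured) if you do need the answer.
    update_search_index.delay(42)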
As your setup gets complicated, managing it can become more and more difficult. When replicating services, it's critical that they be configured identically.
Configuration management becomes more important for both consistency and documenting the setup.
With configuration management (Chef or Docker or similar), it should be easy to…
We also have decisions to make about where each process lives.
e.g. are load balancing and reverse-proxy caching happening on the same hardware or not? Either is reasonable, depending on the load/usage. What if you want to move one?
This leads to containers and Docker.
The idea: run your services in isolated lightweight containers. They are “smaller” than VMs, but more isolated than processes. Each container runs one “thing”.
e.g. a container for Nginx; another for Unicorn/Rails; another for PostgreSQL.
Containers communicate only over network ports.
That prevents any shared state between them: all interactions should be well understood, which is a good idea. That lets you move them around without too much pain.
Docker Swarm, Kubernetes, and Nomad can orchestrate containers, deploying/restarting/moving them as necessary.
But…
You still need to run the containers somewhere: you need a VM or server that's configured to run them. Then configure the containers.
The good architecture of “only communicate by network” can be done without containers, if you're a little disciplined.
Actually buying, powering, and replacing physical servers is even more of a pain. If you become more popular, you have to wait for hardware to scale up.
Solution: let somebody else buy and operate computers.
“There is no cloud. It's just someone else's computer.” (Chris Watterston)
There are a few ways cloud services can be bought.
The usual categories…
Software as a Service (SaaS): an application is provided.
e.g. Gmail for email, GitHub for source control.
Platform as a Service (PaaS): the provided platform can execute your logic, as long as you follow their rules.
e.g. Google App Engine, AWS Lambda, Microsoft Azure App Service.
Using PaaS is roughly: you write a function (matching the prototype they specify), and connect it to an event (like an HTTP request to a particular path).
They make sure it runs somewhere. You accept the limitations of their environment and pay per call/CPU second.
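For example, an AWS Lambda function behind an HTTP endpoint looks roughly like this (this sketch assumes the API Gateway proxy response format; details vary by provider):

    def lambda_handler(event, context):
        # 'event' describes the HTTP request; we never see a server or a process.
        name = (event.get('queryStringParameters') or {}).get('name', 'world')
        return {
            'statusCode': 200,
            'headers': {'Content-Type': 'text/plain'},
            'body': 'Hello, ' + name,
        }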
But it's easy to lock yourself into their environment.
Infrastructure as a Service (IaaS): you get VMs (or maybe containers) and configure them as you like. Provision/destroy VMs as needed.
e.g. Amazon EC2, Google Compute Engine, Digital Ocean.
IaaS is probably what our project setup is simulating (at least the Vagrant/Chef option).
You still need to think about the VMs being ephemeral: the underlying hardware may fail. (The cloud doesn't prevent reality from existing.) Reliable storage is probably best done elsewhere.
But…
The planet is too big. Or the speed of light is too slow.
(circumference of Earth) / (speed of light) ≈ 40,075 km / 299,792 km/s ≈ 134 ms
A request/response around the planet will take at least that long. Intercontinental bandwidth isn't all you might hope for either.
That means that if you're serving a global audience, having server(s) in one location is likely a problem.
This isn't too hard to deal with for static content.
A content delivery network is basically a worldwide network of caching proxy servers. Users can contact the closest server and get the static resources from there.
e.g. Amazon CloudFront, Google Cloud CDN, Akamai.
Special purpose: CDNjs, Google Hosted Libraries.
The only trick needed is a slightly-smarter DNS server. It must give the user the IP address of the server that's “best” for them.
Distributing dynamic content is going to be harder.
But most of the bytes on most pages are going to be static anyway.
For better or worse, I drew this: