HTTP and Servers

HTTP is the protocol used to transmit data around the WWW. HyperText Transfer Protocol, if you like acronyms.

X is on the web == X can be fetched by HTTP.

HTTP and Servers

HTTP is a relatively simple protocol, but it does things that many web developers don't know about. Knowing what the protocol can do will help you make better web systems.

The HTTP Request

Let's look at a simple HTTP request: the user has clicked on a link to http://www.sfu.ca/~ggbaker/test.html.

The user agent (client/browser) initiates the conversation by making a TCP connection to www.sfu.ca on port 80 (the default)…

The HTTP Request

and it sends a request like this:

GET /~ggbaker/test.html HTTP/1.1
Host: www.sfu.ca
User-Agent: Mozilla/5.0
Accept: text/html,application/xhtml+xml,*/*;q=0.8
Accept-Language: en
Accept-Encoding: gzip, deflate, br
Connection: keep-alive

…followed by a blank line.

The HTTP Request

The first line is the request line.

GET /~ggbaker/test.html HTTP/1.1

GET: the request method. (More later.)
/~ggbaker/test.html: the path of the URL (could include query string).
HTTP/1.1: the HTTP version.

The HTTP Request

The following lines are request headers. Each has a field name and value. e.g.

Host: www.sfu.ca
User-Agent: Mozilla/5.0

The HTTP Request

The Host header is required.

This allows for name-based virtual hosting. i.e. multiple sites on one server and IP address.

Other headers can give info about the browser or user.
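A rough sketch of how name-based virtual hosting can use the Host header. The SITES mapping, the directory paths, and the document_root function are all made up for illustration; real servers do this in their configuration.

```python
# Hypothetical mapping of host names to site directories on one server.
SITES = {
    'www.sfu.ca': '/var/www/sfu',
    'example.com': '/var/www/example',
}

def document_root(headers: dict) -> str:
    """Pick the site to serve based on the request's Host header."""
    # Drop any ":port" suffix and ignore case, since host names are case-insensitive.
    host = headers.get('Host', '').split(':')[0].lower()
    return SITES.get(host, '/var/www/default')

print(document_root({'Host': 'www.sfu.ca'}))
print(document_root({'Host': 'example.com:80'}))
```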

The HTTP Request

The blank line indicates the end of the request.

… or start of the request body if there is one.

Once the request is done, the server sends a response…
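To make the structure concrete, here is a minimal sketch that splits a raw request into its parts. The parse_request function is hypothetical, and it assumes a well-formed request with CRLF line endings; real servers handle many more edge cases.

```python
def parse_request(raw: bytes):
    """Split a raw HTTP/1.1 request into request line, headers, and body.

    Illustration only: assumes a well-formed request.
    """
    # The blank line (CRLF CRLF) separates the request line and headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    # Decoding as ISO-8859-1 is an assumption that never fails on arbitrary bytes.
    lines = head.decode("iso-8859-1").split("\r\n")

    # Request line: method, path (possibly with query string), HTTP version.
    method, path, version = lines[0].split(" ", 2)

    # Remaining lines are "Field-Name: value" headers; names are case-insensitive.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers, body


raw = (
    b"GET /~ggbaker/test.html HTTP/1.1\r\n"
    b"Host: www.sfu.ca\r\n"
    b"User-Agent: Mozilla/5.0\r\n"
    b"\r\n"
)
method, path, version, headers, body = parse_request(raw)
print(method, path, version)
print(headers["host"])
```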

Request Methods

The request method indicates the overall “action”.

GET is used any time you click a link or type a URL.

Request Methods

When an HTML <form> is submitted, it can be either:

The GET method, with form contents encoded in the URL query string: http://…/foo/bar?name=Greg&cost=6
The POST method, with form contents encoded in the request body (after that blank line).
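A small sketch of the two options, using the standard library's urlencode. The host example.com is an assumption, since the URL above is elided; the form fields are the ones from the example.

```python
from urllib.parse import urlencode, parse_qs

form = {"name": "Greg", "cost": "6"}
encoded = urlencode(form)

# GET: the encoded form goes in the URL query string.
get_request = f"GET /foo/bar?{encoded} HTTP/1.1\r\nHost: example.com\r\n\r\n"

# POST: the same encoding goes in the request body, after the blank line.
post_request = (
    "POST /foo/bar HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    f"Content-Length: {len(encoded)}\r\n"
    "\r\n"
    f"{encoded}"
)

print(get_request)
print(post_request)
print(parse_qs(encoded))  # what the server side would decode
```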

Request Methods

Practical differences:

Passwords shouldn't be over-the-shoulder visible in the URL, so forms that send them should use POST.
Can't bookmark a POST request, since not everything needed for the request is in the URL.

Request Methods

Semantic differences:

GET requests should be safe. That is, have no (significant) side effects. They can be safely reloaded.
If you're changing something (≈ writing to the database), it should be a POST.

It's common to use the Post/Redirect/Get pattern to avoid users accidentally reloading POST pages.

Request Methods

We will see other HTTP request methods later when we talk about REST.

The HTTP Response

The server's response will look something like this:

HTTP/1.1 200 OK
Server: nginx/1.16.0
Content-type: text/html; charset=utf-8

<!DOCTYPE html>
<html><head>…

The HTTP Response

The first line is the status line.

HTTP/1.1 200 OK

HTTP/1.1: the response HTTP version
200: the status code indicating success/failure of the request. (More later.)
OK: the reason phrase, a human-readable version of the status code.

The HTTP Response

The following lines are response headers. Same format as the request headers. e.g.

Server: nginx/1.16.0
Content-type: text/html; charset=utf-8

Last line there says that the content is HTML (Internet Media Type text/html) with characters encoded as UTF-8.

The HTTP Response

Again, a blank line to separate the headers from the response body.

For a 200 response, the body contains the actual contents of the resource that was requested. It can be empty for some responses.

Aside: conversation format

The HTTP message format: 1 special line (request/status line), n field/value headers, blank line, message body.

It's not unique to HTTP. It's borrowed from RFC 822, email. Email has headers like To and Subject, but the format is the same.

Sometimes the terminology leaks and the body is called the message body.
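One way to see the shared format: after removing the status line, Python's email parser can read the rest of an HTTP response. This is only a sketch for illustration; the response text is a shortened version of the earlier example.

```python
from email.parser import Parser

response_text = (
    "HTTP/1.1 200 OK\r\n"
    "Server: nginx/1.16.0\r\n"
    "Content-type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<!DOCTYPE html>\n<html><head>…"
)

# Peel off the one HTTP-specific line (the status line)...
status_line, _, rest = response_text.partition("\r\n")
version, code, reason = status_line.split(" ", 2)

# ...and what remains is headers + blank line + body: the email format.
message = Parser().parsestr(rest)

print(code, reason)
print(message["Content-Type"])        # header lookup is case-insensitive
print(message.get_content_charset())  # taken from the Content-type value
```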

HTTP/2

All of those examples were HTTP/1.1. The current versions are HTTP/2 and, more recently, HTTP/3.

Conceptually, everything is the same in HTTP/2: methods, status codes, headers, URLs. All of the same ideas as HTTP/1.1.

HTTP/2

That info is encoded in a more efficient binary format. Fewer bytes; easier parsing; not human-readable.

A remarkably minimal change from HTTP/1.0 in 1996.

As a developer, you shouldn't have to worry much: let the tools translate your method calls into the right encoding.

Status Codes

The status code in the response indicates what kind of response is being sent.

The first digit indicates the overall type.

1xx is informational

Status Codes

2xx codes indicate success.

200 OK: Everything is fine; contents of resource follow in message body.
206 Partial Content: User agent requested only part of the resource (with Range header); that part is being sent.

Status Codes

3xx codes are redirection, but really more like things the user doesn't have to worry about.

304 Not Modified: Can be returned if the browser's cached copy is current, if the request was sent with an If-Modified-Since or If-None-Match header. These tell the server the age of the cached copy and which specific version is cached.
No body is returned.
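A simplified sketch of the If-None-Match case. The conditional_response function and the way the ETag is computed are assumptions for illustration; a real header can also list several ETags, which this ignores.

```python
import hashlib

def conditional_response(content: bytes, request_headers: dict):
    """Return (status, headers, body), using 304 when the cached copy is current."""
    # An ETag identifies this specific version of the content (here: a hash of it).
    etag = '"' + hashlib.sha256(content).hexdigest()[:16] + '"'
    if request_headers.get('If-None-Match') == etag:
        # The client already has this exact version: send no body.
        return '304 Not Modified', [('ETag', etag)], b''
    return '200 OK', [('ETag', etag)], content


page = b'<!DOCTYPE html><title>Test</title>'

# First request: nothing cached, so the full content is sent with its ETag.
status, headers, body = conditional_response(page, {})
etag = dict(headers)['ETag']

# Later request: the browser sends back the ETag of its cached copy.
status2, headers2, body2 = conditional_response(page, {'If-None-Match': etag})
print(status, len(body))
print(status2, len(body2))
```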

Status Codes

301 Moved Permanently
302 Found
303 See Other
307 Temporary Redirect: Various flavours of redirect. All come with a Location header with the new URL. User is taken there automatically.
The difference is semantic: how can bookmarks, search indexes, etc adapt?

Status Codes

4xx codes are client errors: things the user (or user agent) did wrong.

403 Forbidden: You don't have permission to do (view?) that.
404 Not Found: The server can't find the requested resource. Should only be seen after a typo: old URLs should return one of 301, 303, 410 forever.
410 Gone: The resource isn't available anymore. The informative version of 404.

Status Codes

5xx codes are server errors: things that are problems on the server which the user can't hope to fix.

500 Internal Server Error: Something bad happened and it wasn't the client's fault. Often an exception was thrown.
502 Bad Gateway
504 Gateway Timeout: The server is proxying a backend server, and it isn't responding properly.
503 Service Unavailable: Temporary outage/overload.
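The first-digit rule can be sketched with the standard library's table of status codes. The STATUS_CLASSES names and the describe function are just for illustration.

```python
from http import HTTPStatus

# First digit of the status code -> overall type of response.
STATUS_CLASSES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}

def describe(code: int) -> str:
    """Return the code, its standard reason phrase, and its overall type."""
    status = HTTPStatus(code)  # raises ValueError for codes it doesn't know
    return f'{code} {status.phrase} ({STATUS_CLASSES[code // 100]})'

for code in (200, 206, 301, 304, 404, 410, 500, 503):
    print(describe(code))
```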

Web Servers

The basic job of a web server is the same as any other server in a client/server conversation: when requests come in, send a response.

A web server does it by speaking HTTP.

Web Servers

There are many HTTP servers to choose from. Some of the most popular:

Apache: open source, the default choice for many years.
nginx: open source, fast. Probably the new default.
Microsoft IIS.
Application servers that run backend code behind-the-scenes. (More later.)

Web Servers

In some ways, the job of a web server is more complicated than that of other servers like SSH or FTP.

Web content can come from many different sources. The server must be configured to do the right thing for each request.

Web Servers

One possibility: the requested content is in a file on disk. This is static content.

That's what you're creating for exercises 1 and 2.

Web Servers

Content can also be generated when it is requested by a user agent: dynamic content. This allows content that changes frequently, or is different for each user, or ….

Whether content is static or dynamic is (possibly) invisible to the user.

Web Servers

The server must know what to do for each request.

How to configure that depends on the server (and programming language/framework). It can be more painful than you'd expect.

Web Servers

Remember that even for a site with lots of dynamic content, there will still be many static resources (stylesheets, images, JavaScript, etc).

It's easy to spend hours optimizing your backend dynamic code, but then be slowed down by 3 MB of CSS + JS + images.

Web Servers

Another possibility: the server that gets the request from the user doesn't do the work, it forwards the request to another (application) server behind it, and then passes along the response.

It's acting as a gateway or reverse proxy or load balancer or frontend server. More later.

Server-Side Programming

or backend programming. [vs JavaScript/client-side/frontend]

There are many tools for server-side programming. Too many to talk about now. (Rails, Django, Snap, Play, Express, …)

Choice of server-side framework/tools will be a big part of the technology evaluation, and the course.

Server-Side Programming

There are many ways to connect your logic to a web server. Some are language-specific, some are intended for development only. Most of the time, your framework/tools will make the decision for you.

But let's see a quick example using WSGI, the common method for Python code to interface with a web server.

Server-Side Programming

A minimal WSGI application:

def application(environ, start_response):
    status = '200 OK'
    response_headers = [
        ('Content-Type', 'text/plain; charset=utf-8'),
    ]
    start_response(status, response_headers)
    content = 'This is content. \U0001F600'
    return [content.encode('utf-8')]

The web server is given this function, and can call it to generate content as needed.

Server-Side Programming

Input to server-side programs comes from a request (form submission, URL contents, etc). Output is an HTTP response.

Other than that, it's all just programming: do some computation, make some DB queries, write a file, whatever.
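A sketch of that input/output shape as a WSGI application that reads the URL's query string. The name parameter and the greeting are invented for illustration; the last few lines call the function directly, roughly as a WSGI server would.

```python
from urllib.parse import parse_qs
from wsgiref.util import setup_testing_defaults

def application(environ, start_response):
    # Input: the query string from the request URL, e.g. ?name=Greg
    query = parse_qs(environ.get('QUERY_STRING', ''))
    name = query.get('name', ['world'])[0]

    # "Just programming": compute whatever the response should contain.
    content = f'Hello, {name}!'

    # Output: status, headers, and body of the HTTP response.
    start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8')])
    return [content.encode('utf-8')]


# Call the application directly, with a fake request environment.
environ = {'QUERY_STRING': 'name=Greg'}
setup_testing_defaults(environ)  # fills in the other required WSGI keys

captured = {}
def start_response(status, headers):
    captured['status'] = status
    captured['headers'] = headers

body = b''.join(application(environ, start_response))
print(captured['status'], body)
```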

Server-Side Programming

There are many common problems to solve: how to get data to and from a database, how to connect URLs to logic that generates a response, how to present data in HTML, ….

This leads to server-side development frameworks to help with common tasks.

Server-Side Programming

e.g. Rails, Django, Snap, Play, Express.

These frameworks (and client-side frameworks and other tools) will be the subject of the technology evaluation.

But they are all fundamentally ways to more easily/flexibly/robustly create an HTTP response, like the WSGI application a few slides ago.

Character Sets/Encodings

It's easy to say your HTML is sent by HTTP, but that's not quite right. Networks don't send characters, they send bits. Characters must be encoded as bits for transport.

The source and destination have to agree on how to convert characters ↔ bits. If not, characters will be displayed wrong.

It's easy to get wrong if you're not paying attention.

Character Sets/Encodings

A character set is a numbered list of characters. Here are a few we might number:

Number	Character
65	A
97	a
8838	⊆
63935	樂
128169	💩

We need to number everything we want to treat as a character.

Character Sets/Encodings

In the early days, it was easy: ASCII encodes 95 characters (numbered 32–126, plus some control characters).

Store each one as a byte, and it's good enough for English text.

But there are other languages too…

Character Sets/Encodings

ASCII characters are numbered <128, so we still have half the byte left. That's enough for Western Europe, or Greek, or Hebrew, but only one at a time.
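A quick illustration of "only one at a time", using three of the ISO 8859 character sets as examples of these 8-bit extensions of ASCII. The choice of byte 0xE9 is arbitrary.

```python
byte = b'\xe9'  # one byte with a value above 127

# The same byte means a different character in each 8-bit character set.
western = byte.decode('iso-8859-1')  # Western European
greek = byte.decode('iso-8859-7')    # Greek
hebrew = byte.decode('iso-8859-8')   # Hebrew
print(western, greek, hebrew)

# Bytes below 128 are plain ASCII and decode identically in all three.
for encoding in ('iso-8859-1', 'iso-8859-7', 'iso-8859-8'):
    print(encoding, b'A'.decode(encoding))
```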

It turns out, people want to include several languages in a document. Also, Chinese and Arabic exist. And math. And the poo emoji.

Character Sets/Encodings

The Unicode character set is designed to handle all of the world's writing systems. It comes to about 145k characters or code points (with space for >1M).

ŤИאඌଧჶགᛒℕ∮⌘⽩가𐤇𝄞𝌩🁊🀗🍔😝🚕

Character Sets/Encodings

A character encoding is a way to convert the character numbers ↔ bits.

With Unicode, it's not obvious how to turn characters into bits. Three or four bytes for each character? Seems wasteful since all ASCII characters (HTML tags, CSS properties, etc) are numbered <128.

Character Sets/Encodings

The solution: UTF-8. It's a very clever way to encode Unicode characters efficiently. It's the only realistic choice for the modern web.

In particular, all ASCII characters are encoded just like ASCII.
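A short sketch of what that looks like in bytes, using some of the characters from the earlier table. The tag and text variables are just sample strings.

```python
# UTF-8 uses between one and four bytes per character.
for ch in ['A', 'a', '\u2286', '\U0001F4A9']:
    encoded = ch.encode('utf-8')
    print(f'U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded!r}')

# ASCII text produces exactly the same bytes in ASCII and in UTF-8.
tag = '<html>'
print(tag.encode('utf-8') == tag.encode('ascii'))

# The number of characters and the number of bytes can differ.
text = 'let\u2019s'
print(len(text), 'characters,', len(text.encode('utf-8')), 'bytes')
```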

Character Sets/Encodings

Always explicitly declare your character encoding to ensure the user agent gets it right. In HTML5:

<meta charset="utf-8" />

In CSS:

@charset "utf-8";

In HTTP headers:

Content-type: text/html; charset=utf-8

Character Sets/Encodings

Lessons for developers:

Always be explicit about what character encoding you're producing.
Never guess or assume the character encoding for input (text or files).
… even by accepting some default.
Never assume one byte == one character.

Character Sets/Encodings

The #AskObama tweet was the result of a right-single-quote apostrophe correctly encoded as UTF-8 but incorrectly decoded using some system's default.

>>> t = u'let\U00002019s'
>>> print(t)
let’s
>>> print(t.encode('utf-8'))
b'let\xe2\x80\x99s'
>>> print(t.encode('utf-8').decode('windows-1252'))
letâ€™s

Redirects

We have seen the HTTP status codes 301, 302, 303, 307 for moved resources.

These are sent with a Location header to indicate the new location, and the user agent will go there automatically.

HTTP/1.1 301 Moved Permanently
Location: http://example.com/newlocation
Content-Type: text/html; charset=utf-8

<title>301 Moved Permanently</title>
<h1>Moved Permanently</h1>
The document has moved
<a href="http://example.com/newlocation">here</a>.
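As a rough sketch, a response like this could be produced by a WSGI application such as the following. It is hypothetical and redirects every request; the last lines call it directly to show what it sends.

```python
def application(environ, start_response):
    """Hypothetical WSGI app: permanently redirect every request to one new URL."""
    new_url = 'http://example.com/newlocation'
    # A small HTML body for user agents that don't follow the redirect.
    body = f'The document has moved <a href="{new_url}">here</a>.'.encode('utf-8')
    start_response('301 Moved Permanently', [
        ('Location', new_url),  # where the user agent should go instead
        ('Content-Type', 'text/html; charset=utf-8'),
        ('Content-Length', str(len(body))),
    ])
    return [body]


# Call it directly with an empty environment, since this app ignores the request.
captured = []
result = application({}, lambda status, headers: captured.append((status, headers)))
print(captured[0][0])
print(dict(captured[0][1])['Location'])
```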

Redirects

There are several ways to produce redirects in a system you create.

The web server can do it. e.g. Nginx rewrite. Most useful:

Whole sites: http://www.example.com/ → https://example.com/
Whole directories: http://example.com/Project_x/ → http://example.com/X/
Static content: …/style.css → …/styles/main.css

Redirects

Your application logic can do it. e.g. Django redirect like this Post-Redirect-Get pattern:

def new_object_view(request):
    if request.method == 'POST':
        new_obj = …
        new_obj.save()
        return redirect(new_obj.get_absolute_url())

Or in a URL pattern:

urlpatterns = [
    path('details/<int:pk>/',
        RedirectView.as_view(url=reverse_lazy(…)),
        name='old-details-view'),
]

Redirects

Doing the redirect in your logic lets you be more precise. e.g.

Post-Redirect-Get pattern.
Must log in to view this → temporary redirect to the login page.
URL refers to an object that has been marked deleted → 410 Gone or 303 See Other.
Request for your old URL structure → permanent redirect to corresponding location in new structure.

Redirects

General principle: URLs should never become invalid.

When planning a site/app, take some time to come up with a good URL schema and plan for the future. Redirect if it's really necessary to move stuff.

When you put a URL into production, you should be committing to have it work until the end of time.

Content Negotiation

Different HTTP user agents have different capabilities. So do their users. e.g. file types handled, natural languages read, etc.

HTTP provides a way to automatically and transparently deal with some of these: content negotiation.

Content Negotiation

With every request, browsers send HTTP headers indicating their capabilities using the Accept* headers.

GET /~ggbaker/test.html HTTP/1.1
Host: cmpt470.csil.sfu.ca
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en,fr;q=0.5
Accept-Encoding: gzip, deflate, br

Content Negotiation

All share a “quality” syntax. Quality of every value is 0–1.

Accept-foo: a,b;q=0.9,c,*;q=0.1

That's a at quality 1.0; b at 0.9; c at 1.0 (the default when no q is given); and anything else at quality 0.1.

Content Negotiation

The Accept header gives media types that the browser likes.

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

This says that it prefers HTML or XHTML (with quality 1.0). Other XML slightly worse (quality 0.9). Anything else okay (quality 0.8).
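A simplified sketch of how a server might read that header. The parse_accept function is hypothetical, and it ignores media type parameters other than q.

```python
def parse_accept(header: str):
    """Parse an Accept-style header into (value, quality) pairs, best first."""
    choices = []
    for item in header.split(','):
        value, *params = [part.strip() for part in item.split(';')]
        quality = 1.0  # default when no q parameter is given
        for param in params:
            if param.startswith('q='):
                quality = float(param[2:])
        choices.append((value, quality))
    # sorted() is stable, so values with equal quality keep their header order.
    return sorted(choices, key=lambda choice: choice[1], reverse=True)


accept = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
for value, quality in parse_accept(accept):
    print(value, quality)
```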

Content Negotiation

Browsers change this to indicate useful things for different kinds of requests. For a <link rel="stylesheet" />:

Accept: text/css,*/*;q=0.1

For an <img src="" /> (Chrome):

Accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8

For a CSS font source (Firefox):

Accept: application/font-woff2;q=1.0,application/font-woff;q=0.9,*/*;q=0.8

Content Negotiation

Servers can use this info to send the best quality info to the user agent. e.g. with Firefox's image accept:

Accept: image/webp,*/*

… a server/application could send a WebP image to this browser, but send PNG to older browsers.
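That decision could be sketched like this. The pick_image_type function is made up, and it only checks whether image/webp is listed, ignoring quality values.

```python
def pick_image_type(accept_header: str) -> str:
    """Choose WebP when the client lists it in Accept, else fall back to PNG."""
    # Strip any ";q=..." parameters and whitespace from each listed type.
    accepted = [item.split(';')[0].strip() for item in accept_header.split(',')]
    return 'image/webp' if 'image/webp' in accepted else 'image/png'

print(pick_image_type('image/webp,*/*'))
print(pick_image_type('image/png,image/*;q=0.8,*/*;q=0.5'))
```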

Content Negotiation

The Accept-Language header says something about what the user can read:

Accept-Language: en,fr;q=0.5

Server can send most-appropriate translation for that user.
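A minimal sketch of choosing a translation. The pick_language function and the lists of available translations are assumptions for illustration; frameworks typically provide something more thorough.

```python
def pick_language(accept_language: str, available: list, default: str = 'en') -> str:
    """Return the best available language for an Accept-Language header."""
    prefs = []
    for item in accept_language.split(','):
        tag, _, q = item.strip().partition(';q=')
        prefs.append((float(q) if q else 1.0, tag.lower()))

    # Highest quality first; ties keep their header order because the sort is stable.
    for quality, tag in sorted(prefs, key=lambda pref: pref[0], reverse=True):
        primary = tag.split('-')[0]  # treat a regional tag like fr-CA as fr
        if quality > 0 and primary in available:
            return primary
    return default


print(pick_language('en,fr;q=0.5', ['en', 'fr']))
print(pick_language('en,fr;q=0.5', ['fr', 'de']))
```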

Content Negotiation

But the language is human-configured (and usually just the OS's install language). Always let the user override on your site with a preference.

Frameworks can help give some infrastructure for multi-language sites. Translating every user-visible string in your application is not easy.

Content Negotiation

Accept-Encoding says how content can be encoded (≈compressed) for transport. e.g.

Accept-Encoding: gzip, deflate, br

This lets the server compress the message body when it's sent (or ship a pre-compressed version).

Content Negotiation

Content should be compressed if it helps: files are large enough and compressible. HTML, CSS, JS, SVG compress very well.
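A sketch of "compress if it helps", using gzip from the standard library. The maybe_compress function and the repetitive sample HTML are invented for illustration; in practice the server or framework usually does this for you.

```python
import gzip

def maybe_compress(body: bytes, accept_encoding: str):
    """Return (body, extra_headers), gzipping only if the client accepts it and it helps."""
    encodings = [item.split(';')[0].strip() for item in accept_encoding.split(',')]
    if 'gzip' in encodings:
        compressed = gzip.compress(body)
        if len(compressed) < len(body):
            # Tell the user agent how the body was encoded for transport.
            return compressed, [('Content-Encoding', 'gzip')]
    return body, []


# Sample HTML with lots of repetition, which compresses well.
html = ('<ul>\n' + '<li class="item">An item in a list</li>\n' * 200 + '</ul>\n').encode('utf-8')

body, headers = maybe_compress(html, 'gzip, deflate, br')
print(len(html), '->', len(body), headers)

# Tiny content gets bigger when gzipped, so it is sent as-is.
print(maybe_compress(b'hi', 'gzip'))
```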

Static content: server can be configured.

Dynamic content: framework can help, or maybe the server can do it.