HTTP is the protocol used to transmit data around the WWW. HyperText Transfer Protocol, if you like acronyms.
X is on the web == X can be fetched by HTTP.
HTTP is a relatively simple protocol, but it does things that many web developers don't know about. Knowing what the protocol can do will help you make better web systems.
Let's look at a simple HTTP request: the user has clicked on a link to http://www.sfu.ca/~ggbaker/test.html.
The user agent (client/browser) initiates the conversation by making a TCP connection to www.sfu.ca on port 80 (the default)…
… and it sends a request like this:
GET /~ggbaker/test.html HTTP/1.1
Host: www.sfu.ca
User-Agent: Mozilla/5.0
Accept: text/html,application/xhtml+xml,*/*;q=0.8
Accept-Language: en
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
… followed by a blank line.
The first line is the request line.
GET /~ggbaker/test.html HTTP/1.1
GET: the request method. (More later.)
/~ggbaker/test.html: the path of the URL (could include query string).
HTTP/1.1: the HTTP version.
The following lines are request headers. Each has a field name and value. e.g.
Host: www.sfu.ca
User-Agent: Mozilla/5.0
The Host header is required.
This allows for name-based virtual hosting. i.e. multiple sites on one server and IP address.
Other headers can give info about the browser or user.
The blank line indicates the end of the request.
… or start of the request body if there is one.
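The request format above is simple enough to assemble by hand. A minimal sketch, assuming helper names (build_request, parse_request_line) that are made up for illustration: it builds the bytes of a GET request and splits the request line back into its three parts.

```python
def build_request(host, path, headers=None):
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Lines end with CRLF; an empty line marks the end of the headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_request_line(raw):
    """Split the first line of a request into (method, path, version)."""
    request_line = raw.split(b"\r\n", 1)[0].decode("ascii")
    method, path, version = request_line.split(" ")
    return method, path, version


request = build_request("www.sfu.ca", "/~ggbaker/test.html",
                        {"User-Agent": "Mozilla/5.0"})
print(request.decode("ascii"))
print(parse_request_line(request))
```

In practice an HTTP library does this for you; the point is only that the wire format is plain text with CRLF line endings.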
Once the request is done, the server sends a response…
The request method indicates the overall “action”.
GET is used any time you click a link or type a URL.
When an HTML <form> is submitted, it can be either:
GET method, with form contents encoded in the URL query string: http://…/foo/bar?name=Greg&cost=6
POST method, with form contents encoded in the request body (after that blank line).
Practical differences:
… POST.
You can't bookmark (or link to) the result of a POST request, since not everything needed for the request is in the URL.
Semantic differences:
GET requests should be safe. That is, have no (significant) side effects. They can be safely reloaded.
… POST.
The Post/Redirect/Get pattern is commonly used to avoid users accidentally reloading POST pages.
We will see other HTTP request methods later when we talk about REST.
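To see that GET and POST forms carry the same encoded data in different places, here is a small sketch using Python's urllib.parse. The host example.com is a placeholder, and it assumes the form uses the default URL-encoded form encoding.

```python
from urllib.parse import urlencode, parse_qs

form = {"name": "Greg", "cost": 6}
encoded = urlencode(form)

# GET: the encoded form becomes the query string of the URL.
get_url = "http://example.com/foo/bar?" + encoded

# POST: the same encoding goes in the request body instead.
post_body = encoded.encode("ascii")

print(get_url)
print(post_body)
print(parse_qs(encoded))  # what the server-side code would decode
```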
The server's response will look something like this:
HTTP/1.1 200 OK
Server: nginx/1.16.0
Content-type: text/html; charset=utf-8

<!DOCTYPE html>
<html><head>…
The first line is the status line.
HTTP/1.1 200 OK
HTTP/1.1: the response HTTP version.
200: the status code indicating success/failure of the request. (More later.)
OK: the reason phrase, a human-readable version of the status code.
The following lines are response headers. Same format as the request headers. e.g.
Server: nginx/1.16.0
Content-type: text/html; charset=utf-8
The last line there says that the content is HTML (Internet Media Type text/html) with characters encoded as UTF-8.
Again, a blank line separates the headers from the response body.
For a 200 response, the body contains the actual contents of the resource that was requested. It can be empty for some responses.
The HTTP message format: 1 special line (request/status line), n field/value headers, blank line, message body.
It's not unique to HTTP. It's borrowed from RFC 822, email. Email has headers like To and Subject, but the format is the same.
Sometimes the terminology leaks and the body is called the message body.
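The "1 special line, n headers, blank line, body" format can be taken apart in a few lines. This is a rough sketch only: parse_message is a made-up name, and storing headers in a dict is a simplification that drops repeated header fields.

```python
def parse_message(raw):
    """Split an HTTP/1.x message into (start_line, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Field names are case-insensitive, so normalize them.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: nginx/1.16.0\r\n"
    b"Content-type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<!DOCTYPE html>\n<html><head>"
)
start_line, headers, body = parse_message(raw_response)
print(start_line)
print(headers)
print(body)
```

The same function works on a request, since only the first line differs.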
All of those examples were HTTP/1.1. The current versions are HTTP/2 and HTTP/3.
Conceptually, everything is the same in HTTP/2: methods, status codes, headers, URLs. All of the same ideas as HTTP/1.1.
That info is encoded in a more efficient binary format. Fewer bytes; easier parsing; not human-readable.
A remarkably minimal change from HTTP/1.0 in 1996.
As a developer, you shouldn't have to worry much: let the tools translate your method calls into the right encoding.
The status code in the response indicates what kind of response is being sent.
The first digit indicates the overall type.
1xx codes are informational.
2xx codes indicate success.
200 OK
206 Partial Content: the client asked for only part of the resource (with a Range header); that part is being sent.
3xx codes are redirection, but really more like things the user doesn't have to worry about.
304 Not Modified: the request was conditional, with an If-modified-since or If-none-match header. These indicate the age and specific content cached. No body is returned. More on caching later.
301 Moved Permanently
302 Found
303 See Other
307 Temporary Redirect
These come with a Location header with the new URL. User is taken there automatically.
The difference is semantic: how can bookmarks, search indexes, etc adapt?
4xx codes are client errors: things the user (or user agent) did wrong.
403 Forbidden
404 Not Found: … 301, 303, 410 forever.
410 Gone: … 404.
5xx codes are server errors: things that are problems on the server which the user can't hope to fix.
500 Internal Server Error
502 Bad Gateway
504 Gateway Timeout
503 Service Unavailable
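The "first digit gives the overall type" rule is easy to check in code. A small sketch using the standard library's http.HTTPStatus; the CLASSES table and the describe function are names invented here.

```python
from http import HTTPStatus

CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return (reason phrase, class) for a standard status code."""
    return HTTPStatus(code).phrase, CLASSES[code // 100]


for code in (200, 206, 301, 304, 404, 410, 503):
    print(code, *describe(code))
```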
The basic job of a web server is the same as any other server in a client/server conversation: when requests come in, send a response.
A web server does it by speaking HTTP.
There are many HTTP servers to choose from. Some of the most popular:
In some ways, the job of a web server is more complicated than that of other servers like SSH or FTP.
Web content can come from many different sources. The server must be configured to do the right thing for each request.
One possibility: the requested content is in a file on disk. This is static content.
That's what you're creating for exercises 1 and 2.
Content can also be generated when it is requested by a user agent: dynamic content. This allows content that changes frequently, or is different for each user, or ….
Whether content is static or dynamic is (possibly) invisible to the user.
The server must know what to do for each request.
How to configure that depends on the server (and programming language/framework). It can be more painful than you'd expect.
Remember that even for a site with lots of dynamic content, there will still be many static resources (stylesheets, images, JavaScript, etc).
It's easy to spend hours optimizing your backend dynamic code, but then be slowed down by 3 MB of CSS + JS + images.
Another possibility: the server that gets the request from the user doesn't do the work, it forwards the request to another (application) server behind it, and then passes along the response.
It's acting as a gateway or reverse proxy or load balancer or frontend server. More later.
Server-side programming is also called backend programming. [vs JavaScript/client-side/frontend]
There are many tools for server-side programming. Too many to talk about now. (Rails, Django, Snap, Play, Express, …)
Choice of server-side framework/tools will be a big part of the technology evaluation, and the course.
There are many ways to connect your logic to a web server. Some are language-specific, some are intended for development only. Most of the time, your framework/tools will make the decision for you.
But let's see a quick example using WSGI, the common method for Python code to interface with a web server.
A minimal WSGI application:
def application(environ, start_response):
    status = '200 OK'
    response_headers = [
        ('Content-Type', 'text/plain; charset=utf-8'),
    ]
    start_response(status, response_headers)
    content = 'This is content. \U0001F600'
    return [content.encode('utf-8')]
The web server is given this function, and can call it to generate content as needed.
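To see what "the server calls the function" means, we can play the server's role ourselves. This sketch repeats the application above, builds a fake request environment with wsgiref.util.setup_testing_defaults, and records what the application hands back. The captured dict and this start_response are stand-ins written for the demo, not part of any real server.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    status = '200 OK'
    response_headers = [
        ('Content-Type', 'text/plain; charset=utf-8'),
    ]
    start_response(status, response_headers)
    content = 'This is content. \U0001F600'
    return [content.encode('utf-8')]


# Stand in for the web server: build a fake request environment and
# record whatever the application passes to start_response.
environ = {}
setup_testing_defaults(environ)
captured = {}


def start_response(status, headers):
    captured['status'] = status
    captured['headers'] = headers


body = b''.join(application(environ, start_response))
print(captured['status'], captured['headers'])
print(body.decode('utf-8'))
```

For a quick local test over real HTTP, the standard library's wsgiref.simple_server.make_server can serve the same function; production setups would normally use a dedicated WSGI server.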
Input to server-side programs comes from a request (form submission, URL contents, etc). Output is an HTTP response.
Other than that, it's all just programming: do some computation, make some DB queries, write a file, whatever.
There are many common problems to solve: how to get data to and from a database, how to connect URLs to logic that generates a response, how to present data in HTML, ….
This leads to server-side development frameworks to help with common tasks.
e.g. Rails, Django, Snap, Play, Express.
These frameworks (and client-side frameworks and other tools) will be the subject of the technology evaluation.
But they are all fundamentally ways to more easily/flexibly/robustly create an HTTP response, like the WSGI application a few slides ago.
It's easy to say “your HTML is sent by HTTP”, but wrong. Networks don't send characters, they send bits. Characters must be encoded as bits for transport.
The source and destination have to agree on how to convert characters ↔ bits. If not, characters will be displayed wrong.
It's easy to get wrong if you're not paying attention.
A character set is a numbered list of characters. Here are a few we might number:
Number | Character |
---|---|
65 | A |
97 | a |
8838 | ⊆ |
63935 | 樂 |
128169 | 💩 |
We need to number everything we want to treat as a character.
In the early days, it was easy: ASCII encodes 95 characters (numbered 32–126, plus some control characters).
Store each one as a byte, and it's good enough for English text.
But there are other languages too…
ASCII characters are numbered <128, so we still have half the byte left. That's enough for Western Europe, or Greek, or Hebrew, but only one at a time.
It turns out, people want to include several languages in a document. Also, Chinese and Arabic exist. And math. And the poo emoji.
The Unicode character set is designed to handle all of the world's writing systems. It comes to about 145k characters or code points (with space for >1M).
ŤИאඌଧჶགᛒℕ∮⌘⽩가𐤇𝄞𝌩🁊🀗🍔😝🚕
A character encoding is a way to convert the character numbers ↔ bits.
With Unicode, it's not obvious how to turn characters into bits. Three or four bytes for each character? Seems wasteful since all ASCII characters (HTML tags, CSS properties, etc) are numbered <128.
The solution: UTF-8. It's a very clever way to encode Unicode characters efficiently. It's the only realistic choice for the modern web.
In particular, all ASCII characters are encoded just like ASCII.
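A quick check of both ideas, using the characters from the table earlier: ord gives the character's number in the character set, and encoding as UTF-8 gives the bytes. Escapes are used in the source so the example doesn't depend on how this file itself is encoded.

```python
chars = ['A', 'a', '\u2286', '\uf9bf', '\U0001F4A9']

for ch in chars:
    encoded = ch.encode('utf-8')
    # character number, the character, bytes needed, the bytes themselves
    print(ord(ch), ch, len(encoded), encoded)

numbers = [ord(ch) for ch in chars]
lengths = [len(ch.encode('utf-8')) for ch in chars]

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
same_as_ascii = 'HTML'.encode('utf-8') == 'HTML'.encode('ascii')
print(same_as_ascii)
```

ASCII characters take one byte each; the others here take three or four.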
Always explicitly declare your character encoding to ensure the user agent gets it right. In HTML5:
<meta charset="utf-8" />
In CSS:
@charset "utf-8";
In HTTP headers:
Content-type: text/html; charset=utf-8
Lessons for developers:
The #AskObama tweet was the result of a right-single-quote apostrophe correctly encoded as UTF-8 but incorrectly decoded using some system's default.
>>> t = u'let\U00002019s'
>>> print(t)
let’s
>>> print(t.encode('utf-8'))
b'let\xe2\x80\x99s'
>>> print(t.encode('utf-8').decode('windows-1252'))
letâ€™s
We have seen the HTTP status codes 301, 302, 303, and 307 for moved resources (and 410 for resources that are gone).
The redirects are sent with a Location header to indicate the new location, and the user agent will go there automatically.
HTTP/1.1 301 Moved Permanently
Location: http://example.com/newlocation
Content-Type: text/html; charset=utf-8

<title>301 Moved Permanently</title>
<h1>Moved Permanently</h1>
The document has moved <a href="http://example.com/newlocation">here</a>.
There are several ways to produce redirects in a system you create.
The web server can do it. e.g. Nginx rewrite. Most useful:
http://www.example.com/ → https://example.com/
http://example.com/Project_x/ → http://example.com/X/
…/style.css → …/styles/main.css
Your application logic can do it. e.g. Django redirect, like this Post-Redirect-Get pattern:
from django.shortcuts import redirect

def new_object_view(request):
    if request.method == 'POST':
        new_obj = …
        new_obj.save()
        # send the browser to the new object's page with a GET
        return redirect(new_obj.get_absolute_url())
Or in a URL pattern:
from django.urls import path, reverse_lazy
from django.views.generic import RedirectView

urlpatterns = [
    # reverse_lazy: the URLconf is still loading here, so defer the lookup
    path('details/<int:pk>/', RedirectView.as_view(url=reverse_lazy(…)), name='old-details-view'),
]
Doing the redirect in your logic lets you be more precise. e.g. 410 Gone or 303 See Other.
General principle: URLs should never become invalid.
When planning a site/app, take some time to come up with a good URL schema and plan for the future. Redirect if it's really necessary to move stuff.
When you put a URL into production, you should be committing to have it work until the end of time.
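One way to keep old URLs alive is a table of moved paths consulted before anything else. A framework-free sketch as a WSGI application: the MOVED table and the call helper are hypothetical, and the paths are borrowed from the examples above.

```python
# Hypothetical mapping of retired paths to their new homes.
MOVED = {
    '/Project_x/': '/X/',
    '/style.css': '/styles/main.css',
}


def application(environ, start_response):
    path = environ.get('PATH_INFO', '/')
    if path in MOVED:
        # Permanent move: user agents follow the Location header.
        start_response('301 Moved Permanently', [('Location', MOVED[path])])
        return [b'']
    start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8')])
    return [b'current content']


def call(path):
    """Invoke the app the way a server would; return (status, headers)."""
    result = {}

    def start_response(status, headers):
        result['status'] = status
        result['headers'] = dict(headers)

    application({'PATH_INFO': path}, start_response)
    return result['status'], result['headers']


print(call('/Project_x/'))
print(call('/X/'))
```

Entries should only ever be added to such a table, never removed, if old links are to keep working.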
Different HTTP user agents have different capabilities. So do their users. e.g. file types handled, natural languages read, etc.
HTTP provides a way to automatically and transparently deal with some of these: content negotiation.
With every request, browsers send HTTP headers indicating their capabilities using the Accept* headers.
GET /~ggbaker/test.html HTTP/1.1
Host: cmpt470.csil.sfu.ca
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en,fr;q=0.5
Accept-Encoding: gzip, deflate, br
All share a “quality” syntax. Quality of every value is 0–1.
Accept-foo: a,b;q=0.9,c,*;q=0.1
That's a at quality 1.0; b at 0.9; c at 1.0 (the default when no q is given); and anything else at quality 0.1.
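A sketch of parsing that quality syntax, assuming the rule from the HTTP specification that a value with no q parameter has quality 1.0. parse_quality is a name invented here; real frameworks ship their own, more careful parsers.

```python
def parse_quality(header):
    """Parse an Accept-style header into [(value, q)], best first."""
    items = []
    for part in header.split(','):
        value, *params = [p.strip() for p in part.split(';')]
        q = 1.0  # default when no q parameter is given
        for param in params:
            name, _, val = param.partition('=')
            if name.strip() == 'q':
                q = float(val)
        items.append((value, q))
    # sorted() is stable, so ties keep their order from the header.
    return sorted(items, key=lambda item: item[1], reverse=True)


print(parse_quality('a,b;q=0.9,c,*;q=0.1'))
```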
The Accept header gives media types that the browser likes.
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
This says that it prefers HTML or XHTML (with quality 1.0). Other XML slightly worse (quality 0.9). Anything else okay (quality 0.8).
Browsers change this to indicate useful things for different kinds of requests. For a <link rel="stylesheet" />:
Accept: text/css,*/*;q=0.1
For an <img src="" /> (Chrome):
Accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8
For a CSS font source (Firefox):
Accept: application/font-woff2;q=1.0,application/font-woff;q=0.9,*/*;q=0.8
Servers can use this info to send the best quality info to the user agent. e.g. with Firefox's image accept:
Accept: image/webp,*/*
… a server/application could send a WebP image to this browser, but send PNG to older browsers.
The Accept-language header says something about what the user can read:
Accept-Language: en,fr;q=0.5
Server can send most-appropriate translation for that user.
But the language is human-configured (and usually just the OS's install language). Always let the user override on your site with a preference.
Frameworks can help give some infrastructure for multi-language sites. Translating every user-visible string in your application is not easy.
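A rough sketch of language negotiation with a user override. Everything named here (choose_language, the preferred argument, the available set) is hypothetical, and matching only on the primary subtag is a simplification.

```python
def choose_language(accept_language, available, preferred=None, default='en'):
    """Pick a language: explicit user preference first, then the header."""
    if preferred in available:
        return preferred  # the user's own setting on the site wins
    best, best_q = default, 0.0
    for part in accept_language.split(','):
        tag, _, params = part.strip().partition(';')
        q = 1.0  # default quality when no q is given
        if params.strip().startswith('q='):
            q = float(params.strip()[2:])
        # Compare only the primary subtag, so "fr-CA" matches "fr".
        primary = tag.strip().lower().split('-')[0]
        if primary in available and q > best_q:
            best, best_q = primary, q
    return best


print(choose_language('en,fr;q=0.5', {'en', 'fr'}))
print(choose_language('en,fr;q=0.5', {'fr', 'de'}))
print(choose_language('en,fr;q=0.5', {'en', 'fr'}, preferred='fr'))
```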
Accept-encoding says how content can be encoded (≈compressed) for transport. e.g.
Accept-Encoding: gzip, deflate, br
This lets the server compress the message body when it's sent (or ship a pre-compressed version).
Content should be compressed if it helps: files are large enough and compressible. HTML, CSS, JS, SVG compress very well.
Static content: server can be configured.
Dynamic content: framework can help, or maybe the server can do it.
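A sketch of "compress if it helps" for a dynamic response, using the standard gzip module. maybe_compress and the 1024-byte threshold are assumptions for illustration, and for brevity it ignores q values (so an explicit gzip;q=0 is not honoured).

```python
import gzip


def maybe_compress(body, accept_encoding, min_size=1024):
    """Return (body, content_encoding), gzipping when it is likely to help."""
    encodings = [e.split(';')[0].strip() for e in accept_encoding.split(',')]
    if 'gzip' in encodings and len(body) >= min_size:
        compressed = gzip.compress(body)
        if len(compressed) < len(body):
            return compressed, 'gzip'
    return body, None


html = b'<li class="item">Hello, world</li>\n' * 200
body, encoding = maybe_compress(html, 'gzip, deflate, br')
print(len(html), len(body), encoding)
```

When the second value is 'gzip', the response would also need a Content-Encoding: gzip header so the user agent knows to decompress.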