HTTP is the protocol used to transmit data around the WWW.
HyperText Transfer Protocol, if you like acronyms.
X is on the web ==
X can be fetched by HTTP.
HTTP is a relatively simple protocol, but it does things that many web developers don't know about. Knowing what the protocol can do will help you make better web systems.
Let's look at a simple HTTP request: the user has clicked on a link to /~ggbaker/test.html on cmpt470.csil.sfu.ca.
The user agent (client/browser) initiates the conversation by making a TCP connection to
cmpt470.csil.sfu.ca on port 80 (the default)…
and it sends a request like this:
GET /~ggbaker/test.html HTTP/1.1
Host: cmpt470.csil.sfu.ca
User-Agent: Mozilla/5.0
Accept: text/html,application/xhtml+xml,*/*;q=0.8
Accept-Language: en
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
…followed by a blank line.
The first line is the request line.
GET /~ggbaker/test.html HTTP/1.1
GET: the request method. (More later.)
/~ggbaker/test.html: the path of the URL (could include query string).
HTTP/1.1: the HTTP version.
The following lines are request headers. Each has a field name and value. e.g.
Host: cmpt470.csil.sfu.ca
User-Agent: Mozilla/5.0
Host header is required.
This allows for name-based virtual hosting. i.e. multiple sites on one server and IP address.
Other headers can give info about the browser or user.
The blank line indicates the end of the request.
… or start of the request body if there is one.
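The request format described above can be sketched in a few lines of Python. This is only an illustration: build_request is a hypothetical helper, not part of any library, and it assumes the HTTP/1.1 rule that each line ends with CRLF.

```python
def build_request(method, path, host, headers=None, body=b""):
    """Assemble an HTTP/1.1 request message as bytes (illustrative helper)."""
    # Request line first, then the required Host header.
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Each line ends with CRLF; an empty line marks the end of the headers.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


request = build_request(
    "GET", "/~ggbaker/test.html", "cmpt470.csil.sfu.ca",
    headers={"User-Agent": "Mozilla/5.0", "Accept-Language": "en"},
)
print(request.decode("ascii"))
```

A real user agent would write these bytes to the TCP connection and then read the response from the same connection.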
Once the request is done, the server sends a response…
The request method indicates the overall “action”.
GET is used any time you click a link or type a URL.
When an HTML <form> is submitted, it can be either:
GET method, with form contents encoded in the URL query string.
POST method, with form contents encoded in the request body (after that blank line).
The result of a POST request can't be bookmarked or linked to, since not everything needed for the request is in the URL.
GET requests should be safe. That is, have no (significant) side effects. They can be safely reloaded.
The Post/Redirect/Get pattern is commonly used to keep users from accidentally reloading a POST request.
We will see other HTTP request methods later when we talk about REST.
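To make the GET/POST difference concrete, here is a small sketch of how the same form data could end up in either place. The field names and the /search path are made up for illustration; urlencode is the standard-library function for this encoding.

```python
from urllib.parse import urlencode

# Hypothetical form fields, for illustration only.
form = {"q": "http basics", "page": "2"}
encoded = urlencode(form)

# GET: the form contents go in the URL's query string.
get_request_line = f"GET /search?{encoded} HTTP/1.1"

# POST: the same encoding goes in the request body instead,
# and headers describe what the body contains.
post_body = encoded.encode("ascii")
post_headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Content-Length": str(len(post_body)),
}

print(get_request_line)
print(post_headers, post_body)
```

Everything needed to repeat the GET is in its URL, which is why it can be bookmarked. The POST cannot be reconstructed from the URL alone.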
The server's response will look something like this:
HTTP/1.1 200 Okay
Server: nginx/1.16.0
Content-type: text/html; charset=utf-8

<!DOCTYPE html>
<html><head>…
The first line is the status line.
HTTP/1.1 200 Okay
HTTP/1.1: the response HTTP version
200: the status code indicating success/failure of the request. (More later.)
Okay: the reason phrase, a human-readable version of the status code.
The following lines are response headers. Same format as the request headers. e.g.
Server: nginx/1.16.0
Content-type: text/html; charset=utf-8
Last line there says that the content is HTML (Internet Media Type
text/html) with characters encoded as UTF-8.
Again, a blank line to separate the headers from the response body.
For a 200 response, the body contains the actual contents of the resource that was requested. It can be empty for some responses.
The HTTP message format: 1 special line (request/status line), n field/value headers, blank line, message body.
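That four-part layout is simple enough to pull apart by hand. The following is a minimal sketch, not a full parser: parse_message is a hypothetical helper, and it assumes CRLF line endings and ignores repeated header fields.

```python
def parse_message(raw):
    """Split an HTTP/1.x message into (start line, headers dict, body bytes)."""
    # The first blank line separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    start_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Field names are case-insensitive, so normalize them.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


raw = (b"HTTP/1.1 200 Okay\r\n"
       b"Server: nginx/1.16.0\r\n"
       b"Content-type: text/html; charset=utf-8\r\n"
       b"\r\n"
       b"<!DOCTYPE html>\n<html><head>")
status_line, headers, body = parse_message(raw)
version, code, reason = status_line.split(" ", 2)
print(version, code, reason, headers, body)
```

The same function works on a request, since only the first line differs between the two kinds of message.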
It's not unique to HTTP. It's borrowed from RFC 822, email. Email has headers like
Subject, but the format is the same.
Sometimes the terminology leaks and the body is called the message body.
All of those examples were HTTP/1.1. Its successor is HTTP/2 (itself now followed by HTTP/3).
Conceptually, everything is the same in HTTP/2: methods, status codes, headers, URLs. All of the same ideas as HTTP/1.1.
That info is encoded in a more efficient binary format. Fewer bytes; easier parsing; not human-readable.
A remarkably minimal change from HTTP/1.0 in 1996.
As a developer, you shouldn't have to worry much: let the tools translate your method calls into the right encoding.
The status code in the response indicates what kind of response is being sent.
The first digit indicates the overall type.
2xx codes indicate success.
206 Partial Content: the client requested only part of the resource (with a Range header); that part is being sent.
3xx codes are “redirection”, but really more like “things the user doesn't have to worry about”.
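304 Not Modified: the client made a conditional request with an If-Modified-Since or If-None-Match header. These indicate the age and specific content cached.
The cached copy is still good, so no body is returned.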
More on caching later.
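As a preview, the server side of a conditional request might look roughly like this. It is a simplified sketch: make_etag and respond are hypothetical names, hashing the content is just one way to produce a validator, and a real If-None-Match header can list several values.

```python
import hashlib


def make_etag(content):
    """Derive a validator from the content (one possible scheme)."""
    return '"' + hashlib.sha256(content).hexdigest()[:16] + '"'


def respond(request_headers, content):
    """Return (status code, response headers, body) for a GET of `content`."""
    etag = make_etag(content)
    if request_headers.get("If-None-Match") == etag:
        # The client's cached copy is current: send no body.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, content


page = b"<!DOCTYPE html><title>Hello</title>"
status, headers, body = respond({}, page)                       # first visit
status2, _, body2 = respond({"If-None-Match": headers["ETag"]}, page)  # revisit
print(status, len(body), status2, len(body2))
```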
301 Moved Permanently
303 See Other
307 Temporary Redirect
All of these come with a Location header with the new URL. The user is taken there automatically.
The difference is semantic: how can bookmarks, search indexes, etc adapt?
4xx codes are client errors: things the user (or user agent) did wrong.
404 Not Found
5xx codes are server errors: things that are problems on the server which the user can't hope to fix.
500 Internal Server Error
502 Bad Gateway
504 Gateway Timeout
503 Service Unavailable
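Python's standard library knows the registered status codes, so the "first digit gives the type" rule can be shown in a few lines. The describe function and the CLASSES table are illustrative names, not a standard API.

```python
from http import HTTPStatus

# The first digit of the code gives the overall type of response.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return 'code phrase (class)' for a registered status code."""
    status = HTTPStatus(code)   # raises ValueError for codes it doesn't know
    return f"{code} {status.phrase} ({CLASSES[code // 100]})"


for code in (206, 304, 404, 503):
    print(describe(code))
```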
The basic job of a web server is the same as any other server in a client/server conversation: when requests come in, send a response.
A web server does it by speaking HTTP.
There are many HTTP servers to choose from. Some of the most popular:
The job of a web server is more complicated than others like SSH, FTP (in some ways).
Web content can come from many different sources. The server must be configured to do the right thing for each request.
One possibility: the requested content is in a file on disk. This is static content.
That's what you're creating for exercises 1 and 2.
Content can also be generated when it is requested by a user agent: dynamic content. This allows content that changes frequently, or is different for each user, or ….
Whether content is static or dynamic is (possibly) invisible to the user.
The server must know what to do for each request.
How to configure that depends on the server (and programming language/framework). It can be more painful than you'd expect.
It's easy to spend hours optimizing your backend dynamic code, but then be slowed down by 3 MB of CSS + JS + images.
Another possibility: the server that gets the request from the user doesn't do the work, it forwards the request to another (application) server behind it, and then passes along the response.
It's acting as a gateway or reverse proxy or load balancer or frontend server. More later.
There are many tools for server-side programming. Too many to talk about now. (Rails, Django, Snap, Play, Express, …)
Choice of server-side framework/tools will be a big part of the technology evaluation, and the course.
An old-fashioned (but easily understandable) way to generate dynamic content: CGI scripts.
A CGI script is an executable program that prints the contents of the resource to stdout. The web server runs the program and sends its output to the client.
You'll probably never use CGI, but it's a good way to see the basics.
A complete CGI script in Python:
#!/usr/bin/env python3
print('Content-type: text/html; charset=utf-8')
print()
print('<!DOCTYPE html>')
print('<html lang="en"><head><meta charset="utf-8" />')
print('<title>Page Title</title></head><body>')
print('The page</body></html>')
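This prints the last header(s) and message body of the response. The web server adds the status line and any other headers before sending it to the client.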
Input to server-side programs comes from a request (form submission, URL contents, etc). Output is an HTTP response.
Other than that, it's all just programming: do some computation, make some DB queries, write a file, whatever.
There are many common problems to solve: how to get data to and from a database, how to connect URLs to logic that generates a response, how to present data in HTML, ….
This leads to server-side development frameworks to help with common tasks.
e.g. Rails, Django, Snap, Play, Express.
These frameworks (and client-side frameworks and other tools) will be the subject of the technology evaluation.
It's easy to say “your HTML is sent by HTTP”, but wrong. Networks don't send characters, they send bits. Characters must be encoded as bits for transport.
The source and destination have to agree on how to convert characters ↔ bits. If not, characters will be displayed wrong.
It's easy to get wrong if you're not paying attention.
A character set is a numbered list of characters. Here are a few we might number:
We need to number everything we want to treat as a character.
In the early days, it was easy: ASCII encodes 95 characters (numbered 32–126, plus some control characters).
Store each one as a byte, and it's good enough for English text.
But there are other languages too…
ASCII characters are numbered <128, so we still have half the byte left. That's enough for Western Europe, or Greek, or Hebrew, but only one at a time.
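The "only one at a time" problem is easy to demonstrate: the same byte means a different character in each 8-bit character set. This sketch uses three ISO 8859 encodings that ship with Python; the choice of byte 0xE9 is arbitrary.

```python
byte = bytes([0xE9])   # a single byte with a value above 127

# Western European, Greek, and Hebrew 8-bit character sets.
decoded = {
    encoding: byte.decode(encoding)
    for encoding in ("iso-8859-1", "iso-8859-7", "iso-8859-8")
}

for encoding, character in decoded.items():
    print(encoding, character, f"U+{ord(character):04X}")
```

Nothing in the byte itself says which reading is intended, so sender and receiver need some other way to agree.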
It turns out, people want to include several languages in a document. Also, Chinese and Arabic exist. And math. And the poo emoji.
The Unicode character set is designed to handle all of the world's writing systems. It comes to about 150k characters or code points (with space for >1M).
A character encoding is a way to convert the character numbers ↔ bits.
With Unicode, it's not obvious how to turn characters into bits. Three or four bytes for each character? Seems wasteful since all ASCII characters (HTML tags, CSS properties, etc) are numbered <128.
The solution: UTF-8. It's a very clever way to encode Unicode characters efficiently. It's the only realistic choice for the modern web.
In particular, all ASCII characters are encoded just like ASCII.
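A quick way to see the variable-length behaviour is to encode a few characters and count the bytes. The sample characters here are arbitrary picks from different parts of Unicode.

```python
# A, e-acute, euro sign, pile-of-poo emoji
samples = ["A", "\u00e9", "\u20ac", "\U0001F4A9"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded!r}")

lengths = [len(ch.encode("utf-8")) for ch in samples]

# ASCII text produces exactly the same bytes under both encodings.
same_as_ascii = "<html>".encode("utf-8") == "<html>".encode("ascii")
```

Markup made of ASCII characters costs one byte per character, and only the less common characters take more.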
Always explicitly declare your character encoding to ensure the user agent gets it right. In HTML5:
<meta charset="utf-8" />
In HTTP headers:
Content-type: text/html; charset=utf-8
Lessons for developers:
The #AskObama tweet was the result of a right-single-quote apostrophe (U+2019) correctly encoded as UTF-8 but incorrectly decoded using some system's default.
>>> t = u'let\U00002019s'
>>> print(t)
let’s
>>> print(t.encode('utf-8'))
b'let\xe2\x80\x99s'
>>> print(t.encode('utf-8').decode('windows-1252'))
letâ€™s
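We have seen the HTTP status codes 301, 303, and 307 for moved resources.
These are sent with a Location header to indicate the new location, and the user agent will go there automatically. (A 410 Gone response, by contrast, has no new location: the resource has been removed for good.)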
HTTP/1.1 301 Moved Permanently
Location: http://example.com/newlocation
Content-Type: text/html; charset=utf-8

<title>301 Moved Permanently</title>
<h1>Moved Permanently</h1>
The document has moved <a href="http://example.com/newlocation">here</a>.
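From the client's side, "go there automatically" could be sketched as below. next_request is a hypothetical helper, and the method handling is simplified: a 303 switches to GET, while this sketch keeps the original method for 301 and 307 (many real clients also switch POST to GET on a 301).

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 303, 307}


def next_request(method, url, status, headers):
    """Return the (method, url) to request next, or None if not a redirect."""
    if status not in REDIRECT_CODES or "Location" not in headers:
        return None
    # Location may be relative, so resolve it against the current URL.
    target = urljoin(url, headers["Location"])
    if status == 303:
        # 303 See Other: fetch the new URL with GET (as in Post/Redirect/Get).
        return "GET", target
    return method, target


print(next_request("GET", "http://example.com/old", 301,
                   {"Location": "http://example.com/newlocation"}))
print(next_request("POST", "http://example.com/form", 303,
                   {"Location": "/newlocation"}))
```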
There are several ways to produce redirects in a system you create.
The web server can do it. e.g. Nginx
rewrite. Most useful:
Your application logic can do it. e.g. Django's redirect, as in this Post/Redirect/Get pattern:
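from django.shortcuts import redirect

def new_object_view(request):
    if request.method == 'POST':
        new_obj = …
        new_obj.save()
        # Redirect so that reloading the result doesn't repeat the POST.
        return redirect(new_obj.get_absolute_url())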
Or in a URL pattern:
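from django.urls import path, reverse_lazy
from django.views.generic import RedirectView
urlpatterns = [
    # reverse_lazy: URL patterns aren't loaded yet when this module is imported
    path('details/<int:pk>/', RedirectView.as_view(url=reverse_lazy(…)), name='old-details-view'),
]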
Doing the redirect in your logic lets you be more precise about the response, e.g. sending 303 See Other.
General principle: URLs should never become invalid.
When planning a site/app, take some time to come up with a good URL schema and plan for the future. Redirect if it's really necessary to move stuff.
When you put a URL into production, you should be committing to have it work until the end of time.
Different HTTP user agents have different capabilities. So do their users. e.g. file types handled, natural languages read, etc.
HTTP provides a way to automatically and transparently deal with some of these: content negotiation.
With every request, browsers send HTTP headers indicating their capabilities: Accept, Accept-Language, and Accept-Encoding.
GET /~ggbaker/test.html HTTP/1.1
Host: cmpt470.csil.sfu.ca
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en,fr;q=0.5
Accept-Encoding: gzip, deflate, br
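All share a “quality” syntax. Quality of every value is 0–1, and a value with no q parameter has quality 1.0. e.g. a hypothetical value a,b;q=0.9,c;q=0.9,*;q=0.1 means:
a at quality 1.0;
b at 0.9;
c at 0.9 (values are conventionally listed from most to least preferred); and anything else at quality 0.1.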
Accept header gives media types that the browser likes.
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
This says that it prefers HTML or XHTML (with quality 1.0). Other XML slightly worse (quality 0.9). Anything else okay (quality 0.8).
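A server that wants to act on this has to parse the header first. The following is a minimal sketch of that step: parse_accept is a hypothetical helper, it treats a missing q as 1.0, and it ignores any parameters other than q.

```python
def parse_accept(value):
    """Return [(item, quality)] from an Accept-style header, best first."""
    items = []
    for part in value.split(","):
        name, *params = [p.strip() for p in part.split(";")]
        quality = 1.0                      # default when no q is given
        for param in params:
            if param.startswith("q="):
                quality = float(param[2:])
        items.append((name, quality))
    # sorted() is stable, so items with equal quality keep header order.
    return sorted(items, key=lambda item: item[1], reverse=True)


accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
preferences = parse_accept(accept)
for media_type, quality in preferences:
    print(quality, media_type)
```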
Browsers change this to indicate useful things for different kinds of requests. For a <link rel="stylesheet" />:
For an <img src="" /> (Chrome):
For a CSS font source (Firefox):
Accept: application/font-woff2;q=1.0,application/font-woff;q=0.9,*/*;q=0.8
Servers can use this info to send the best quality info to the user agent. e.g. with Chrome's image accept:
… a server/application could send a WebP image, but send PNG to Firefox.
Accept-language header says something about what the user can read:
Server can send most-appropriate translation for that user.
But the language is human-configured (and usually just the OS's install language). Always let the user override on your site with a preference.
Frameworks can help give some infrastructure for multi-language sites. Translating every user-visible string in your application is not easy.
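Choosing a translation could look something like this sketch. pick_language and its parameters are made up for illustration; it checks the user's saved preference first, then the header, and it does not handle the * wildcard.

```python
def pick_language(accept_language, available, user_choice=None, default="en"):
    """Choose a translation: explicit user preference first, then the header."""
    if user_choice in available:
        return user_choice   # the user's own setting always wins
    prefs = []
    for part in accept_language.split(","):
        tag, *params = [p.strip() for p in part.split(";")]
        quality = 1.0
        for param in params:
            if param.startswith("q="):
                quality = float(param[2:])
        if tag and quality > 0:
            prefs.append((quality, tag.lower()))
    # Highest quality first; the sort is stable, so header order breaks ties.
    prefs.sort(key=lambda pref: pref[0], reverse=True)
    for _, tag in prefs:
        # Try the full tag, then its primary language ('en-ca' -> 'en').
        for candidate in (tag, tag.split("-")[0]):
            if candidate in available:
                return candidate
    return default


print(pick_language("en,fr;q=0.5", ["fr", "de"]))
print(pick_language("en,fr;q=0.5", ["en", "fr"], user_choice="fr"))
```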
Accept-encoding says how content can be encoded (≈compressed) for transport. (FF, Chrome, Edge)
Accept-Encoding: gzip, deflate, br
Accept-Encoding: gzip, deflate, sdch
Accept-Encoding: gzip, deflate
This lets the server compress the message body when it's sent (or ship a pre-compressed version).
Content should be compressed if it helps: files are large enough and compressible. HTML, CSS, JS, SVG compress very well.
Static content: server can be configured.
Dynamic content: framework can help.
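The "if it helps" condition can be checked directly. In this sketch, maybe_compress is a hypothetical helper, the repeated HTML snippet is invented sample data, and random bytes stand in for content that is already compressed (images, video). It also ignores q values in the header.

```python
import gzip
import os

html = b'<li class="item"><a href="/page">A link</a></li>\n' * 200
noise = os.urandom(len(html))   # stands in for already-compressed data


def maybe_compress(body, accept_encoding):
    """Gzip the body only if the client accepts it and it gets smaller."""
    accepted = {token.split(";")[0].strip() for token in accept_encoding.split(",")}
    if "gzip" in accepted:
        compressed = gzip.compress(body)
        if len(compressed) < len(body):
            return compressed, {"Content-Encoding": "gzip"}
    return body, {}   # send as-is


for label, data in (("HTML", html), ("random bytes", noise)):
    body, headers = maybe_compress(data, "gzip, deflate, br")
    print(f"{label}: {len(data)} -> {len(body)} bytes {headers}")
```

The Content-Encoding response header tells the user agent that it must decompress the body before using it.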