We have talked a lot about how control/represent the logic of our program: if
, else
, for
, while
, writing functions, … .
That's important (and a lot of your job for the rest of the semester is to get a feeling for how to use and assemble those pieces), but the way we store information in our code matters too.
In general, data structures
are ways to store your data in a program. There can be many tradeoffs when it comes to fast manipulation of data, using less of your computer's memory, ease of working with the data, etc.
But for this course, we'll explore a few data structures (types) that Python gives us, and leave the more general concerns for another course.
We have already met tuples as a type used to manage multiple values as a single variable (or function argument, or return value).
We usually can just deal with individual values like we have been doing:
def magnitude(x, y): return math.sqrt(x*x + y*y) x = 3 y = 4 size = magnitude(x, y)
But it's sometimes more pleasant to think about a single value with multiple components.
def magnitude(point): x, y = point return math.sqrt(x*x + y*y) point = (3, 4) size = magnitude(point)
That's what Pillow was doing with image sizes, colours, etc. I don't want to have to pass in separate red, green, and blue values: I want to think about a single colour
.
One notable use: functions always return one thing. If we want to return multiple values, we can return a tuple with as many pieces as we need.
def scale(point, factor): x, y = point return factor*x, factor*y point = (3, 4) doubled = scale(point, 2)
Terminology: a tuple of two things is generally called a pair
; with three things, a triple
; with four things, a 4-tuple
.
To get the parts out of a tuple, the most beautiful way is probably a multiple assignment:
x, y = point
As long as the number of things on the left side matches the size of the tuple, it will get disassembled into those variables.
It's common for tuples to contain different types in each position, creating an ad hoc data structure for a more complex value
.
student1 = ("Jane Doe", 4.33) student2 = ("Bart Simpson", 2.33) name, gpa = student1
In some sense, this is similar to what classes/objects are doing: storing all of the aspects of a single concept in one Python value.
Tuples can be compared for equality: all elements must be equal for the tuples to be equal.
When compared for less/greater, the first element that's different determines the result (i.e. lexicographic order).
point1 = (1.0, 2.3) point2 = (1.0, 2.4) point3 = (1.0, 2.3) print(point1 == point2) print(point1 < point2) # because 2.3 < 2.4 print(point1 == point3)
False True True
Python's lists are another type that can contain multiple values.
Lists are written with square brackets, can be any length, and typically hold several related values of the same type.
measurements = [1.2, 1.3, 0.9, 1.1, 1.2, 1.0] fruits = ["apple", "banana", "pear"]
Elements of a list can be extracted with square brackets and the position of the element (counting from zero), just like extracting characters from strings:
fruits = ["apple", "banana", "pear"] print(fruits[0]) print(fruits[1]) word = "something" print(word[0]) print(word[1])
apple banana s o
You can get elements out of a tuple this way too, but it's less of a normal
thing to do.
It's also possible for functions/methods to return lists. e.g. the string method .split
can split a string into parts around a specific delimiter.
date_string = "2024/06/19" parts = date_string.split("/") print(parts)
['2024', '06', '19']
The range
function we have been using with for
loops doesn't return a list (it returns an object that can generate the values in the range as needed), but it can be converted to a list.
print(range(10)) print(type(range(10))) print(list(range(10)))
range(0, 10) <class 'range'> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
The reason it's not a list: range(1000000000)
doesn't take much space in memory. A list with a trillion elements does.
Unlike other types we have seen, lists can be modified after their initial creation. For example, the .append
method adds another element to the end of a list.
fruits = ["orange", "banana", "pear"] fruits.append("kiwi") print(fruits)
['orange', 'banana', 'pear', 'kiwi']
Note the distinction being made here: for a string (or integer or float or tuple), you can change the contents of a variable that holds one of them, but you always do so by creating a brand new instance of that type.
favourite = "apple" favourite = "mango" favourite = favourite + "s"
But for a list, the one that's already in memory gets updated without being totally recreated.
fruits = ["orange", "banana", "pear"] fruits.append("kiwi")
Python's lists have several useful methods to work with them. For example, .append
, or we can easily sort an existing list.
fruits.sort() print(fruits)
['banana', 'kiwi', 'orange', 'pear']
The .sort
method promises \(n\log n\) running time for a list of \(n\) elements (which is the best worst-case running time possible in general for a sorting algorithm).
We could write a program like this to get a bunch of numbers from the user as input, then sort them:
num = int(input("How many numbers? ")) values = [] for i in range(num): x = float(input("Next number: ")) values.append(x) print("Here they are in order:") values.sort() print(values)
When run it would look like:
How many numbers? 4 Next number: 18.2 Next number: 4.3 Next number: -1.2 Next number: 100.1 Here they are in order: [-1.2, 4.3, 18.2, 100.1]
We have seen code shaped like this many times:
values = [] for i in range(num): x = float(input("Next number: ")) values.append(x)
The pattern I'm referring to: start with an empty
thing, and build the result as you loop through your values (in whatever form).
The accumulator variable (i.e. values
there) is used to build the final result step-by-step.
This accumulator pattern can be useful for building lists:
values = [] for i in range(10): values.append(random.randint(1, 100)) print(values)
Or for summing values (in total
here):
total = 0 pos = 0 while total < 100: total = total + values[pos] pos = pos + 1 print(total)
Or building some other collection:
letters = "" for i in range(26): letters = letters + chr(97 + i) print(letters)
Those three examples will output something like:
[61, 37, 9, 100, 39, 77, 33, 8, 40, 49] 107 abcdefghijklmnopqrstuvwxyz
I don't care about the accumulator pattern specifically, but we have seen it so many times, it might as well have a name.
There are countless patterns like this in programming. A few have names; most just are things people have seen/done before so they recognise when they're needed.
Maybe we could call pos
here a counter pattern
(even though nobody else does).
total = 0 pos = 0 while total < 100: total = total + values[pos] pos = pos + 1 print(total)
[Everybody would call pos
a counter
, though.]
A lot of things that can be done with lists are identical to strings.
They can be joined (concatenated) with a +
operator:
cmpt = ["CMPT 120", "CMPT 125"] macm = ["MACM 101", "MACM 201"] courses = cmpt + macm print(courses)
['CMPT 120', 'CMPT 125', 'MACM 101', 'MACM 201']
Elements can be extracted by indexing (like characters in a string):
print(courses[0]) print(courses[1])
CMPT 120 CMPT 125
Unlike strings, we can modify a single element by assigning to it:
courses[3] = "CMPT 201" print(courses)
['CMPT 120', 'CMPT 125', 'MACM 101', 'CMPT 201']
Also unlike strings, we have seen some methods that modify an existing list: .sort()
, .append(x)
. Also .reverse()
.
Lists can be repeated with *
and a number.
This can be useful to create an initial list of values. e.g. suppose we want to count the number of times 10 different things occur:
counts = [0] * 10 print(counts)
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
then write whatever loop and when event i
happens:
counts[i] = counts[i] + 1
There's a method on strings that might be useful around lists: if you have a list of strings, they can be joined with a delimter using the string .join
method.
words = ["one", "two", "three", "four"] added = "+".join(words) thened = ", then ".join(words) print(added) print(thened)
one+two+three+four one, then two, then three, then four
Converting a string to a list will give you a list of each character:
string = "characters" chars = list(string) print(chars)
['c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 's']
When we could use .join
to put it back together if we want.
string2 = "".join(chars) print(string2)
characters
So far, we have been using the for
loop almost entirely with range
.
But for
can iterate over any collection
in Python: range
is one, but so is a list.
for colour in ["red", "green", "pink", "chartreuse"]: print(colour)
red green pink chartreuse
We could use this with the accumulator pattern to select some elements from a list:
colours = ["black", "yellow", "aquamarine", "rebeccapurple", "pink", "lime", "magenta", "palegreen", "red"] long_names = [] for c in colours: if len(c) > 6: long_names.append(c) print(long_names)
['aquamarine', 'rebeccapurple', 'magenta', 'palegreen']
Iterating over a list like that is the nicest way to go through a list's elements, but sometimes you might also need to know where the element came from. You can always index the list as you go through it: it's easy to create a range
of the positions in the list.
for i in range(len(colours)): c = colours[i] if len(c) > 6: print(c + " in position " + str(i))
But there's something a little ugly about creating the index (i
) just to immediately look back to the list.
There's a built-in function enumerate
that produces pairs of the position and the elements in a list.
e.g. for our colours
list, it will produce tuples (0, "black")
, (1, "yellow")
, (2, "aquamarine")
, … .
for i, c in enumerate(colours): if len(c) > 6: print(c + " in position " + str(i))
Along with indexing a single element, it's possible to take a slice out of a list or string.
words = ["swim", "slide", "reporter", "complex", "battery", "raise", "feature", "peanut", "aviation"] print(words[2]) print(words[2:5])
reporter ['reporter', 'complex', 'battery']
The slice notation [2:5]
refers to the sub-list of values 2
to 5-1 == 4
. (Like with range
, slicing does not include the end
value.)
There's a shortcut for from the start
and to the end
when slicing: just omit the corresponding slice position:
title = "Slicing & Dicing" print(title[4:7]) # positions 4 to 6 print(title[:7]) # the start to position 6 print(title[10:]) # position 10 to the end
ing Slicing Dicing
The one before the endpoint
behaviour of range
and slicing might annoy you, but it makes sense if you're thinking about the length of the collection: range(a, b)
thing[a:b]
both contain b - a
Similarly, range(n)
thing[:n]
both contain n
Negative indexes count from the end of a string or list: -1
is the last element, -2
is the second last, etc.
vowels = "aeiouy" first_primes = [2, 3, 5, 7, 11, 13, 17, 19] print(vowels[-1]) print(first_primes[-1]) print(vowels[-3:]) print(first_primes[1:-1])
y 19 ouy [3, 5, 7, 11, 13, 17]
These imply easy ways to express all but the first
and all but the last
:
odd_primes = first_primes[1:] usual_vowels = vowels[:-1] print(odd_primes) print(usual_vowels)
[3, 5, 7, 11, 13, 17, 19] aeiou
These probably come up the most often of all slices: if you're processing a list, it's common to deal with thelist[0]
and then have thelist[1:]
remaining (or thelist[-1]
and then thelist[:-1]
).
A few slides ago: unlike other types we have seen, lists [and
.Image
s] can be modified after their initial creation
The ability to modify an object after their initial creation is an important property of a type: mutability.
Types that cannot be changed after they are created are immutable.
Strings are immutable: we can create many different strings, but each one is unchangable. This code created three different strings:
favourite = "apple" favourite = "mango" favourite = favourite + "s"
With a string, we cannot modify in-place, only recreate (e.g. build a whole new string and put it in a variable with a =
).
favourite[0] = "t"
That fails with a TypeError
.
To be clear, what we're asking for here:
favourite = "mango" favourite = favourite + "s"
"mango"
somewhere in memory.favourite
refer to that thing in memory."mangos"
somewhere in memory.favourite
refer to that."mango"
because it's not needed anymore.With a list, we can modify the existing instance in a few ways.
fruits = ["orange", "banana", "pear"] fruits.append("kiwi") fruits.append("cherry") fruits[1] = "strawberry" print(fruits) fruits.sort() print(fruits)
['orange', 'strawberry', 'pear', 'kiwi', 'cherry'] ['cherry', 'kiwi', 'orange', 'pear', 'strawberry']
You can also build a new list by manipulating one that already exists, for example +
can be used to join lists (also like strings):
fruits = ["orange", "banana", "pear"] fruits = fruits + ["kiwi"]
But I'd expect that to be slower (if we measured carefully enough) than changing the one that's already there.
fruits = ["orange", "banana", "pear"] fruits.append("kiwi")
Here's we're aking for:
fruits = ["orange", "banana", "pear"] fruits.append("kiwi")
fruits
refer to that.It's less work than completely reconstructing the entire list.
So mutable types offer more flexibilty in how they are modified.
On the other hand, immutable types are more predictable: our values can only be assigned by a variable assignment statement, not by anything else.
For an immutable type, anything we have in a variable is nicely protected: the only thing that can change it is reassigning the variable.
def add_one(n): n = n + 1 value = 0 add_one(value) print(value)
0
The
in the function only affects a local variable within the function.n = …
On the other hand, a mutable type might be modified if we pass it as a function argument.
def append_one(lst): lst.append(1) values = [3, 4, 5] append_one(values) print(values)
[3, 4, 5, 1]
That can make it harder to know if our values are being changed, unless we know/trust the function.
Basically, if there's no way to change a value without an assignment statement (var = …
), then that value is immutable.
That includes short-form assignments like +=
. These are equivalent, but integers are still immutable.
count = count + 1
count += 1
We have seen several situations where a value must be duplicated. Assignment to a new variable:
a = "whatever" b = a
Or passing an argument to a function:
def f(x): return x / 2 a = 2.0 print(f(a))
In both cases, the contents of a
get copied
.
The details of how Python copies
values are important.
First: when a variable is created, its contents are stored somewhere in memory and the variable name refers to that. i.e. the variable is a reference to its contents.
So, code like this:
greeting = "Hello world" values = [10, 20, 30]
… creates something I imagine like this:
If we assign the same object to another variable, Python does not copy the value, it just makes the other variable also refer to it.
greeting = "Hello world" hello = greeting
That's usually good: it saves memory (because only one copy is stored) and is faster (because it's not necessary to copy the potentially-huge object, just the relatively-small reference).
These variables are aliases of the same object, or that object is aliased.
For many data types, we'll never notice aliasing.
count = 1234 old_count = count count = count + 1
After the first two lines:
After the third line, count
is reassigned:
But when combined with a mutable type, aliasing becomes visible.
values = [6, 9, 12] more_values = values more_values.append(15)
After the first two lines:
After the third line:
Really?
values = [6, 9, 12] more_values = values more_values.append(15) print(values) print(more_values)
Yes.
[6, 9, 12, 15] [6, 9, 12, 15]
We saw the same thing happening when passing a mutable type to a function:
def append_one(lst): lst.append(1) values = [3, 4, 5] append_one(values) print(values)
[3, 4, 5, 1]
The argument is passed to the function as a reference, so the function can modify mutable arguments.
If we're giving a mutable value to a function, we had better really trust it. A mutable value that's given to a function (or any other code) can be changed by that code. If it's well-behaved, that will at least be documented.
def reorder(values): """ Modifies values by moving elements so that… """ ⋮
def reorder(values): """ Returns a rearranged copy of values so that… """ ⋮
If doing a variable assignment doesn't copy the value, just the reference, what if we actually want a real copy of a value?
Sometimes, it's necessary to have an independent copy that we can modify separately.
First observation: any expression using a value forces calculation of the result. That will be a new independent value in memory.
values = [6, 9, 12] more_values = values + [15] more_values.append(18) print(values) print(more_values)
Here, more_values
is unrelated to values
.
[6, 9, 12] [6, 9, 12, 15, 18]
For lists, this implies an easy way to make a real copy: slicing all elements. We know that when slicing like [a:b]
, leaving out the a
means from the start
and omitting b
means to the end
. So, the slice [:]
means all elements
.
values = [6, 9, 12] more_values = values[:] more_values.append(15) print(values) print(more_values)
[6, 9, 12] [6, 9, 12, 15]
The [:]
slice is an idiom in Python that's both common and very confusing until you recognize it.
more_values = values[:]
Like many such idioms, it takes a few words of explanation, but once you know the pattern, it's obvious
.
Many classes have a .copy()
method that know how to copy instances of that class (efficiently, one assumes).
e.g. an Image
's .copy()
method can be used if you want to keep an instance of the original image while drawing on another.
from PIL import Image, ImageDraw original = Image.open("Vancouver_Skyline_and_Mountains.jpg") for_drawing = original.copy() draw = ImageDraw.Draw(for_drawing) ⋮ for_drawing.save("result.png") some_other_operation(original)
Or, the copy
module contains a copy
function that will make a copy of any Python object.
import copy value = AnyClassYouHave(1, 2, 3) another = copy.copy(value)
The lists we have been using have a few notable properties:
for x in some_list
.append()
).That's all great, but one notable problem:
Imagine scanning some huge data set for duplicates and needing to quickly decide we have seen that before
.
There are a few ways to create a data structure that is faster to search than a list (later: binary search).
Built into Python is the set data type. Like lists, a set is a collection of values, but:
i.e. they're like sets from math.
Sets can be created with curly braces or the set
function.
colours = {"red", "purple", "green", "blue", "black"} print(colours)
The first time I ran the program:
{'purple', 'red', 'green', 'black', 'blue'}
The next time:
{'blue', 'green', 'purple', 'red', 'black'}
So it doesn't make any sense to index out of a set because there's not well-defined result the language could give: colours[0]
gives a TypeError
.
But we can iterate through the element: they might come in any order, but we will get them all.
for c in colours: print(c)
We can't .append
to a list: the word implies an at the end
behaviour that sets can't promise. We do get an .add
method that does the analogous thing.
colours = {"red", "purple", "green", "blue", "black"} print(colours) colours.add("yellow") print(colours) colours.add("purple") # already in the set, so no effect print(colours)
{'blue', 'green', 'purple', 'red', 'black'} {'blue', 'green', 'purple', 'red', 'yellow', 'black'} {'blue', 'green', 'purple', 'red', 'yellow', 'black'}
So why do we want sets, really? It seems like they are like lists but don't do as much.
First reason: sometimes, having a no duplicates
data structure is really convenient.
e.g. suppose we want to detect repeated letters in a string (just a boolean True
/False
result: repeats or not).
We can convert strings to a set of characters with set()
. Observation: if there are repeated characters, there will be fewer elements in the set than in the original string.
>>> w = "repeated" # three 'e's >>> print(len(w)) 8 >>> print(set(w)) # all 3 'e' become one in the set {'t', 'd', 'e', 'p', 'a', 'r'} >>> print(len(set(w))) 6
That implies a very easy way to detect repeated characters:
def has_duplicate_chars(word): chars_set = set(word) return len(word) != len(chars_set) print(has_duplicate_chars("repeated")) print(has_duplicate_chars("uncopyrightable"))
True False
In general, converting to a set can be a very simple discard duplicates
technique.
Also, sometimes you just have an unordered collection and a set is the most honest
way to think about it.
e.g. for assignment 3, the hands of cards could be sets.
… except the dealer's hand which has a definite first
element (the hole card), so probably stick with lists for the assignment.
But I started talking about sets by saying finding elements in a list is slow. Is it faster for a set?
Finding an element in a list: running time \(n\) because we have to search the whole collection.
How does searching in a set work?
There's a Python operator
that will tell us whether or not a specific element is in a collection.in
>>> some_list = [2, 4, 6, 8] >>> some_set = set(some_list) >>> some_set {8, 2, 4, 6} >>> 4 in some_list True >>> 5 in some_set False
Note: this is very different that the way the word
is used with a in
for
loop: there we are iterating through the elements, not checking membership.
Finding an element in a set: we are promised constant running time (\(O(1)\)) for
(in non-pathological cases: details will have to wait for another course).in
That should be much faster for large lists/sets. Is it really?
I'm going to check with the timeit
module.
The idea: we hand it some code to set up the scenario, and the code that we want to time. It runs the setup code once, and the test code a bunch of times and reports the total elapsed time for the test.
With a collections of 100 values, searching the list took 207 ns on average. For a set, 9 ns on average.
For 1000 values, 2080 ns for the list and 9 ns for the set.
It certainly looks like the running times are proportional to \(n\) for the list and constant for the set.
Some things to note:
It's also possible to time code with %timeit
magic in some Python shells (but sadly, not IDLE). This is in the iPython shell:
In [1]: import random In [2]: n = 1000 In [3]: def random_list(): ...: res = [] ...: for i in range(n): ...: res.append(random.randint(0, n*200)) ...: return res ...: In [4]: rand_list = random_list() In [5]: rand_set = set(rand_list) In [6]: %timeit 5 in rand_list 2.06 µs ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each) In [7]: %timeit 5 in rand_set 11.5 ns ± 0.0455 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)
Something we can infer from the .add
method: sets are mutable.
There are some other methods to modify a set: .remove(x)
to remove a specific element, .pop()
to extract an arbitrary element and remove it.
colours = {"red", "purple", "green", "blue", "black"} colours.remove("red") c = colours.pop() print(c) print(colours)
blue {'green', 'purple', 'black'}
As we have seen, there are occasions where having a mutable data structure isn't desirable: if you want to be sure it doesn't change.
There is a related type frozenset
that is like set
but immutable: no .add
, .pop
, etc.
colours = frozenset({"red", "purple", "green", "blue"})
We probably won't use it, but the lesson: programmers care enough about mutable/immutable types that it was added to the language.
Speaking of mutability…
Because of they way they are implemented, sets can only contain immutable elements. The data structure needs to be able to guarantee the contents of its elements don't change while it's holding them.
>>> set_of_tuples = {(1,2,3), (4,5,6)} >>> set_of_strings = {"abc", "def"} >>> set_of_lists = {[1,2,3], [4,5,6]} TypeError: ...
If lists were allowed in sets, they could be changed after the set was built.
l1 = [1, 2, 3] l2 = [4, 5, 6] set_of_lists = {l1, l2} l1.append(4) l2.append(7)
The way sets can be sure their elements aren't modified: only accept immutable elements.
Yet another reason mutability matters: sometime we have to have immutable types if we want to use them in particular ways.
As we discussed in the Debugging
slides, there are several categories of problems you might have in your programs.
Syntax errors will stop the program from running at all.
Semantic errors will let the program run but do the wrong thing: these are necessarily the programmer's problem.
But in between are runtime error or exceptions: these will stop the program as it's running because some impossible situation occurred.
Some examples:
import do_magic_stuff # causes a ModuleNotFoundError s = "abcd" s[0] = "A" # causes a TypeError c = s[6] # causes an IndexError n = int("ten") # causes a ValueError x = 10 ** 10000 # causes a ValueError x = something_else # causes a NameError x = 10 / 0 # causes a ZeroDivisionError
Exceptions are somehow categorized: TypeErrors
are about a problem relating to the type of values, IndexError
is a problem with indexing, etc.
Python provides a way to catch
exceptions and handle them gracefully. You need to know where the exception might occur and which exception will be. Then…
A try
/except
block will let you decide how to deal with an exception (instead of the program ending with an error message).
numstr = input("How many? ") try: num = int(numstr) except ValueError: print("That wasn't a number") num = 0 print(num)
When run, that code will look like one of these:
How many? 123 123
How many? lots That wasn't a number 0
The behaviour of a try
/except
:
try
runs.except
code is skipped.except
line, the try
stops and except
runs.We can finally enforce good
user input if we want to. The pattern will be something like:
from PIL import Image img = None while img == None: try: filename = input("Image filename: ") img = Image.open(filename) process_image(img) except FileNotFoundError: print("Sorry the image doesn't exist.")
It's considered good style to minimize the amount of code in the try
block: you don't want to catch exceptions that aren't the one you were expecting and/or don't know how to deal with.
Or if you can avoid the excption altogether, that's probably nicer.
e.g. suppose we want to index characters from a string, but return spaces outside of it's real
indices. I'd prefer this:
def index(string, position): if 0 <= position < len(string): return string[position] else: return " "
to this, but that's a stylistic choice:
def index(string, position): try: return string[position] except IndexError: return " "
You can use a try
/except
any time something might cause a runtime error (and you know how to clean up after it).
It's probably better to avoid the error altogether if possible, but that's not always possible/practical.