Before you can analyse any data, you need to get it into your program.
Possibilities are infinite. A few are common…
Files are probably the most common place you'll get data before processing. You get a file in some format you can deal with and read it in.
In general: it's just programming. Read each line. Process it in whatever way is necessary. Put the results in an array/DataFrame/whatever.
There are several common file formats you might come across, and the libraries will help you. We will talk about a few (later).
As usual: don't reinvent solutions if you can just call a function.
Databases are also common. You connect to the DB with some library and make (SQL) queries to get the data you need.
You'll likely iterate through the rows and do… something.
Won't come up much in this course, but take CMPT 354.
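As a minimal sketch of the connect-and-query pattern, using Python's built-in sqlite3 module (the database file and table here are made up):

import sqlite3

conn = sqlite3.connect('cities.db')  # hypothetical database file
for name, population in conn.execute(
        'SELECT name, population FROM cities'):  # hypothetical table
    print(name, population)  # ... or collect into a DataFrame
conn.close()

Pandas can also run a query straight into a DataFrame with pd.read_sql.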
You may get the data you need by calling a web service (or REST API). Basically: you make HTTP requests to the right URL(s) and get back responses.
The first step in your pipeline might be to query the API and store the results locally (in a file, or DB, or something). That avoids hitting the API every time you test step 2.
The response is generally JSON or some other format: handle it like the same data from a file.
Remember that APIs aren't 100% reliable: handle errors appropriately.
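A sketch of that pattern with the requests library (the URL and parameters are placeholders, not a real API):

import requests

# Placeholder endpoint: substitute the real API URL and parameters.
resp = requests.get('https://api.example.com/v1/tweets',
                    params={'user': 'someone'}, timeout=10)
resp.raise_for_status()  # raise an exception on a 4xx/5xx response

# Save the raw response locally so later steps don't re-hit the API.
with open('api_response.json', 'w') as f:
    f.write(resp.text)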
For example: in the Twitter API, you can get all of a user's tweets with the statuses/user_timeline API call. You get back JSON data representing the tweets like:
[ { "created_at": "Thu Apr 06 12:34:56 +0000 2024", "text": "This is a tweet that I am tweeting.", ⋮ }, { "created_at": "Thu Apr 07 12:34:57 +0000 2024", "text": "This is a second tweet.", ⋮ } ]
Sometimes the data you need is available by HTTP, but not in a machine-friendly way. Then you end up fetching HTML pages and trying to extract the data from them.
This is likely against the terms-of-service of most web sites, and you're at their mercy if they block you or change their HTML structure.
Scraping is unpleasant but possible if the HTML is structured in a predictable way.
If the HTML is nice, you can use an XML parser (more later). If not, you can parse broken HTML with Beautiful Soup or similar.
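A minimal Beautiful Soup sketch, assuming a page where the data sits in table rows (the URL and the tag structure are invented):

import requests
from bs4 import BeautifulSoup

page = requests.get('https://example.com/cities.html', timeout=10)
soup = BeautifulSoup(page.text, 'html.parser')

# Assumes each data row is a <tr> with its fields in <td> cells.
for row in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        print(cells)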
There are some file formats that are commonly used and Python/NumPy/Pandas can help us a lot.
Remember that you always have the option of reading and manually constructing a DataFrame if the file you get is in some odd format.
CSV (Comma-Separated Values) is probably the most universal interchange format for data: can be read/written by spreadsheets and programs.
Cells within a row are separated by commas; rows are separated by newlines. By convention, the first row is column headings. e.g.
city,population,area
Vancouver,2463431,2878.52
Calgary,1392609,5110.21
Toronto,5928040,5905.71
The Python csv module can read and write CSV files. Typically, you handle the data row-by-row.
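e.g. a sketch reading the file above row-by-row (csv.reader gives each row as a list of strings, so you convert types yourself):

import csv

with open('city_data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the heading row
    for city, population, area in reader:
        print(city, int(population), float(area))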
For us, pd.read_csv and DataFrame.to_csv are probably more useful.
import pandas as pd

city_data = pd.read_csv('city_data.csv', index_col='city')
print(city_data)
           population     area
city
Vancouver     2463431  2878.52
Calgary       1392609  5110.21
Toronto       5928040  5905.71
CSV files can be harder than they look. Is the first line really headings? Which CSV dialect? Should a particular string be parsed to a date/boolean/NaN?
Be careful when you think it's just CSV. Watch the read_csv arguments.
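For example, a hypothetical messy file might need several of them:

# Invented example: semicolon-separated, no heading row,
# dates to parse, and '-' standing in for missing values.
data = pd.read_csv('messy.csv', sep=';', header=None,
                   names=['city', 'founded', 'population'],
                   parse_dates=['founded'], na_values=['-'])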
JSON (JavaScript Object Notation) is a subset of JavaScript syntax for specifying values.
Very flexible. Very commonly understood. Very commonly used for APIs and data interchange.
JSON values:
- objects: {"key1": value1, "key2": value2}
- arrays: [a, b, c]
- strings: "abc\n"
- numbers: 12, 34.567
- the literals true, false, null
Again, there's a built-in Python json module.
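e.g. loading a (hypothetically-named) file into ordinary Python values:

import json

with open('cities.json') as f:
    data = json.load(f)  # now plain Python lists/dicts/strings/numbers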
And also Pandas' pd.read_json and DataFrame.to_json.
There's even more choice when it comes to how to represent data as JSON. Pandas will read/write several representations.
city_data = pd.read_csv('city_data.csv')
print(city_data.to_json())
{"city":{"0":"Vancouver","1":"Calgary","2":"Toronto"},"population":{"0":2463431,"1":1392609,"2":5928040},"area":{"0":2878.52,"1":5110.21,"2":5905.71}}
print(city_data.to_json(orient='records'))
[{"city":"Vancouver","population":2463431,"area":2878.52},{"city":"Calgary","population":1392609,"area":5110.21},{"city":"Toronto","population":5928040,"area":5905.71}]
The one-object-per-line format is very common in the big data world.
print(city_data.to_json(orient='records', lines=True))
{"city":"Vancouver","population":2463431,"area":2878.52} {"city":"Calgary","population":1392609,"area":5110.21} {"city":"Toronto","population":5928040,"area":5905.71}
Each line is a JSON object, but the file as a whole isn't legal JSON. This format is sometimes called NDJSON (newline-delimited JSON, .ndjson).
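Pandas can read it back if you give the matching argument (filename assumed):

city_data = pd.read_json('city_data.ndjson', lines=True)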
But it's totally possible that you'll find a .json file in a format that Pandas can't read directly.
[{"Vancouver": [2463431, 2878.52]}, {"Calgary": [1392609, 5110.21]}, {"Toronto": [5928040, 5905.71]}]
Then: programming.
import json

def input_to_rows(in_data):
    """
    Take weird input format & generate (city, pop, area) tuples.
    """
    for record in in_data:
        city, (population, area) = next(iter(record.items()))
        yield city, population, area

# json_data holds the JSON text shown above.
orig_data = json.loads(json_data)
city_data = pd.DataFrame(input_to_rows(orig_data),
    columns=['city', 'population', 'area'])
print(city_data)
        city  population     area
0  Vancouver     2463431  2878.52
1    Calgary     1392609  5110.21
2    Toronto     5928040  5905.71
Aside: be careful when manipulating data like this. It's easy to end up with many copies of the data in memory.
Use iterators/generators where you can. Minimize intermediate steps.
Use del to remove copies if necessary, or write functions and have local variables destroyed automatically.
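e.g. wrapping the earlier conversion in a function means the parsed-but-unconverted data is freed as soon as the function returns:

def load_city_data(json_data):
    # orig_data is local: it can be garbage-collected when we return.
    orig_data = json.loads(json_data)
    return pd.DataFrame(input_to_rows(orig_data),
                        columns=['city', 'population', 'area'])

city_data = load_city_data(json_data)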
XML (eXtensible Markup Language) is another format for structured data. It's even more flexible than JSON, which is both good and bad.
Basically: like HTML, but with any tags/attributes you need for your data.
It's easy to come up with an XML format for whatever data you have.
<?xml version="1.0" ?> <cities> <city area="2878.52" population="2463431">Vancouver</city> <city area="5110.21" population="1392609">Calgary</city> <city area="5905.71" population="5928040">Toronto</city> </cities>
… but there are so many choices about storing values as attributes/text/sub-elements that it's hard to decide.
Realistically, if you have a choice, use JSON.
If you get XML, then you have to deal with it.
There are many standard XML formats for common data types, e.g. SVG for vector graphics, GPX for GPS tracks, RSS and Atom for feeds.
There are many libraries to read and write XML documents. Choose your favourite.
If you have done DOM manipulation with JavaScript, then xml.dom and xml.dom.minidom will feel comfortable.
Or xml.etree.ElementTree, or work-alikes like lxml.
Usually XML data files are predictable enough that extracting the data is fairly straightforward.
Different libraries may have nicer/uglier APIs. They may be faster/slower. They are fundamentally all doing the same job.
from xml.dom.minidom import parseString

def element_to_data(elem):
    population = int(elem.getAttribute('population'))
    area = float(elem.getAttribute('area'))
    city = elem.childNodes[0].data
    return city, population, area

# xml_data is the XML document from above, as a string.
city_doc = parseString(xml_data)
city_elements = city_doc.getElementsByTagName('city')
city_data = pd.DataFrame(list(map(element_to_data, city_elements)),
    columns=['city', 'population', 'area'])
print(city_data)
        city  population     area
0  Vancouver     2463431  2878.52
1    Calgary     1392609  5110.21
2    Toronto     5928040  5905.71
import xml.etree.ElementTree as ET

def element_to_data(elem):
    population = int(elem.get('population'))
    area = float(elem.get('area'))
    city = elem.text
    return city, population, area

# xml_data is the same XML document as above.
root = ET.fromstring(xml_data)
city_elements = root.iter('city')
city_data = pd.DataFrame(list(map(element_to_data, city_elements)),
    columns=['city', 'population', 'area'])
print(city_data)
        city  population     area
0  Vancouver     2463431  2878.52
1    Calgary     1392609  5110.21
2    Toronto     5928040  5905.71
Aside: Most XML parsers read the whole file into memory, which is usually okay. If you have a large file (like a Wikipedia data dump: a single 50 GB XML file), then that's not practical.
Some XML parsers can give you elements as they are found in the input file: streaming parsers. e.g.
from lxml import etree

# xml_file is a filename (or open file object).
# 'end' events fire once an element is completely parsed,
# so its text is guaranteed to be available.
for event, elem in etree.iterparse(xml_file, events=('end',)):
    if elem.tag == 'city':
        print(elem.text)  # ... or whatever else you need
    elem.clear()  # discard the element so memory use stays bounded
There is a limitless variety of other systems and file formats where your data might live, e.g. Excel spreadsheets, Parquet or HDF5 files, or some program's own binary format.
Figuring out how to get the data out and into a usable format will be the first challenge.
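For some common formats, Pandas has a reader built in, e.g. (filenames invented; these need an extra package or two installed, such as openpyxl and pyarrow):

sheet = pd.read_excel('spreadsheet.xlsx')
table = pd.read_parquet('data.parquet')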