Getting Data

CMPT 353

Where Data Comes From

Before you can analyse any data, you need to get it into your program.

Possibilities are infinite. A few are common…

Data from Files

Files are probably the most common place you'll get data before processing. You get a file in some format you can deal with and read it in.

In general: it's just programming. Read each line. Process it in whatever way is necessary. Put the results in an array/DataFrame/whatever.
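
e.g. a sketch of that skeleton (the filename and the whitespace-separated format here are assumptions):

rows = []
with open('data.txt') as f:
    for line in f:
        fields = line.split()   # ... process however is necessary
        rows.append(fields)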

Data from Files

There are several common file formats you might come across, and Python libraries will help you with them. We will talk about a few (later).

As usual: don't reinvent solutions if you can just call a function.

Databases

Databases are also common. You connect to the DB with some library and make (SQL) queries to get the data you need.

You'll likely iterate through the rows and do… something.

Won't come up much in this course, but take CMPT 354.

Web APIs

You may get the data you need by calling a web service (or REST API). Basically: you make HTTP requests to the right URL(s) and get back responses.

The first step in your pipeline might be to query the API and store the results locally (in a file, or DB, or something). That avoids hitting the API every time you test step 2.
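
e.g. a minimal sketch with the requests library (the URL and output filename here are made up):

import requests

response = requests.get('https://api.example.com/v1/records', timeout=30)
response.raise_for_status()   # fail loudly if the API returned an error

# Store the raw response so later pipeline steps can re-read it
# without hitting the API again.
with open('api_results.json', 'wb') as f:
    f.write(response.content)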

Web APIs

The response is generally JSON or some other structured format: handle it as you would the same data from a file.

Remember that APIs aren't 100% reliable: handle errors appropriately.

Web APIs

For example: in the Twitter API, you could get a user's recent tweets with the statuses/user_timeline API call. You get back JSON data representing the tweets, like:

[
    {
        "created_at": "Thu Apr 06 12:34:56 +0000 2024",
        "text": "This is a tweet that I am tweeting.",
        ⋮
    },
    {
        "created_at": "Thu Apr 07 12:34:57 +0000 2024",
        "text": "This is a second tweet.",
        ⋮
    }
]

Scraping HTML

Sometimes the data you need is available by HTTP, but not in a machine-friendly way. Then you end up fetching HTML pages and trying to extract the data from them.

This may be against the terms of service of many web sites, and you're at their mercy if they block you or change their HTML structure.

Scraping HTML

Scraping is unpleasant but possible if the HTML is structured in a predictable way.

If the HTML is nice, you can use an XML parser (more later). If not, you can parse broken HTML with Beautiful Soup or similar.
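
e.g. a sketch with requests and Beautiful Soup (the URL and the page's table structure here are assumptions):

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://example.com/cities.html')
soup = BeautifulSoup(resp.text, 'html.parser')

# Pull the cell text out of each row of the (assumed) table.
for row in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    print(cells)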

File Formats

There are some commonly-used file formats where Python/NumPy/Pandas can help us a lot.

Remember that you always have the option of reading and manually constructing a DataFrame if the file you get is in some odd format.

CSV

CSV (Comma-Separated Values) is probably the most universal interchange format for data: can be read/​written by spreadsheets and programs.

Cells within a row are separated by commas; rows are separated by newlines. By convention, the first row gives the column headings. e.g.

city,population,area
Vancouver,2463431,2878.52
Calgary,1392609,5110.21
Toronto,5928040,5905.71

CSV

The Python csv module can read and write CSV files. Typically, you handle the data row-by-row.
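
e.g. reading the city data above row-by-row:

import csv

with open('city_data.csv', newline='') as f:
    rows = csv.reader(f)
    header = next(rows)   # first row: the column headings
    for city, population, area in rows:
        print(city, int(population), float(area))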

For us, pd.read_csv and DataFrame.to_csv are probably more useful.

import pandas as pd
city_data = pd.read_csv('city_data.csv', index_col='city')
print(city_data)
           population     area
city
Vancouver     2463431  2878.52
Calgary       1392609  5110.21
Toronto       5928040  5905.71

CSV

CSV files can be harder than they look. Is the first line really headings? Which CSV dialect? Should a particular string be parsed to a date/​boolean/​NaN?

Be careful when you think it's just CSV. Watch the read_csv arguments.
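
e.g. a sketch of some commonly-needed arguments (the filename, separator, and columns here are made up):

data = pd.read_csv('measurements.csv',
    sep=';',                    # some "CSV" files use another separator
    header=0,                   # is the first line really headings?
    na_values=['', 'N/A'],      # which strings should become NaN?
    parse_dates=['timestamp'],  # which columns hold dates?
)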

JSON

JSON (JavaScript Object Notation) is a subset of JavaScript syntax for specifying values.

Very flexible. Very commonly understood. Very commonly used for APIs and data interchange.

JSON

JSON values:

  • Objects (dicts): {"key1": value1, "key2": value2} (keys must be strings)
  • Arrays (lists): [a, b, c]
  • Strings: "abc\n"
  • Numbers: 12, 34.567
  • Special values: true, false, null
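
These can nest arbitrarily, e.g.

{
    "name": "Vancouver",
    "coords": [49.28, -123.12],
    "is_capital": false,
    "notes": null
}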

JSON

Again, there's a built-in Python json module.

And also Pandas' pd.read_json and DataFrame.​to_json.
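
e.g. with the json module (the filename is an assumption; assume it holds a JSON array of objects):

import json

with open('city_data.json') as f:
    data = json.load(f)   # parse the file into Python lists/dicts
print(data[0]['city'])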

JSON

There's even more choice when it comes to how to represent data as JSON. Pandas will read/​write several representations.

city_data = pd.read_csv('city_data.csv')
print(city_data.to_json())
{"city":{"0":"Vancouver","1":"Calgary","2":"Toronto"},"population":{"0":2463431,"1":1392609,"2":5928040},"area":{"0":2878.52,"1":5110.21,"2":5905.71}}
print(city_data.to_json(orient='records'))
[{"city":"Vancouver","population":2463431,"area":2878.52},{"city":"Calgary","population":1392609,"area":5110.21},{"city":"Toronto","population":5928040,"area":5905.71}]

JSON

The one-object-per-line format is very common in the big data world.

print(city_data.to_json(orient='records', lines=True))
{"city":"Vancouver","population":2463431,"area":2878.52}
{"city":"Calgary","population":1392609,"area":5110.21}
{"city":"Toronto","population":5928040,"area":5905.71}

Each line is a JSON object, but the file as a whole isn't legal JSON. This format is sometimes called NDJSON (newline-delimited JSON, .ndjson) or JSON Lines (.jsonl).
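
Pandas can read it back the same way (the filename is assumed):

city_data = pd.read_json('city_data.ndjson', orient='records', lines=True)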

JSON

But it's totally possible that you'll find a .json file that is in a format that Pandas can't read directly.

[{"Vancouver": [2463431, 2878.52]},
{"Calgary": [1392609, 5110.21]},
{"Toronto": [5928040, 5905.71]}]

JSON

Then: programming.

import json

def input_to_rows(in_data):
    """
    Take weird input format & generate (city,pop,area) tuples.
    """
    for record in in_data:
        # each record is a one-item dict: {city: [population, area]}
        city, (population, area) = next(iter(record.items()))
        yield city, population, area

orig_data = json.loads(json_data)
city_data = pd.DataFrame(input_to_rows(orig_data),
                         columns=['city', 'population', 'area'])
print(city_data)
        city  population     area
0  Vancouver     2463431  2878.52
1    Calgary     1392609  5110.21
2    Toronto     5928040  5905.71

JSON

Aside: be careful when manipulating data like this. It's easy to end up with many copies of the data in memory.

Use iterators/​generators where you can. Minimize intermediate steps.

Use del to remove copies if necessary, or do the work inside functions so local variables are freed automatically when they return.
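
e.g. for the example above:

orig_data = json.loads(json_data)
city_data = pd.DataFrame(input_to_rows(orig_data),
                         columns=['city', 'population', 'area'])
del orig_data   # drop the reference so the raw copy can be freed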

XML

XML (eXtensible Markup Language) is another format for structured data. It's even more flexible than JSON, which is both good and bad.

Basically: like HTML, but with any tags/​attributes you need for your data.

XML

It's easy to come up with an XML format for whatever data you have.

<?xml version="1.0" ?>
<cities>
  <city area="2878.52" population="2463431">Vancouver</city>
  <city area="5110.21" population="1392609">Calgary</city>
  <city area="5905.71" population="5928040">Toronto</city>
</cities>

… but there are so many choices about storing values as attributes/text/sub-elements that it can be hard to decide.

XML

Realistically, if you have a choice, use JSON.

If you get XML, then you have to deal with it.

XML

There are many standard XML formats for common data types. e.g.

  • GPX: geographical (latitude, longitude, elevation) data.
  • MusicXML: musical notation/​scores.
  • Atom: blog/​news feeds.
  • OpenDocument: word processing documents, spreadsheets, etc.
  • MathML: mathematical formulas.
  • SVG: Scalable Vector Graphics.

XML

There are many libraries to read and write XML documents. Choose your favourite.

If you have done DOM manipulation with JavaScript, then xml.dom and xml.dom.minidom will feel comfortable.

Or xml.etree.ElementTree, or work-alikes like lxml.

XML

Usually XML data files are predictable enough that extracting the data is fairly straightforward.

Different libraries may have nicer/​uglier APIs. They may be faster/​slower. They are fundamentally all doing the same job.

XML

from xml.dom.minidom import parseString

def element_to_data(elem):
    population = int(elem.getAttribute('population'))
    area = float(elem.getAttribute('area'))
    city = elem.childNodes[0].data
    return city, population, area

city_doc = parseString(xml_data)
city_elements = city_doc.getElementsByTagName('city')
city_data = pd.DataFrame(list(map(element_to_data, city_elements)),
                         columns=['city', 'population', 'area'])
print(city_data)
        city  population     area
0  Vancouver     2463431  2878.52
1    Calgary     1392609  5110.21
2    Toronto     5928040  5905.71

XML

import xml.etree.ElementTree as ET

def element_to_data(elem):
    population = int(elem.get('population'))
    area = float(elem.get('area'))
    city = elem.text
    return city, population, area

root = ET.fromstring(xml_data)
city_elements = root.iter('city')
city_data = pd.DataFrame(list(map(element_to_data, city_elements)),
                         columns=['city', 'population', 'area'])
print(city_data)
        city  population     area
0  Vancouver     2463431  2878.52
1    Calgary     1392609  5110.21
2    Toronto     5928040  5905.71

XML

Aside: Most XML parsers read the whole file into memory, which is usually okay. If you have a large file (like a Wikipedia data dump: a single 50 GB XML file), then that's not practical.

XML

Some XML parsers can give you elements as they are found in the input file: streaming parsers. e.g.

from lxml import etree
# Handle each element at its 'end' event: by then its text and
# children have been fully parsed.
for event, elem in etree.iterparse(xml_file, events=('end',)):
    if elem.tag == 'city':
        print(elem.text)  # ... or whatever else you need
    elem.clear()  # discard finished elements to keep memory use low

Others

There is a limitless variety of other systems and file formats where your data might live.

Figuring out how to get the data out and into a usable format will be the first challenge.