From Data to Code

Objectives

  • To explore a “real-world” dataset for use with Python.

  • To use Python to interact with data in a common, widely used format (JSON).

  • To explore the data structures and syntactical patterns of the Python language.

Instructions

The following notebook has cells containing Python code for interacting with an external data file.

  1. Before you begin, each team should assign the following roles:

    • Notetaker (records the outcome of team discussions)

    • Reporter (speaks for the team when sharing out)

    • Advocate (makes sure everyone has a chance to contribute)

  2. Read the section below, Getting to Know Your Data, and as a team, do the exercises marked Try it out! and For discussion.

  3. Teams will report out to the larger group.

  4. You’ll work through the code in the notebook as a team.

    • Run each cell of code in this notebook (by pressing Control + Enter on a PC, Command + Return on a Mac). The output of running the code will appear below the cell.

    • Each cell is accompanied by annotations, and in some cases, questions for discussion. Discuss your responses to these questions with your team. The Notetaker should briefly document the conversation (including any further questions or points of confusion that arise).

    • Blank cells labeled Try it out! invite you to write your own code based on the provided examples. Run your code and discuss the output with your team.

  5. Once everyone has worked through the notebook, we’ll review the questions and your responses in the larger group.

I. Getting to Know Your Data

I.1 Data have structure

About the Data

Over the next few days, we’ll be working together on a dataset containing information about textbooks assigned by courses at GW for the Fall 2023 semester.

Textbooks are linked from GW’s Schedule of Classes. (Note that we’re using data for courses taught on the Foggy Bottom campus only.)

The Find Books link under each course entry (see image below) leads to a page on the GW Campus Store website that lists the textbooks for the course, along with the Campus Store price of each book.

Screenshot of a row from the GW Schedule of Classes table, showing information for the course Consuming Asian America. The link with text "Find Books" is highlighted by a red box.

Try it out!

  1. Take a moment to look at the Schedule of Classes and the linked textbook data.

  2. As a team, discuss possible uses for data about courses and textbooks. Who are the potential users, and what purposes might they have for these data? Make notes on your answers. (We call these scenarios user stories.)

  3. After generating a few user stories, discuss how you might organize such a dataset. Try not to focus too much on how the data is presented on the GW or GW Campus Store websites. Rather, given your user stories, and using what you know about courses and textbooks at GW, consider the following questions:

    • What data elements would be useful to capture?

    • How do those data elements relate to each other? (For example, a course might have multiple sections, and each section might have a separate instructor.)

    • How would you structure (organize) a dataset so as to capture those relationships?

  4. Make a drawing that models the data structure your team discussed in answer to the previous questions.

I.2 Data have a format

When we talk about data structure, we mean how the elements of a dataset relate to each other. The structure describes something about that part of the world to which the data refer.

The data format, on the other hand, is how the dataset is represented by the computer. The same format can be used to represent many different kinds of data structures, just as data with a given structure can be represented in many different formats. (As an analogy, think about music: a Baroque fugue and a blues song have very different musical structures, but both can be recorded as an MP3, which in this case is the data format.)

The dataset we’re using is in JSON format. JSON (usually pronounced jay-sawn) is a common format for sharing data on the web. It’s not as concise or human-readable as some formats (e.g., CSV, which is often used for sharing tabular data). But it has a few advantages that make it popular with programmers:

  • JSON data can be deeply nested, reflecting hierarchical relationships between data elements.

  • The JSON format comprises structures that map well onto the most common data structures used by modern programming languages.

  • The JSON syntax has a lot in common with languages like Python.

We’ll explore these three aspects of JSON today.
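As a rough preview of the second and third points, the sketch below parses a small, made-up JSON record with Python’s built-in json module. The record is not taken from the GW dataset, and the field names title, sections, number, and instructor are hypothetical. Notice how closely the JSON text resembles Python code, and how its braces and brackets turn into Python types.

```python
import json

# A made-up JSON record, written as a Python string.
# The field names here are hypothetical, not taken from the GW dataset.
sample_json = '''
{
  "title": "Intro to Widgets",
  "sections": [
    {"number": 10, "instructor": "A. Smith"},
    {"number": 11, "instructor": "B. Jones"}
  ]
}
'''

# json.loads converts JSON text into Python values
sample = json.loads(sample_json)

print(type(sample))              # the outer {...} becomes a Python dict
print(type(sample["sections"]))  # the nested [...] becomes a Python list
```

The nested list inside the outer record is an example of the hierarchical relationships mentioned in the first point.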

For discussion

Open the GW bookstore dataset in your browser. As a team, discuss the following questions:

  1. What do you notice about the structure of this dataset? How are the various elements related to one another?

  2. How does it compare to the model you drew in the previous exercise?

  3. What data elements seem to be always present throughout the dataset?

  4. Are there elements that appear only sometimes? If so, make a list of those.

  5. Which elements might be useful for the user stories you developed in the first exercise?

  6. Do you see any elements that suggest user stories you hadn’t thought of?

Now draw the structure of this dataset, based on what you’ve observed.

II. Working with Data in Code

II.1 Loading the dataset

The code below will fetch data from a URL and save it locally as a file before loading the data in your notebook.

  1. We import a Python function called urlretrieve and a Python module called json.

    • The urlretrieve function allows us to fetch data from a remote source and save a local copy.

    • The json module allows us to convert data in JSON format to Python types. (More on types below).

  2. We call urlretrieve with the URL of the dataset and a filename, bookstore-data.json. This downloads the data and saves a local copy under that name. The .json extension indicates that this is a JSON-formatted file. (The file extension is part of the filename, like .docx for Word documents or .xlsx for Excel spreadsheets.)

  3. We use the open function to open the local file for reading.

  4. The open file is assigned to the temporary variable f.

  5. We use the json.load function to read the file (f). This function is specifically designed for JSON files; it won’t work if the file does not contain data in valid JSON format.

  6. The contents of the file, as processed by json.load, are assigned to a new variable, bkst_data.

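from urllib.request import urlretrieve
import json

# Download the dataset and save a local copy
urlretrieve('https://go.gwu.edu/pythoncampdata', 'bookstore-data.json')
# Open the local copy and convert its JSON contents to Python types
with open('bookstore-data.json') as f:
    bkst_data = json.load(f)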
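The steps above note that json.load only works on valid JSON. The sketch below illustrates what that looks like in practice, without downloading anything. It uses json.loads, a sibling function that reads from a string rather than a file, and two made-up strings (the value "HIST" is just a placeholder). The try/except pattern is something you may not have seen yet; it lets the code report an error instead of stopping.

```python
import json

# Two made-up strings: the first is valid JSON, the second is cut off
valid_text = '[{"department": "HIST"}]'
invalid_text = '[{"department": "HIST"'

# json.loads (note the s) works like json.load, but reads from a string
courses = json.loads(valid_text)
print(courses)

# Parsing the broken text raises an error, which we catch and report
try:
    json.loads(invalid_text)
    parsed_ok = True
except json.JSONDecodeError as err:
    parsed_ok = False
    print('Not valid JSON:', err)
```

If you ever see a JSONDecodeError when loading a file, the file’s contents are usually incomplete or not JSON at all.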

II.2 Navigating lists

The type function tells us what kind of value the variable bkst_data holds. Its output is the name of the Python type (here, a data structure) of that value.

type(bkst_data)

Question

You have encountered a Python list before. What was the name of the variable that held a list in the Modeling Code exercise?

Notes

  • Going forward, when referring to functions and their output, we will say that the function returns something.

  • Every Python value has a defined type. Types you’ve encountered up to this point, and which you’ll learn more about in tonight’s homework, include integers, floats, strings, and lists.

  • When we use the word variable, we’re generally referring to the combination of a name and a value. The name points to the value, which is located somewhere in memory. When we say, “the variable bkst_data is a list,” we mean that the value to which the name bkst_data points is represented in Python as a list.
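To make the idea of types a little more concrete, here is a minimal sketch that calls type on one value of each kind mentioned above. The variable names and values are invented for this example and do not come from the dataset.

```python
# Made-up values, one for each type mentioned above
course_number = 1001           # an integer
price = 24.99                  # a float
title = 'Intro to Widgets'     # a string
prices = [24.99, 15.0, 8.5]    # a list

# type returns the type of the value each name points to
print(type(course_number))
print(type(price))
print(type(title))
print(type(prices))
```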

bkst_data[0]

The code above uses indexing to access the first element in the bkst_data list. When you run it, you should see data enclosed in curly braces ({}).
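Indexing works the same way on any list, so it can help to try it first on a short one. The sketch below uses a made-up list of three department codes (placeholder values, not drawn from the dataset). It also shows what happens when you ask for a position that does not exist.

```python
# A made-up list of department codes (placeholder values)
depts = ['AMST', 'HIST', 'CSCI']

print(depts[0])   # the first element: counting starts at 0
print(depts[2])   # the third (and last) element

# Asking for a position past the end of the list raises an IndexError
try:
    depts[3]
    found = True
except IndexError as err:
    found = False
    print('No element at index 3:', err)
```

Because counting starts at 0, the last valid index is always one less than the number of elements in the list.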

Try it out!

In the cell below, use indexing to look at other elements in the list. (Change the number inside the square brackets in the expression bkst_data[0].)

# Your code here

Try it out!

For an example of a course with assigned textbooks, look at the element in the 2nd position in the list (at index [1]).

Note any additional data elements that might be useful, as well as any that you have questions about.

# Your code here

II.3 Navigating nested data

The len() function returns the length of a list.

len(bkst_data)

There are 5,030 top-level elements in bkst_data. But each element contains other elements nested within it.

Here we assign one of those elements – the element in the 2nd position – to a new variable, my_course.

my_course = bkst_data[1]
my_course
type(my_course)

A single course is represented as a Python dictionary (dict) within the bkst_data list.

Dictionaries allow us to store data in fields, similar to a database. The elements on the left-hand side of the colons are called keys, and the elements on the right-hand side of the colons are called values.

Here the keys are strings; anything enclosed in quotation marks in Python is a string.

We can use the keys to retrieve the values from the dictionary.

my_dept = my_course['department']
my_dept
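Here is a small sketch of the same idea on a toy dictionary. Only the key names 'department' and 'texts' come from the dataset; the key 'course_num' and all of the values are hypothetical. The keys method, which lists every key in a dictionary, may be handy when you explore my_course.

```python
# A made-up course record. Only 'department' and 'texts' are key names
# known from the dataset; 'course_num' and all values are hypothetical.
toy_course = {
    'department': 'HIST',
    'course_num': '1011',
    'texts': []
}

# Look up a value by its key
print(toy_course['department'])

# List all of the keys in the dictionary
toy_keys = list(toy_course.keys())
print(toy_keys)

# Asking for a key that is not in the dictionary raises a KeyError
try:
    toy_course['instructor']
    has_instructor = True
except KeyError as err:
    has_instructor = False
    print('No such key:', err)
```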

Try it out!

Practice accessing other elements within my_course using keys.

# Your code here

Questions

Which pieces of information in my_course can you NOT access this way?

The textbooks associated with this course/section are stored within a nested element: a list associated with the key 'texts'.

type(my_course['texts'])

This particular course/section has three entries in the list stored under the 'texts' key.

len(my_course['texts'])

Try it out!

Use indexing to look at each of the items in the my_books variable below.

my_books = my_course['texts']
# Your code here

A note on nested data

Working with nested data can be a headache. Unfortunately, because of how databases tend to store data, there’s a lot of it out there! Here’s a rule of thumb that can help: when working with nested data, always try to keep track of the level at which you’re working. Variables can be useful for this purpose.

  1. For instance, in the code above, the variable bkst_data holds a list of courses.

  2. We point to a single course (the second) using the variable my_course. If we wanted to point to a different course, we could simply reassign the variable, e.g., my_course = bkst_data[100]. After running that bit of code, my_course would point to the 101st course in the list.

  3. The my_books variable points to the list of textbooks associated with the course in my_course (either the second course or whatever other course we assigned to it).
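The level-by-level approach described above can be sketched on a toy dataset. Everything here is made up: the data only imitates the overall shape (a list of courses, each with a 'texts' list), and the assumption that each textbook is a dictionary with a 'title' key is purely for illustration.

```python
# Toy data imitating the shape described above: a list of courses,
# each a dictionary with a 'texts' list. All values are made up, and
# the 'title' key inside each textbook is hypothetical.
toy_data = [
    {'department': 'HIST', 'texts': []},
    {'department': 'AMST', 'texts': [{'title': 'Book A'}, {'title': 'Book B'}]},
]

# Level 1: the whole list of courses
print(len(toy_data))

# Level 2: one course (a dictionary)
toy_course = toy_data[1]
print(toy_course['department'])

# Level 3: that course's list of textbooks
toy_books = toy_course['texts']
print(len(toy_books))

# Level 4: a single textbook
first_book = toy_books[0]
print(first_book['title'])

# The same value, reached in one chained expression
chained_title = toy_data[1]['texts'][0]['title']
print(chained_title)
```

The chained expression is shorter, but the named variables make it easier to see which level you are working at, and to check each step with type or len along the way.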

If you find visual analogies useful, think of our dataset as follows:

  • As a file cabinet, where each course corresponds to a separate drawer, and within each drawer, there is a folder containing sheets of paper about textbooks. (The folder may be empty.)

  • As a series of nesting dolls, or boxes inside boxes.

  • Or if you’re a Marvel fan, as the multiverse, wherein each universe may or may not have its own version of your favorite superheroes.

A note on variable names

Note that it’s purely a matter of convention that my_course uses the singular noun (“course”) because it’s pointing to a single course, and that my_books uses the plural (“books”) because it’s pointing to a list. These are helpful hints for the programmer, but they mean nothing to Python. We could name our variables spiderman and doctor_strange, if we wanted to, and they would work just as well (though we might be more prone to make mistakes!).