From Data to Code#
Objectives#
To explore a “real-world” dataset for use with Python.
To use Python to interact with data in a common, widely used format (JSON).
To explore the data structures and syntactical patterns of the Python language.
Instructions#
The following notebook has cells containing Python code for interacting with an external data file.
Before you begin, each team should assign the following roles:
Notetaker (records the outcome of team discussions)
Reporter (speaks for the team when sharing out)
Advocate (makes sure everyone has a chance to contribute)
Read the section below, Getting to Know Your Data, and as a team, do the exercises marked Try it out! and For discussion. Teams will report out to the larger group.
You’ll work through the code in the notebook as a team.
Run each cell of code in this notebook (by pressing Control+Enter on a PC, Command+Return on a Mac). The output of running the code will appear below the cell.
Each cell is accompanied by annotations and, in some cases, questions for discussion. Discuss your responses to these questions with your team. The Notetaker should briefly document the conversation (including any further questions or points of confusion that arise).
Blank cells labeled Try it out! invite you to write your own code based on the provided examples. Run your code and discuss the output with your team.
Once everyone has worked through the notebook, we’ll review the questions and your responses in the larger group.
I. Getting to Know Your Data#
I.1 Data have structure#
About the Data
Over the next few days, we’ll be working together on a dataset containing information about textbooks assigned by courses at GW for the Fall 2023 semester.
Textbooks are linked from GW’s Schedule of Classes. (Note that we’re using data for courses taught on the Foggy Bottom campus only.)
The Find Books link under each course entry (see image below) leads to a page on the GW Campus Store website that lists the textbooks for the course, along with the price of each book (through the GW Campus Store).
Try it out!
Take a moment to look at the Schedule of Classes and the linked textbook data.
As a team, discuss possible uses for data about courses and textbooks. Who are the potential users, and what purposes might they have for these data? Make notes on your answers. (We call these scenarios user stories.)
After generating a few user stories, discuss how you might organize such a dataset. Try not to focus too much on how the data are presented on the GW or GW Campus Store websites. Rather, given your user stories, and using what you know about courses and textbooks at GW, consider the following questions:
What data elements would be useful to capture?
How do those data elements relate to each other? (For example, a course might have multiple sections, and each section might have a separate instructor.)
How would you structure (organize) a dataset so as to capture those relationships?
Make a drawing that models the data structure your team discussed in answer to the previous questions.
For the curious
This dataset was obtained by scraping the GW Bookstore website. Because web scraping is a rather advanced topic, we won’t be covering that process in Python Camp.
The dataset was also pre-processed to simplify it somewhat (removing extraneous and redundant elements, etc.).
I.2 Data have a format#
When we talk about data structure, we mean how the elements of a dataset relate to each other. The structure describes something about that part of the world to which the data refer.
The data format, on the other hand, is how the dataset is represented by the computer. The same format can be used to represent many different kinds of data structures, just as data with a given structure can be represented in many different formats. (As an analogy, think about music: a Baroque fugue and a blues song have very different musical structures, but both can be recorded as an MP3, which in this case is the data format.)
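To make the distinction concrete, here is a minimal sketch that writes the same small set of records in two different formats, JSON and CSV. The records themselves (course codes, titles, prices) are made up for illustration and are not taken from the bookstore dataset.

```python
import csv
import io
import json

# Hypothetical records: two made-up textbooks, not from the real dataset.
records = [
    {"course": "HIST 1011", "title": "World History", "price": 45.0},
    {"course": "CHEM 1111", "title": "General Chemistry", "price": 120.5},
]

# Format 1: the records as JSON text.
as_json = json.dumps(records, indent=2)

# Format 2: the same records as CSV text, built in memory rather than in a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["course", "title", "price"])
writer.writeheader()
writer.writerows(records)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

The structure (two books, each with a course, a title, and a price) is the same in both printouts; only the representation differs.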
The dataset we’re using is in JSON format. JSON (usually pronounced jay-sawn) is a common format for sharing data on the web. It’s not as concise or human-readable as some formats (e.g., CSV, which is often used for sharing tabular data). But it has a few advantages that make it popular with programmers:
JSON data can be deeply nested, reflecting hierarchical relationships between data elements.
The JSON format comprises structures that map well onto the most common data structures used by modern programming languages.
The JSON syntax has a lot in common with languages like Python.
We’ll explore these three aspects of JSON today.
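As a preview of those three points, the sketch below converts a small piece of JSON text into Python values. The snippet is invented for illustration (the keys `department`, `courses`, `number`, and `texts` are assumptions, not the actual keys of the bookstore dataset), and it uses `json.loads`, which reads JSON from a string rather than from a file.

```python
import json

# A made-up JSON snippet (not taken from the bookstore dataset).
json_text = """
{
  "department": "HIST",
  "courses": [
    {"number": "1011", "texts": [{"title": "World History", "required": true}]},
    {"number": "2020", "texts": []}
  ]
}
"""

data = json.loads(json_text)

# JSON objects become Python dictionaries; JSON arrays become Python lists.
print(type(data))              # <class 'dict'>
print(type(data["courses"]))   # <class 'list'>

# Nested elements are reached by chaining keys and positions.
first_text = data["courses"][0]["texts"][0]
print(first_text["title"])     # World History
print(first_text["required"])  # True (JSON true becomes Python True)
```

Notice how closely the JSON text resembles the way Python itself writes dictionaries and lists; the main visible difference here is `true` versus `True`.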
For discussion
Open the GW bookstore dataset in your browser. As a team, discuss the following questions:
What do you notice about the structure of this dataset? How are the various elements related to one another?
How does it compare to the model you drew in the previous exercise?
What data elements seem to be always present throughout the dataset?
Are there elements that appear only sometimes? If so, make a list of those.
Which elements might be useful for the user stories you developed in the first exercise?
Do you see any elements that suggest user stories you hadn’t thought of?
Now draw the structure of this dataset, based on what you’ve observed.
II. Working with Data in Code#
II.1 Loading the dataset#
The code below will fetch data from a URL and save it locally as a file before loading the data in your notebook.
We import a Python function called urlretrieve and a Python module called json.
The urlretrieve function allows us to fetch data from a remote source and save a local copy.
The json module allows us to convert data in JSON format to Python types. (More on types below.)
We use the open function to open the file for reading. The file is called bookstore-data.json, where the .json extension indicates that this is a JSON-formatted file. (The file extension is part of the filename, like .docx for Word documents or .xlsx for Excel spreadsheets.)
The file is assigned to the temporary variable f.
We use the json.load function to read the file (f). This function is specifically designed for JSON files; it won’t work if the file does not contain data in valid JSON format.
The contents of the file, as processed by json.load, are assigned to a new variable, bkst_data.
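# Import the tools we need: urlretrieve (a function) and json (a module)
from urllib.request import urlretrieve
import json
# Download the dataset and save a local copy
urlretrieve('https://go.gwu.edu/pythoncampdata', 'bookstore-data.json')
# Open the local file and convert its JSON contents to Python types
with open('bookstore-data.json') as f:
    bkst_data = json.load(f)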
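To see what "valid JSON" means in practice, the following sketch writes two tiny files to a temporary folder, one containing valid JSON and one not, and tries to load each with json.load. The filenames and file contents are hypothetical and unrelated to the bookstore dataset; the point is only to show what happens in each case.

```python
import json
import os
import tempfile

# Work in a temporary folder so no real files are touched.
folder = tempfile.mkdtemp()
good_path = os.path.join(folder, "good.json")  # hypothetical filename
bad_path = os.path.join(folder, "bad.json")    # hypothetical filename

with open(good_path, "w") as f:
    f.write('{"title": "World History", "price": 45.0}')
with open(bad_path, "w") as f:
    f.write("title: World History")  # not valid JSON

# Loading valid JSON gives us a Python value (here, a dictionary).
with open(good_path) as f:
    good_data = json.load(f)
print(type(good_data))  # <class 'dict'>

# Loading invalid JSON raises json.JSONDecodeError.
try:
    with open(bad_path) as f:
        json.load(f)
    load_failed = False
except json.JSONDecodeError as err:
    load_failed = True
    print("Could not parse file:", err)
```

If you ever see a JSONDecodeError when loading a file, a good first step is to open the file and check whether its contents are actually JSON.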