Programming with Data#
Objectives#
To use lists, loops, and conditionals to answer basic questions about a dataset
To design data structures with specific purposes (use cases) in mind
Taking stock
At this point, you’ve covered almost all the essential parts of the Python language. In particular, you have practiced
Using integers, floats, and strings to represent numbers and text
Creating lists to store multiple pieces of information
Creating dictionaries to store information in a more structured way
Using for loops to manipulate nested data structures
Writing functions to organize code in a logical way and reduce redundancy
With these tools, you have almost everything you need to write Python code for “real-world” situations, like analyzing a dataset.
Today you’ll equip yourself with a couple more tools, and then you and your team will start to tackle some more complex problems with Python.
A reminder
But if you feel a little overwhelmed at this point, that’s to be expected! Learning a programming language, especially your first language, can be a long process, with periods of excitement alternating with periods of frustration.
As we forge ahead today and tomorrow, please keep these two things in mind:
You should be proud of yourself for having made it this far!
When you feel stuck, try to have compassion for yourself, take a deep breath, and even walk away for a bit if you need to.
Human minds are not computers. When a computer program gets stuck, it means that something’s off – an error in the code, a hardware issue, an unexpected piece of data – and the program won’t work until the problem has been fixed (usually, by outside intervention). But when the human mind feels stuck, that is usually a sign that learning is happening. Today’s frustrations pave the way for tomorrow’s epiphanies.
Learning means getting stuck, and sticking with it.
I. What if?#
Programming is a powerful tool because it can respond to different situations, provided we can anticipate them. The real world is full of variation, and useful programs are those that account for some amount of variation. After all, it would hardly be efficient if we had to re-write our code from scratch whenever we encountered a new dataset, for instance, or if we had no way of dealing with different users’ preferences.
Conversely, think of how frustrating it is to encounter an application or a website that doesn’t seem designed to let you do what it seems obvious that you should be able to do with it.
At the core of the programmer’s ability to anticipate and cope with variable user preferences, data structures, and computing environments is conditional logic.
I.1 Conditionals & Boolean values#
In a very abstract sense, a computer is a machine for implementing binary logic. In binary logic, the only values allowed are 1s and 0s, which represent True and False, respectively.
Thus, at some level, everything we do in computation can be reduced to True or False (from the computer’s perspective). But from a programmer’s perspective, this usually only matters in situations where we want the program to do different things based on certain conditions that may or may not hold. We write these conditions as Boolean expressions.
For instance, we can tell Python to compare two numbers, using the standard operators you might remember from your math courses: greater than, less than, equal to, etc.
Note that in Python, we represent equality by the double equals sign (==). A single equals sign is reserved for variable assignment.
Running the code below should return True, which is one of two special Python values called Boolean values (so named because the binary logic that computers implement was invented by the mathematician George Boole).
book_price = 55.99 # Assignment: single equals sign
book_price < 60
True
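As a further sketch, the other comparison operators follow the same pattern. The variable page_count is a hypothetical value used only for illustration; it is not part of the bookstore dataset.

```python
# Hypothetical value, used only for illustration
page_count = 320

is_long = page_count > 300        # greater than
is_short = page_count <= 100      # less than or equal to
is_exact = page_count == 320      # equality: double equals sign
is_different = page_count != 320  # inequality

print(is_long, is_short, is_exact, is_different)
```

Each of the four variables ends up holding a Boolean value, because each comparison is itself a Boolean expression.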
Try it out!
Write an expression with the book_price variable that returns False.
# Your code here
I.2 if statements#
Usually, we want to do more than evaluate whether an expression (like book_price < 60) is True or False. Usually, we want the program to take some action based on the outcome of that evaluation.
For this, we use an if statement.
To print a message if the value of the book_price variable is above a certain threshold, we can write the following:
if book_price > 100:
    print("That's an expensive textbook!")
Running the code above produces no output, at least not if book_price is assigned as above (to the float value 55.99).
Try it out!
Change the value of book_price in the preceding code cell so that the condition is True.
#Your code here
I.3 Iffy operations#
What if we want to check for books within a certain range of prices?
We can use the Boolean operator and to do this. The and keyword links two conditions: if both sub-conditions are True, then the whole condition (with the and) is also True. Otherwise, it is False.
The other common Boolean operators are or and not. or is True if at least one sub-condition is True. not flips (inverts) a condition: so not True is False.
The following truth table summarizes the basic results of using and and or with two variables, A and B. Assume both A and B refer to some Boolean expression, such as 4 < 5 (which would be True) or 1 == 2 (which would be False).
A | B | Operator | Result |
---|---|---|---|
True | True | and | True |
True | True | or | True |
True | False | and | False |
False | True | and | False |
True | False | or | True |
False | True | or | True |
False | False | and | False |
False | False | or | False |
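One way to check the truth table for yourself is to let Python compute it. The following is a minimal sketch; the variable names a, b, rows, and flipped are chosen here for illustration.

```python
# Collect every combination of two Boolean values with and / or
rows = []
for a in [True, False]:
    for b in [True, False]:
        rows.append((a, b, a and b, a or b))
        print(a, b, "| and:", a and b, "| or:", a or b)

# not inverts a single Boolean value
flipped = not True
print("not True:", flipped)
```

Each printed line corresponds to a pair of rows in the table: one for and, one for or.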
book_price = 55.99
if book_price >= 20 and book_price <= 100:
    print("Not too expensive")
In the examples above, our code either performed an action or not, depending on a single condition. We can also specify multiple conditions, only one of which will trigger an action. For example, if book_price is between 20 and 100, the following code will print Not too expensive, but it will print other messages if book_price is less than 20 or greater than 100.
if book_price >= 20 and book_price <= 100:
    print("Not too expensive")
elif book_price < 20:
    print("That's a relatively cheap textbook.")
else:
    print("That's an expensive textbook!")
Not too expensive
Notes
Here are some rules of thumb for writing if statements in Python:
You can have as many elif statements as you want, provided they follow an if statement.
The else statement is a catch-all: it will be executed if none of the preceding if or elif statements evaluates to True.
Otherwise, the first if/elif statement that is True will be executed, and all the rest will be ignored. In other words, if the conditions you’re testing for are not mutually exclusive, you should write the most specific test first.
Each if/elif/else statement ends with a colon and is followed by an indented block of code. This is the same pattern we saw with for loops.
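The rule about writing the most specific test first can be illustrated with a short sketch. The price value and the label variables below are hypothetical, chosen so that more than one condition is True at once.

```python
# Hypothetical price, chosen so that more than one condition is True
price = 150

# The broader test comes first, so the more specific branch never runs
if price > 50:
    label_broad_first = "over $50"
elif price > 100:
    label_broad_first = "over $100"
else:
    label_broad_first = "$50 or less"

# The most specific test comes first, so it gets a chance to match
if price > 100:
    label_specific_first = "over $100"
elif price > 50:
    label_specific_first = "over $50"
else:
    label_specific_first = "$50 or less"

print(label_broad_first)
print(label_specific_first)
```

In the first version, Python stops at price > 50 and never reaches the elif, even though price > 100 is also True.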
Try it out!
Take another look at our bookstore dataset as JSON. As you might recall, some but not all courses have textbooks listed. Courses that lack textbooks have a texts key that points to an empty Python list ([]).
How could you write an if statement to determine whether a given course has any textbooks associated with it in our dataset?
The first code cell below defines a variable course and assigns it to a dictionary for a course with textbooks. Run this code, then write some code in the cell below to check whether the course variable has any textbooks. Your code should print the price of the first textbook if any textbooks are listed. Otherwise, your code should print a message like "No textbooks found".
Once you’ve written your code, run the third code cell below, which re-assigns the course variable to a dictionary for a course without textbooks. Re-run your code cell to make sure that your code works for both situations.
Hint
To find the length of a list, you can use the built-in len() function.
The length of an empty list is 0, so len([]) == 0 is True.
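As a small illustration of the hint, here is a sketch that uses len() inside an if statement. The lists reading_list and empty_list are hypothetical and unrelated to the bookstore dataset.

```python
# Hypothetical lists, unrelated to the bookstore dataset
reading_list = ["Emma", "Beloved"]
empty_list = []

print(len(reading_list))
print(len(empty_list) == 0)

# Only look at the first element if the list is not empty
if len(reading_list) > 0:
    message = "First item: " + reading_list[0]
else:
    message = "Nothing to read"
print(message)
```

Checking the length first matters because asking for element 0 of an empty list raises an IndexError.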
course = {"department": "IAFF",
          "course_num": "1002",
          "section": "10",
          "instructor": "Adas",
          "texts": [{"title": "International Affairs: Theories and Practice",
                     "author": "Janice Witherspoon",
                     "publisher": "Addison/Wesley",
                     "price_display": "$100.75"}]}
#Your code here
course = {"department": "IAFF",
          "course_num": "1001",
          "section": "10",
          "instructor": "Jaffrey",
          "texts": []}
II. Loops & conditionals#
II.1 Counting with loops#
A very common use of a for loop in Python is to count, aggregate, or otherwise keep track of certain values when processing a list.
We’ve seen a version of this pattern before: in the homework, you used a loop to convert some prices from strings to floats and then to adjust each price for sales tax. You collected the adjusted prices in a new list.
In the example below, we’ll use this pattern to keep track of a single value. The value we’re tracking will simply be the number of items in the list.
We’ll use our bookstore dataset, so the first step is to read the file into Python from disk.
from urllib.request import urlretrieve
import json
urlretrieve('https://go.gwu.edu/pythoncampdata', 'bookstore-data.json')
with open('bookstore-data.json') as f:
    bkst_data = json.load(f)
A recipe for counting
To use this pattern, you usually have at least three variables to deal with:
The variable that holds your list (bkst_data)
A loop variable (course in the code below)
A variable defined before the loop that will be used to track or accumulate values. Since this loop is just counting items, we’ll call this variable counter. We set counter initially to 0, since before we run the loop, we haven’t counted any items.
Note also that counter += 1 is shorthand for counter = counter + 1. Either way of writing that statement is fine.
counter = 0
print("Counter before loop", counter)
for course in bkst_data:
    counter += 1
print("Counter after loop", counter)
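The same recipe can accumulate a running total instead of a count. The sketch below assumes a hypothetical list called page_counts; only the line inside the loop changes.

```python
# Hypothetical list of page counts
page_counts = [120, 300, 85]

total_pages = 0  # accumulator, defined before the loop
for pages in page_counts:
    total_pages += pages  # same as total_pages = total_pages + pages

print("Total pages:", total_pages)
```

Here the loop variable itself is added to the accumulator, whereas the counting version adds 1 on every pass.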
Question
After running the code above, discuss it with your team. Do you know of another way to accomplish the same thing?
Hint: think about the built-in Python functions you’ve already encountered.
II.2 Counting with conditionals#
Admittedly, there are more concise ways to calculate the length of a list than using the for loop and the counter variable above. But what if we wanted to count something a bit more complex?
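For instance, we might want to count only the items that meet some condition. The sketch below combines the counter recipe with an if statement, using a hypothetical list of titles rather than the bookstore dataset.

```python
# Hypothetical list of titles
titles = ["Emma", "Middlemarch", "Beloved", "Ulysses", "Dune"]

long_title_count = 0
for title in titles:
    # Only count titles with more than five characters
    if len(title) > 5:
        long_title_count += 1

print(long_title_count)
```

The counter goes up only on the passes through the loop where the condition is True.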
Try it out!
Building on the recipe above, can you write some code to count how many courses in our bookstore dataset have textbooks?
#Your code here
III. Working with data#
Now you have seen how lists, loops, conditionals, and variables work together to allow us to perform non-trivial computations with data. The next step is to develop your programmer’s intuition by putting these tools into practice with more complex tasks.
As a team, you should work through the exercises below. For each exercise, there’s a suggestion for both a less challenging and a more challenging approach. Before starting each exercise, each team should discuss the approach they plan to take. If everyone in the group feels up to the more challenging version of the exercise, you’re welcome to skip the less challenging version. But if not everyone feels up to it, we strongly recommend starting as a team with the less challenging approach. Extra practice never hurts!
A less challenging dataset
Our bookstore dataset has a few thousand elements and a non-trivially nested structure. For a less challenging dataset, you can use the team dataset you produced during Day 2’s activity “Describing the Team.” Assuming you saved that dataset to a JSON file on JupyterHub called team-dataset.json, the following code should open the file and store its contents in a variable called my_team.
with open("./team-dataset.json") as f:
    my_team = json.load(f)
Can’t find your team data?
If the code above raises a FileNotFoundError, you can use a prepared dataset that has more or less the same structure as your team dataset (with randomly generated names).
Copy the code below into a new code cell and run it:
urlretrieve('https://go.gwu.edu/pythoncampdata2', 'team-dataset-prepared.json')
with open('team-dataset-prepared.json') as f:
    my_team = json.load(f)
III.1 Computing averages#
We can compute an average by dividing the total of a set of values by the number of values. For instance, let’s say we have a list prices defined as follows:
prices = [100.5, 75, 85.75, 90, 55]
We can compute the average price like this:
avg_price = sum(prices) / len(prices)
Note that the sum() function works only on numeric types (integers and floats).
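If the values are stored as strings, they need to be converted before sum() can be used. The following sketch assumes hypothetical price strings formatted like the "$100.75" value seen earlier; the names price_strings, prices_as_floats, and avg_from_strings are made up for this example.

```python
# Hypothetical prices stored as strings, in the style of "$100.75"
price_strings = ["$100.75", "$20.00", "$59.25"]

# sum(price_strings) would raise a TypeError, because the items are strings.
# Convert each string to a float first, collecting the results in a new list.
prices_as_floats = []
for price_string in price_strings:
    prices_as_floats.append(float(price_string.replace("$", "")))

avg_from_strings = sum(prices_as_floats) / len(prices_as_floats)
print(avg_from_strings)
```

Removing the dollar sign before calling float() is necessary because float("$100.75") would raise a ValueError.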
Try it out!
Less Challenging
Each dictionary in your team dataset should include a key "years_at_gw". Calculate the average number of years at GW for members of your team.
More Challenging
Calculate the average cost of a textbook in the bookstore dataset.
# Your code here
Hint
For the less challenging version of the problem, you might take this approach:
Define a variable called total_years.
Use a for loop to iterate over the my_team list.
For each member in my_team (assuming member is your loop variable), add the value of member["years_at_gw"] to the total_years variable.
Outside the loop, divide the total_years variable by the length of the my_team list.
For the more challenging version, a similar approach could work, with these differences:
Each textbook’s price corresponds to the "price_display" key of a dictionary. These dictionaries are stored in a list nested under the dictionary corresponding to each course.
Because there are two levels of list – a list of courses, and under each course, a list of textbooks – you’ll probably want to use a nested for loop.
Each price is stored as a string, so you’ll want to convert it to a float before doing any math with it.
You’ll also need to keep track – perhaps in a separate variable – of the total number of textbooks. Note that len(bkst_data) will return the number of courses, not the number of textbooks.
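To show the shape of a nested loop with two accumulators, here is a sketch on made-up data. The playlists structure and its keys ("songs", "minutes") are hypothetical; they only mimic the idea of a list nested inside each element of an outer list.

```python
# Hypothetical nested data: a list of playlists, each with a list of songs
playlists = [
    {"name": "Morning", "songs": [{"title": "A", "minutes": "3.5"},
                                  {"title": "B", "minutes": "4.0"}]},
    {"name": "Empty", "songs": []},
    {"name": "Evening", "songs": [{"title": "C", "minutes": "2.5"},
                                  {"title": "D", "minutes": "5.0"}]},
]

total_minutes = 0
song_count = 0
for playlist in playlists:          # outer loop: one playlist at a time
    for song in playlist["songs"]:  # inner loop: songs within that playlist
        total_minutes += float(song["minutes"])  # convert string to float
        song_count += 1

print(song_count, total_minutes)
print(len(playlists))  # number of playlists, not number of songs
```

Note that the inner loop simply does nothing for the playlist whose list of songs is empty.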
III.2 Finding the maximum and minimum#
For this exercise, feel free to reuse and modify the code you wrote for exercise III.1.
Try it out!
Less challenging
Use your team dataset to identify the team member with the longest name (the string containing the most characters).
More challenging
Identify the most expensive and/or the least expensive textbook in the bookstore dataset.
# Your code here
Hint
This exercise is subtly different from the first. When finding the average, we were computing a single number. Here we want to find the data element that satisfies the criterion (e.g., maximum or minimum price, maximum length of name).
The following considerations may prove important:
We’ll need to keep track of both the maximum (or minimum) value we’ve seen so far (e.g., price, name length), and the textbook or team member associated with that value. One way to do that is to define two variables outside of the for loop, for instance:
max_price = 0
most_expensive_textbook = {}
Inside the for loop – or the inner for loop, if you’re working with the textbook data – you may want to use an if statement to compare the current value to the maximum value you’ve seen so far, and to update the latter accordingly.
You need to decide how to handle situations where there is more than one data element satisfying the maximum/minimum criterion. One option is to define the variable most_expensive_textbook (or longest_name) as a list, and then to use the list’s append() method to add a new element to it whenever one satisfies the criterion.
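The considerations above can be sketched on made-up data. The readings list and its "city" and "temp" keys are hypothetical, and starting max_temp at 0 assumes that every temperature in the list is positive.

```python
# Hypothetical data: temperature readings for a few cities
readings = [
    {"city": "Oslo", "temp": 12},
    {"city": "Lima", "temp": 24},
    {"city": "Cairo", "temp": 24},
    {"city": "Quito", "temp": 15},
]

max_temp = 0   # assumes all temperatures are greater than 0
warmest = []   # a list, so that ties can be kept
for reading in readings:
    if reading["temp"] > max_temp:
        # New maximum: start the list over with just this reading
        max_temp = reading["temp"]
        warmest = [reading]
    elif reading["temp"] == max_temp:
        # Tie with the current maximum: keep this reading too
        warmest.append(reading)

print(max_temp)
for reading in warmest:
    print(reading["city"])
```

Resetting the list whenever a new maximum appears is one design choice among several; it ensures that earlier, smaller values do not linger in the results.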
III.3 Restructuring Data#
When working with data, sometimes we’re not trying to find a single answer, like the average price of textbooks or the most expensive textbook. Rather, we want to transform our dataset to make it more useful in answering multiple questions.
In building your team dataset and analyzing the bookstore dataset, you’ve worked with lists of dictionaries, which is a fairly common data structure you’ll encounter in Python (and other languages). It’s probably the most intuitive way, for instance, to represent data from a spreadsheet or other tabular format.
One drawback of a list in Python is that as the list grows longer, finding an individual element becomes less efficient, because you may have to check the elements one by one. In such cases, it might be more useful to store the data in a dictionary, or even in a dictionary of dictionaries if your data happens to be nested.
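The difference between the two kinds of lookup can be sketched with a tiny, hand-written example. The pets_list and pets_dict variables are hypothetical and hold the same made-up records in both shapes.

```python
# Hypothetical records, written by hand in both shapes
pets_list = [
    {"name": "Rex", "species": "dog"},
    {"name": "Tom", "species": "cat"},
]
pets_dict = {
    "Rex": {"species": "dog"},
    "Tom": {"species": "cat"},
}

# With a list, we check elements one by one until we find a match
found = None
for pet in pets_list:
    if pet["name"] == "Tom":
        found = pet

# With a dictionary, we go straight to the value by its key
looked_up = pets_dict["Tom"]

print(found["species"], looked_up["species"])
```

With only two records the difference is negligible, but the loop has more and more elements to check as the list grows, while the dictionary lookup stays a single step in our code.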
Try it out!
Less challenging
Convert the team dataset into a Python dictionary, where each key is the team member’s name, and the associated value is the information you’ve collected about that team member.
More challenging
Imagine that you’re managing the bookstore’s data, and you get a request from GW Libraries for a dataset they can use to look up books by ISBN. (The ISBN is the International Standard Book Number; most commercially published books today have an ISBN, which serves as the book’s unique identifier.)
Create a Python dictionary out of all the textbooks in the bookstore dataset, where the keys are ISBNs, and the values contain the information about each book. If a book lacks an ISBN, you can leave it out of the dictionary.
# Your code here
Hint
To visualize the transformation of the team dataset, the following visual model might be useful:
You can accomplish this with a single for loop. For the bookstore dataset, the procedure is similar, but you will need to use a nested for loop (because textbooks are nested under courses).
Once you have created your new, dictionary-based dataset, you should be able to use it to look up items by their identifier.
For instance, if my_team_dict is a dictionary of information about your team, you should be able to write my_team_dict["Alex"] to call up the information about a team member named Alex.
And if isbn_text_dict is a dictionary of textbooks stored by ISBN, you should be able to enter something like isbn_text_dict["9782370210371"] to retrieve the information about that particular book (Regarde les Lumieres mon Amour).
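When looking up items by key, it can help to check first whether the key is present. The sketch below uses a hypothetical dictionary, items_by_id, with made-up identifiers.

```python
# Hypothetical dictionary keyed by identifier (the keys here are made up)
items_by_id = {"A100": {"title": "First item"},
               "B200": {"title": "Second item"}}

wanted = "C300"
if wanted in items_by_id:
    result = items_by_id[wanted]["title"]
else:
    # Looking up a missing key directly would raise a KeyError
    result = "Not found"

print(result)
```

The in keyword tests whether a key exists in the dictionary, so it pairs naturally with the if statements covered earlier.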
Wrap up#
Today you did the following:
Used Python dictionaries to store and update data about your team.
Used conditionals, for loops, and the “counter” pattern to answer basic questions about a dataset.
Transformed the structure of a dataset to facilitate a different kind of access.
In the homework tonight, you’ll practice some techniques for troubleshooting code when errors arise.
And tomorrow, for the final day of Python Camp, you’ll work with your team to design and write some code from scratch to address a user story that your team comes up with.