Programming with Data

Objectives#

To use lists, loops, and conditionals to answer basic questions about a dataset
To design data structures with specific purposes (use cases) in mind

Taking stock

At this point, you’ve covered almost all the essential parts of the Python language. In particular, you have practiced

Using integers, floats, and strings to represent numbers and text
Creating lists to store multiple pieces of information
Creating dictionaries to store information in a more structured way
Using for loops to manipulate nested data structures
Writing functions to organize code in a logical way and reduce redundancy

With these tools, you have almost everything you need to write Python code for “real-world” situations, like analyzing a dataset.

Today you’ll equip yourself with a couple more tools, and then you and your team will start to tackle some more complex problems with Python.

A reminder

But if you feel a little overwhelmed at this point, that’s to be expected! Learning a programming language, especially your first language, can be a long process, with periods of excitement alternating with periods of frustration.

As we forge ahead today and tomorrow, please keep these two things in mind:

You should be proud of yourself for having made it this far!
When you feel stuck, try to have compassion for yourself, take a deep breath, and even walk away for a bit if you need to.

Human minds are not computers. When a computer program gets stuck, it means that something’s off – an error in the code, a hardware issue, an unexpected piece of data – and the program won’t work until the problem has been fixed (usually, by outside intervention). But when the human mind feels stuck, that is usually a sign that learning is happening. Today’s frustrations pave the way for tomorrow’s epiphanies.

Learning means getting stuck, and sticking with it.

I. What if?#

Programming is a powerful tool because it can respond to different situations, provided we can anticipate them. The real world is full of variation, and useful programs are those that account for some amount of variation. After all, it would hardly be efficient if we had to re-write our code from scratch whenever we encountered a new dataset, for instance, or if we had no way of dealing with different users’ preferences.

Conversely, think of how frustrating it is to encounter an application or a website that doesn’t seem designed to let you do what it seems obvious that you should be able to do with it.

At the core of the programmer’s ability to anticipate and cope with variable user preferences, data structures, and computing environments is conditional logic.

I.1 Conditionals & Boolean values#

In a very abstract sense, a computer is a machine for implementing binary logic. In binary logic, the only values allowed are 1’s and 0’s, which represent True and False, respectively.

Thus, at some level, everything we do in computation can be reduced to True or False (from the computer’s perspective). But from a programmer’s perspective, this usually only matters in situations where we want the program to do different things based on certain conditions that may or may not hold. We write such conditions as Boolean expressions: expressions that evaluate to either True or False.

For instance, we can tell Python to compare two numbers, using the standard operators you might remember from your math courses: greater than, less than, equal to, etc.

Note that in Python, we represent equality by the double equals sign (==). A single equals sign is reserved for variable assignment.

Running the code below should return True, which is one of two special Python values called Boolean values (so named because the binary logic that computers implement was invented by the mathematician George Boole).

book_price = 55.99 # Assignment: single equals sign
book_price < 60

True
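The other comparison operators follow the same pattern. The short sketch below uses a hypothetical variable, page_count (not part of the bookstore dataset), and stores the result of each comparison in a variable so that the results can be printed side by side.

```python
page_count = 320  # hypothetical value, used only for illustration

# Each comparison evaluates to a Boolean value: True or False
is_fewer = page_count < 400        # less than
is_more = page_count > 400         # greater than
is_equal = page_count == 320       # equal to (double equals sign)
is_different = page_count != 320   # not equal to
is_at_most = page_count <= 320     # less than or equal to

print(is_fewer, is_more, is_equal, is_different, is_at_most)
```

Running this should print True False True False True.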

Try it out!

Write an expression with the book_price variable that returns False.

# Your code here

I.2 `if` statements#

Usually, we want to do more than evaluate whether an expression (like book_price < 60) is True or False. Usually, we want the program to take some action based on the outcome of that evaluation.

For this, we use an if statement.

To print a message if the value of the book_price variable is above a certain threshold, we can write the following:

if book_price > 100:
    print("That's an expensive textbook!")

Running the code above produces no output, at least not if book_price is assigned as above (to the float value 55.99).

Try it out!

Change the value of book_price in the preceding code cell so that the condition is True.

# Your code here

I.3 Iffy operations#

What if we want to check for books within a certain range of prices?

We can use the Boolean operator and to do this. The and keyword links two conditions: if both sub-conditions are True, then the whole condition (with the and) is also True. Otherwise, it is False.

The other common Boolean operators are or and not. or is True if at least one sub-condition is True. not flips (inverts) a condition: so not True is False.

The following truth table summarizes the basic results of using and and or with two variables, A and B. Assume both A and B refer to some Boolean expression, such as 4 < 5 (which would be True) or 1 == 2 (which would be False).

A	B	Operator	Result
True	True	and	True
True	True	or	True
True	False	and	False
False	True	and	False
True	False	or	True
False	True	or	True
False	False	and	False
False	False	or	False
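If you'd like to check the truth table yourself, one possible sketch is to loop over every combination of True and False and let Python compute the results. The names a, b, and rows are chosen only for this illustration.

```python
rows = []

# Try every combination of A and B
for a in [True, False]:
    for b in [True, False]:
        # Store A, B, the result of `and`, and the result of `or`
        rows.append((a, b, a and b, a or b))

for row in rows:
    print(row)

# `not` inverts a single Boolean value
print(not True)       # False
print(not (1 == 2))   # True, because 1 == 2 is False
```

Each printed row shows A, B, A and B, and A or B, in that order.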

book_price = 55.99
if book_price >= 20 and book_price <= 100:
    print("Not too expensive")
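As an aside, Python also lets you chain comparisons, which can make a range test read more like the mathematical notation. The sketch below compares the two ways of writing the same test; the variable names holding the results are chosen just for this example.

```python
book_price = 55.99

# Two comparisons joined with `and`
in_range_with_and = book_price >= 20 and book_price <= 100

# The same test written as a chained comparison
in_range_chained = 20 <= book_price <= 100

print(in_range_with_and, in_range_chained)
```

Both versions give the same result, so which one you use is a matter of taste.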

In the examples above, our code either performed an action or did nothing, depending on a single condition. We can also test several conditions in sequence, so that exactly one of several actions is performed. For example, if book_price is between 20 and 100 (inclusive), the following code will print Not too expensive, but it will print other messages if book_price is less than 20 or greater than 100.

if book_price >= 20 and book_price <= 100:
    print("Not too expensive")
elif book_price < 20:
    print("That's a relatively cheap textbook.")
else:
    print("That's an expensive textbook!")

Not too expensive

Notes

Here are some rules of thumb for writing if statements in Python:

You can have as many elif statements as you want, provided they follow an if statement.
The else statement is a catch-all: it will be executed if none of the preceding if or elif statements evaluates to True.
Otherwise, the first if/elif statement that is True will be executed, and all the rest will be ignored. In other words, if the conditions you’re testing for are not mutually exclusive, you should write the most specific test first.
Each if/elif/else statement ends with a colon and is followed by an indented block of code. This is the same pattern we saw with for loops.
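To see why the order of the tests matters, here is a small sketch with two conditions that are not mutually exclusive (a price above 100 is also above 50). The thresholds and labels are made up for this illustration.

```python
book_price = 150

# Broad test first: the narrower test below it can never be reached
if book_price > 50:
    label_broad_first = "Pricey"
elif book_price > 100:
    label_broad_first = "Very pricey"
else:
    label_broad_first = "Affordable"

# Most specific test first: each branch can now be reached
if book_price > 100:
    label_specific_first = "Very pricey"
elif book_price > 50:
    label_specific_first = "Pricey"
else:
    label_specific_first = "Affordable"

print(label_broad_first)
print(label_specific_first)
```

The first version prints Pricey even though the price is above 100, because Python stops at the first condition that is True.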

Try it out!

Take another look at our bookstore dataset as JSON. As you might recall, some but not all courses have textbooks listed. Courses that lack textbooks have a texts key that points to an empty Python list ([]).

How could you write an if statement to determine whether a given course has any textbooks associated with it in our dataset?

The first code cell below defines a variable course and assigns it to a dictionary for a course with textbooks. Run this code, then write some code in the cell below to check whether the course variable has any textbooks. Your code should print the price of the first textbook if any textbooks are listed. Otherwise, your code should print a message like "No textbooks found".

Once you’ve written your code, run the third code cell below, which re-assigns the course variable to a dictionary for a course without textbooks. Re-run your code cell to make sure that your code works for both situations.

Hint

To find the length of a list, you can use the built-in len() function.

The length of an empty list is 0, so len([]) == 0 is True.
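Here is a quick illustration of the hint, using two hypothetical lists that are not part of the bookstore dataset:

```python
# Hypothetical lists, used only to illustrate len()
empty_list = []
two_titles = ["Title A", "Title B"]

print(len(empty_list))        # number of items in the empty list
print(len(two_titles))        # number of items in the second list
print(len(empty_list) == 0)   # is the first list empty?
print(len(two_titles) > 0)    # does the second list have any items?
```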

course = {"department": "IAFF",
          "course_num": "1002",
          "section": "10",
          "instructor": "Adas",
          "texts": [{"title": "International Affairs: Theories and Practice",
                     "author": "Janice Witherspoon",
                     "publisher": "Addison/Wesley",
                     "price_display": "$100.75"}]}

# Your code here

course = {"department": "IAFF",
          "course_num": "1001",
          "section": "10",
          "instructor": "Jaffrey",
          "texts": []}

II. Loops & conditionals#

II.1 Counting with loops#

A very common use of a for loop in Python is to count, aggregate, or otherwise keep track of certain values when processing a list.

We’ve seen a version of this pattern before: in the homework, you used a loop to convert some prices from strings to floats and then to adjust each price for sales tax. You collected the adjusted prices in a new list.

In the example below, we’ll use this pattern to keep track of a single value. The value we’re tracking will simply be the number of items in the list.

We’ll use our bookstore dataset, so the first step is to download the file and read it into Python.

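from urllib.request import urlretrieve
import json

# Download the dataset and save it locally as bookstore-data.json
urlretrieve('https://go.gwu.edu/pythoncampdata', 'bookstore-data.json')

# Open the saved file and parse the JSON into Python lists and dictionaries
with open('bookstore-data.json') as f:
    bkst_data = json.load(f)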

A recipe for counting

To use this pattern, you usually have at least three variables to deal with:

The variable that holds your list (bkst_data)
A loop variable (course in the code below)
A variable defined before the loop that will be used to track or accumulate values. Since this loop is just counting items, we’ll call this variable counter. We set counter initially to 0, since before we run the loop, we haven’t counted any items.

Note also that counter += 1 is shorthand for counter = counter + 1. Either way of writing that expression is fine.

counter = 0
print("Counter before loop", counter)
for course in bkst_data:
    counter += 1
print("Counter after loop", counter)
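To convince yourself that the two ways of writing the update are equivalent, you could run a small sketch like the one below. It uses a short made-up list instead of the bookstore data, and the names counter_a, counter_b, and total are chosen just for this example. The last part shows that the same shorthand works for adding up values, not just for counting by one.

```python
counter_a = 0
counter_b = 0

for item in ["x", "y", "z"]:
    counter_a += 1               # shorthand form
    counter_b = counter_b + 1    # long form

print(counter_a, counter_b)

# The same pattern can accumulate a running total
total = 0
for amount in [10, 20, 30]:
    total += amount

print(total)
```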

Question

After running the code above, discuss it with your team. Do you know of another way to accomplish the same thing?

Hint: think about the built-in Python functions you’ve already encountered.

II.2 Counting with conditionals#

Admittedly, there are more concise ways to calculate the length of a list than using the for loop and the counter variable above. But what if we wanted to count something a bit more complex?
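For instance, the counter recipe can be combined with an if statement so that only some items are counted. The sketch below counts how many prices in a small made-up list are above a threshold; the list, the threshold, and the variable names are assumptions for illustration only.

```python
prices = [100.5, 75, 85.75, 90, 55]   # made-up prices
threshold = 80

over_threshold = 0
for price in prices:
    # Only count the prices that satisfy the condition
    if price > threshold:
        over_threshold += 1

print(over_threshold)
```

Here the counter ends up at 3, because three of the five prices are above 80.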

Try it out!

Building on the recipe above, can you write some code to count how many courses in our bookstore dataset have textbooks?

# Your code here

III. Working with data#

Now you have seen how lists, loops, conditionals, and variables work together to allow us to perform non-trivial computations with data. The next step is to develop your programmer’s intuition by putting these tools into practice with more complex tasks.

As a team, you should work through the exercises below. For each exercise, there’s a suggestion for both a less challenging and a more challenging approach. Before starting each exercise, each team should discuss the approach they plan to take. If everyone in the group feels up to the more challenging version of the exercise, you’re welcome to skip the less challenging version. But if not everyone feels up to the former, we strongly recommend starting as a team with the less challenging approach. Extra practice never hurts!

A less challenging dataset

Our bookstore dataset has a few thousand elements and a non-trivially nested structure. For a less challenging dataset, you can use the team dataset you produced during Day 2’s activity “Describing the Team.” Assuming you saved that dataset to a JSON file on JupyterHub called team-dataset.json, the following code should open the file and store its contents in a variable called my_team.

with open("./team-dataset.json") as f:
    my_team = json.load(f)

Can’t find your team data?

If the code above raises a FileNotFoundError, you can use a prepared dataset that has more or less the same structure as your team dataset (with randomly generated names).

Copy the code below into a new code cell and run it:

urlretrieve('https://go.gwu.edu/pythoncampdata2', 'team-dataset-prepared.json')
with open('team-dataset-prepared.json') as f:
    my_team = json.load(f)

III.1 Computing averages#

We can compute an average by dividing the sum of a set of values by the number of values. For instance, let’s say we have a list prices defined as follows:

prices = [100.5, 75, 85.75, 90, 55]

We can compute the average price like this:

avg_price = sum(prices) / len(prices)

Note that the sum() function works only on numeric types (integers and floats).
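The sketch below puts these pieces together and also shows what happens when sum() is given strings instead of numbers. The try/except block is only there so that the example keeps running after the error; price_strings and the other new names are made up for this illustration.

```python
prices = [100.5, 75, 85.75, 90, 55]
avg_price = sum(prices) / len(prices)
print(avg_price)

# sum() raises a TypeError when the list contains strings
price_strings = ["100.5", "75"]
sum_failed = False
try:
    sum(price_strings)
except TypeError as error:
    sum_failed = True
    print("sum() failed:", error)

# Converting each string to a float first makes the sum possible
converted_total = 0
for price_string in price_strings:
    converted_total += float(price_string)

print(converted_total)
```

The average of the five prices is 81.25, and the converted strings add up to 175.5.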

Try it out!

Less Challenging

Each dictionary in your team dataset should include a key "years_at_gw". Calculate the average number of years at GW for members of your team.

More Challenging

Calculate the average cost of a textbook in the bookstore dataset.

# Your code here

Hint

For the less challenging version of the problem, you might take this approach:

Define a variable called total_years.
Use a for loop to iterate over the my_team list.
For each member in my_team (assuming member is your loop variable), add the value of member["years_at_gw"] to the total_years variable.
Outside the loop, divide the total_years variable by the length of the my_team list.

For the more challenging version, a similar approach could work, with these differences:

Each textbook’s price corresponds to the "price_display" key of a dictionary. These dictionaries are stored in a list nested under the dictionary corresponding to each course.
Because there are two levels of list – a list of courses, and under each course, a list of textbooks – you’ll probably want to use a nested for loop.
Each price is stored as a string, so you’ll want to convert it to a float before doing any math with it.
You’ll also need to keep track – perhaps in a separate variable – of the total number of textbooks. Note that len(bkst_data) will return the number of courses, not the number of textbooks.
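For the conversion step, one possible approach is sketched below. It assumes that prices look like "$100.75", possibly with a comma as a thousands separator; the real dataset may contain other formats, so treat this as a starting point. The function name price_to_float is hypothetical.

```python
def price_to_float(price_display):
    """Convert a display price such as "$100.75" to a float.

    Assumes the only non-numeric characters are "$" and ",".
    """
    cleaned = price_display.replace("$", "").replace(",", "")
    return float(cleaned)

print(price_to_float("$100.75"))
print(price_to_float("$1,250.00"))
```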

III.2 Finding the maximum and minimum#

For this exercise, feel free to reuse and modify the code you wrote for exercise III.1.

Try it out!

Less challenging

Use your team dataset to identify the team member with the longest name (the string containing the most characters).

More challenging

Identify the most expensive and/or the least expensive textbook in the bookstore dataset.

# Your code here

Hint

This exercise is subtly different from the first. When finding the average, we were computing a single number. Here we want to find the data element that satisfies the criterion (e.g., maximum or minimum price, maximum length of name).

The following considerations may prove important:

We’ll need to keep track of both the maximum (or minimum) value we’ve seen so far (e.g., price, name length), and the textbook or team member associated with that value. One way to do that is to define two variables outside of the for loop, for instance:

max_price = 0
most_expensive_textbook = {}

Inside the for loop – or the inner for loop, if you’re working with the textbook data – you may want to use an if statement to compare the current value to the maximum value you’ve seen so far, and to update the latter accordingly.
You need to decide how to handle situations where there is more than one data element satisfying the maximum/minimum criterion. One option is to define the variable most_expensive_textbook (or longest_name) as a list, and then to use the list’s append() method to add a new element to it whenever one satisfies the criterion.
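The considerations above can be sketched on a small, flat, made-up list (no nesting, and with prices already stored as floats). The list books and its contents are assumptions for illustration; your own code will need to adapt the idea to the structure of the real data.

```python
# Made-up data: a flat list of dictionaries with numeric prices
books = [
    {"title": "Book A", "price": 40.0},
    {"title": "Book B", "price": 95.5},
    {"title": "Book C", "price": 95.5},
    {"title": "Book D", "price": 12.25},
]

max_price = 0
most_expensive = []   # a list, so that ties can all be kept

for book in books:
    if book["price"] > max_price:
        # New maximum: remember it and start the list over
        max_price = book["price"]
        most_expensive = [book]
    elif book["price"] == max_price:
        # Tie with the current maximum: keep this book too
        most_expensive.append(book)

print(max_price)
for book in most_expensive:
    print(book["title"])
```

Note that starting from 0 works for a maximum but not for a minimum, since no price is lower than 0. For a minimum, one option is to start from float("inf"), which is larger than any price.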

III.3 Restructuring Data#

When working with data, sometimes we’re not trying to find a single answer, like the average price of textbooks or the most expensive textbook. Rather, we want to transform our dataset to make it more useful in answering multiple questions.

In building your team dataset and analyzing the bookstore dataset, you’ve worked with lists of dictionaries, which is a fairly common data structure you’ll encounter in Python (and other languages). It’s probably the most intuitive way, for instance, to represent data from a spreadsheet or other tabular format.

One drawback of a list in Python is that, as the list grows longer, finding an individual element becomes less efficient. In such cases, it might be more useful to store the data in a dictionary, or even in a dictionary of dictionaries, if your data happens to be nested!
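To make the difference concrete, here is a minimal sketch comparing the two kinds of lookup. The records are a tiny made-up stand-in, and the names courses_list and courses_dict are chosen just for this example.

```python
# Tiny made-up stand-in for a dataset
courses_list = [
    {"course_num": "1001", "instructor": "Jaffrey"},
    {"course_num": "1002", "instructor": "Adas"},
]

# With a list, we check elements one by one, looking for a match
found_in_list = None
for record in courses_list:
    if record["course_num"] == "1002":
        found_in_list = record

# With a dictionary keyed by course number, we go straight to the record
courses_dict = {
    "1001": {"course_num": "1001", "instructor": "Jaffrey"},
    "1002": {"course_num": "1002", "instructor": "Adas"},
}
found_in_dict = courses_dict["1002"]

print(found_in_list == found_in_dict)
```

Both lookups find the same record, but the dictionary version does not need a loop.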

Try it out!

Less challenging

Convert the team dataset into a Python dictionary, where each key is the team member’s name, and the associated value is the information you’ve collected about that team member.

More challenging

Imagine that you’re managing the bookstore’s data, and you get a request from GW Libraries for a dataset they can use to look up books by ISBN. (The ISBN is the International Standard Book Number; most commercially published books today have an ISBN, which serves as the book’s unique identifier.)

Create a Python dictionary out of all the textbooks in the bookstore dataset, where the keys are ISBNs, and the values contain the information about each book. If a book lacks an ISBN, you can leave it out of the dictionary.

# Your code here

Hint

To visualize the transformation of the team dataset, the following visual model might be useful:

Figure: a diagram with two larger boxes, each with three smaller boxes nested inside. Each smaller box contains labels and the information associated with each label. In the first larger box, the smaller boxes are labeled with numbers (0, 1, 2), representing items in a list. In the second, they are labeled with names (Alex, Max, Josh), representing a dictionary. A large red arrow points from the list box to the dictionary box.

You can accomplish this with a single for loop. For the bookstore dataset, the procedure is similar, but you will need to use a nested for loop (because textbooks are nested under courses).

Once you have created your new, dictionary-based dataset, you should be able to use it to look up items by their identifier.

For instance, if my_team_dict is a dictionary of information about your team, you should be able to write my_team_dict["Alex"] to call up the information about a team member named Alex.

And if isbn_text_dict is a dictionary of textbooks stored by ISBN, you should be able to enter something like isbn_text_dict["9782370210371"] to retrieve the information about that particular book (Regarde les Lumieres mon Amour).
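Looking up a key that is not in the dictionary raises a KeyError, so it can be useful to check first. The sketch below uses a one-entry stand-in for isbn_text_dict (only the title is filled in) and a made-up ISBN that is not in it.

```python
# One-entry stand-in for the full dictionary
isbn_text_dict = {
    "9782370210371": {"title": "Regarde les Lumieres mon Amour"},
}

missing_isbn = "0000000000000"   # made-up ISBN, not in the dictionary

# Option 1: test for the key with `in` before looking it up
if missing_isbn in isbn_text_dict:
    print(isbn_text_dict[missing_isbn])
else:
    print("No book found for ISBN", missing_isbn)

# Option 2: the get() method returns None instead of raising a KeyError
result = isbn_text_dict.get(missing_isbn)
print(result)
```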

Wrap up#

Today you did the following:

Used Python dictionaries to store and update data about your team.
Used conditionals, for loops, and the “counter” pattern to answer basic questions about a dataset.
Transformed the structure of a dataset to facilitate a different kind of access.

In the homework tonight, you’ll practice some techniques for troubleshooting code when errors arise.

And tomorrow, for the final day of Python Camp, you’ll work with your team to design and write some code from scratch to address a user story that your team comes up with.
