From Code to Data

Objectives

  • To review the basic data types and syntactic patterns of Python.

  • To build more complex structures out of these basic data types.

  • To explore tools for cleaning and transforming data.

Agenda & Instructions

This notebook is intended for you to work through independently, in order to review and clarify the concepts introduced on Python Camp Day 1, and to lay the groundwork for the activities on Python Camp Day 2. However, feel free to collaborate with others in working through it. It is also intended to serve as a resource you can return to and review as needed.

In this homework, you will cover most of the fundamental tools of the Python programming language that you need to work with data. You’ve already encountered these tools in today’s team activities: Modeling Code and From Data to Code. But these homework exercises will unpack their syntax and explain how to use them. Along the way, you’ll cover the following concepts:

  • Working with numbers in Python

  • Working with text (strings) in Python

  • Slicing and splitting strings to extract textual data

  • Working with lists of numbers or strings

If you have little or no prior programming experience, you may find this homework challenging. But please try to work through all the exercises, even if you lose the thread or can’t make sense of the provided sample solutions. Just make a note of where you got stuck: we will review the homework and address your questions tomorrow in class.

How to Use this Notebook

  1. Read the documentation above each cell containing code and run the cell (Ctrl+Enter or Cmd+Return) to view the output.

  2. Follow the prompts labeled Try it out! that ask you to write your own code in the provided blank cells.

  3. (Hidden) solutions to these exercises follow the blank cells; click the toggle bar to expand the solution to compare with your approach.

  4. Some prompts include alternative exercises (Parsons Problems) that will be linked from the prompt. These alternatives may help clarify concepts (especially if you find yourself struggling to keep up with all the syntax).

  5. Optional annotations (labeled For the curious...) provide additional explanation and/or context for those who want them. Feel free to skip these sections if you like. As a beginner, it’s important to maintain a balanced cognitive load: taking in too much information all at once can impede your progress toward understanding. This balance looks different for everyone, but we have tried to keep the main content focused on a few key concepts, tools, and techniques, while providing that additional context for those who might benefit from it.

I. Numbers and text

All computer programs, and all data consumed by computers, translate into sequences of electronic pulses that can be represented in binary code (1’s and 0’s).

But for our purposes – and the purposes of much programming – the most basic elements of data we deal with are numbers and text. That means that the data we’re actually interested in – the prices of textbooks, for instance, or the names of courses in the GW course catalog – will be represented in one of these two forms.

I.1 Working with numbers

We can represent a single numeric value in Python – like the price of a textbook, or the enrollment in a course – by typing the number directly, without using quotation marks.

Here we create two variables to store two different numeric values.

book_price = 99.95
num_students = 55

Try it out!

Because the values are numeric, we can use these variables in calculations. In the cell below, create a new variable called total_cost that represents book_price multiplied by num_students. (Hint: in Python, we use the asterisk (*) to do multiplication.) Use the print() function to display the result.

# Your code here

Now check your answer by expanding the hidden solution cell below.

total_cost = book_price * num_students
print(total_cost)

I.2 Flavors of numbers

In Python, numbers come in two main flavors: integers and floats. We can use Python’s type() function to expose the type of any value or variable.

print(type(book_price))      # print() ensures both results are displayed
print(type(num_students))

A float is any number that contains a decimal point. An integer (int) is what we call a whole number.

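Unlike some programming languages, Python handles the conversion between these two types automatically: when a calculation combines an integer with a float, Python converts the integer to a float and returns a float.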

Question

Can you guess what type total_cost will be? Run the following cell to find out.

type(total_cost)
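
To make the automatic conversion concrete, here is a minimal sketch with made-up values (the names ticket_price and num_tickets are illustrative and not part of the bookstore data). Combining an integer with a float yields a float, while arithmetic on two integers stays an integer, except for division with /, which always produces a float.

```python
# Illustrative values only; these names do not appear elsewhere in the notebook.
ticket_price = 12.5     # a float
num_tickets = 4         # an int

mixed = ticket_price * num_tickets   # float * int
whole = num_tickets + 2              # int + int
ratio = num_tickets / 2              # division with / always gives a float

print(type(mixed))   # <class 'float'>
print(type(whole))   # <class 'int'>
print(type(ratio))   # <class 'float'>
```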

I.3 Don’t be scared of quotes

Above we saw that Python allows us to perform calculations involving both integers and floats. Things are not quite so seamless with non-numeric types. Run the following code: instead of valid output, you should see an error message.

book_price = '$99.95'
tax = book_price * .1
print(book_price + tax)

The TypeError informs us that we can’t perform multiplication in this instance because book_price is not of the right type.

Notes

We saw prices represented this way – "$99.95" – in the bookstore dataset. The quotation marks indicate that this value is a string. As far as Python is concerned, anything between quotation marks is a string.

You can think of the quotation marks as a container: whatever you put into the container will be treated as an instance of the str type.

You can use either single ('') or double ("") quotation marks: Python doesn’t care. But in any given string, the quotation marks must match: a string that starts with a double quote and ends with a single quote will produce an error.
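
If you want to do arithmetic with a price stored as a string, one possible approach (a sketch, not necessarily the method used in the team activities) is to remove the dollar sign and convert what remains with Python's built-in float() function. The names price_text and price_number are hypothetical, and replace() is a string method; methods are introduced later in this notebook.

```python
price_text = '$99.95'                      # a string, like the prices in the bookstore dataset

# Remove the dollar sign, then convert the remaining characters to a float.
price_number = float(price_text.replace('$', ''))

tax = price_number * 0.1                   # multiplication now works
print(type(price_number))
print(price_number + tax)
```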

Try it out!

Create two variables to hold the title and author of a textbook. The title and author can be anything you like; the variable names should be book_title and book_author, respectively.

# Your code here

Now check your answer by expanding the hidden solution cell below.

book_title = 'Organic Chemistry'
book_author = 'John P. Bunsen'

I.4 Strings vs. variable names

Strings in Python can consist of any sequence of Unicode characters between quotation marks. That includes letters, numbers, characters from non-Roman scripts (like Arabic or Chinese), even emojis. If you can type it on your keyboard, you can probably represent it in a Python string.

Python variable names, by contrast, are much more restrictive.

  1. They must begin with a letter (which can include characters from non-Roman alphabets) or an underscore.

  2. They cannot contain spaces.

  3. They may contain numerals (but not at the beginning of the name).

  4. The only permitted punctuation in a variable name is the underscore (_).

The following are some examples of valid and invalid Python variable names.

| Name | Valid? | Reason |
|---|---|---|
| my_name | Yes | |
| book_title_2 | Yes | |
| price$ | No | Uses punctuation not allowed |
| 2nd_book | No | Begins with a number |
| course year | No | Spaces not allowed |
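
One way to check a candidate name without triggering an error is the string method isidentifier(), which reports whether a piece of text follows Python's naming rules. This is only a quick sketch; note that isidentifier() also returns True for reserved words such as class, which still cannot be used as variable names.

```python
# Each candidate name is written as a string so that we can test it safely.
print('my_name'.isidentifier())        # True
print('book_title_2'.isidentifier())   # True
print('price$'.isidentifier())         # False
print('2nd_book'.isidentifier())       # False
print('course year'.isidentifier())    # False
```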

I.5 How to do things with strings: Slicing

When working with textual data, keep in mind that Python doesn’t know anything about the meaning of what’s inside the quotes. It has no concept of words, punctuation, etc. – all the stuff that we as humans rely on to communicate effectively (elements of so-called natural languages).

A Python string is just a collection of characters. Imagine spelling out words in Scrabble, or with wooden alphabet blocks.

That said, Python strings are also surprisingly flexible. Python provides a lot of tools to make working with them easy, starting with the fact that each character in the string has a well-defined position, which we call the index. We can use the indices of characters to extract information from parts of a string.

The following code defines two string variables that hold information about a course and a term.

course = 'CHEM 1002 10'
term = 'Summer 2023'

What if we want to extract the department code, the course number, and the section number from the course variable?

By counting characters, we can see the following:

  • The department code occupies the first four (4) index positions. With strings, the first position is labeled 0, not 1, so the first 4 characters would fall in positions 0, 1, 2, and 3.

  • The course number occupies four more positions, but we also have to account for the intervening space: 4 (the space), then 5, 6, 7, 8.

Table showing the characters in the string "CHEM 1002 10" arranged in sequence on one row, with the numbers 0 through 11 on the row above.

We can use this information to slice our course variable as follows:

dept_code = course[0:4]         # Positions 0 through 3
course_num = course[5:9]        # Positions 5 through 8
print(dept_code)
print(course_num)

Notes

Note that in slicing a string, we provide – inside square brackets – the first index (counting from 0, not 1), followed by a colon, followed by the last index plus one. The colon means up to BUT NOT including.
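
A quick way to confirm the positions is to look up single characters and to measure lengths. The sketch below redefines course so that it runs on its own; len() is a built-in function that counts the characters in a string.

```python
course = 'CHEM 1002 10'

print(course[0])         # C  (the first character is at index 0)
print(course[3])         # M  (the last character of the department code)
print(len(course))       # 12 (indices run from 0 through 11)

# The length of a slice is the end index minus the start index: 9 - 5 = 4.
print(len(course[5:9]))  # 4
```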

Try it out!

Create a variable called section_num to hold the two-digit section number at the end of the course string, and print the variable.

# Your code here

Now check your answer by expanding the hidden solution cell below.

section_num = course[10:12]
print(section_num)

Because it’s common to want to slice off the first part of a string and take the rest of it up to the end, we could actually write the above as follows, leaving off the number after the colon:

section_num = course[10:]

The [10:] means: start at index 10 (the 11th character, since counting begins at 0) and extract everything up to AND including the last character of the string.
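
The same shortcut works at the other end: leaving off the number before the colon means start at the beginning of the string. A minimal sketch:

```python
course = 'CHEM 1002 10'

print(course[:4])                  # CHEM
print(course[:4] == course[0:4])   # True: both slices give the same result
print(course[10:])                 # 10
```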

What if we don’t know the exact position of the characters we want to extract?

Look at the term variable as defined above.

term = 'Summer 2023'

Let’s say that we expect the term data to consist of the name for a particular semester – Summer, Spring, or Fall – plus a 4-digit year.

Do you see the problem here? The name of the term can be either 4 characters long ("Fall") or 6 ("Summer", "Spring").

Fortunately, Python has us covered, because it lets us count backwards as well as forwards when slicing a string. We just use negative numbers! Since there’s no such number as -0, the last character in a string has the position -1.

Table showing the characters in the string "Summer 2023" arranged in sequence on one row, with the numbers -11 through -1 on the row above.

To extract the last 4 characters (the year) from the term variable, we can use this slice:

term_year = term[-4:]
print(term_year)

We can also mix negative and positive positions in slicing. To start from the first character and slice up to (but not including) the fifth character from the end, we could use the following slice:

term_name = term[0:-5]
print(term_name)
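
Because negative positions are counted from the end, the same two slices work no matter how long the term name is. The sketch below uses a second, made-up value (other_term is a hypothetical variable) to show this.

```python
other_term = 'Fall 2023'

print(other_term[-4:])    # 2023
print(other_term[0:-5])   # Fall
```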

I.6 How to do things with strings: Splitting

Strings in Python come with a lot of built-in functionality. One of the most useful is a string method called split().

In Python, a method is a function that belongs to a particular data type and defines something that values of that type can do. We refer to split() as a “string method” because it’s available on anything that Python recognizes as a string.

You can think of a method as an appliance attachment: a vacuum cleaner and a kitchen stand mixer can both come with various attachments; the vacuum attachments help you use your vacuum better, and your mixer attachments expand the functionality of the mixer. The same can be said of Python methods: they help us make better use of their associated data types.

Take a look at what happens when we call the split() method on our course variable.

course.split()

Let’s pause here and note a few things:

  1. There is a period (.) between course and split(). This indicates that the split() method belongs to our course variable. We get access to this method whenever we use a string value.

  2. The parentheses after the word split are required. We’ve seen parentheses in calling the print() and type() functions, too. Here the parens are empty because we’re not providing any arguments to split(). (Python knows what we want to split because of the period attaching the method to the course variable.)

  3. The output from course.split() is a list of three separate strings. A Python list is enclosed with square brackets and contains items separated by commas.

Try it out!

Use split() on the term variable (defined above) and compare the output with that of course.split().

Can you tell how split() works? How does it know where to separate the string?

# Your code here

Now check your answer by expanding the hidden solution cell below.
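
One possible answer, as a minimal sketch: with no arguments, split() breaks the string wherever it finds whitespace (spaces, tabs, or line breaks) and returns the pieces in a list. The last lines go a step further and pass a separator as an argument; the comma-separated string used there is made up for illustration.

```python
term = 'Summer 2023'
course = 'CHEM 1002 10'

print(term.split())      # ['Summer', '2023']
print(course.split())    # ['CHEM', '1002', '10']

# Passing an argument tells split() which character(s) to separate on.
terms = 'Spring,Summer,Fall'
print(terms.split(','))  # ['Spring', 'Summer', 'Fall']
```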

II. Strings & lists

We have seen that when we call the split() method on a string, Python returns a list of strings. (The original string itself is left unchanged.)

With this conversion, we get some added structure. A string can hold any sequence of valid Unicode characters; a list can hold any sequence of valid Python values. Even other lists!
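
To illustrate, here is a small sketch of a list that mixes several types, including another list. The name mixed_info and its values are invented for this example.

```python
mixed_info = ['CHEM', 1002, 99.95, ['Summer', '2023']]

print(type(mixed_info[0]))   # <class 'str'>
print(type(mixed_info[1]))   # <class 'int'>
print(type(mixed_info[3]))   # <class 'list'>

# Two sets of square brackets reach inside the inner list.
print(mixed_info[3][0])      # Summer
```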

In the team activity with the bkst_data dataset, you encountered some ways to work with lists. The following reviews those and adds a few more.

II.1 Accessing items

We can use an integer inside square brackets to access the item at a single position (index) within a list.

course_info = course.split()
print(course_info[0])

Just as with strings, negative indexing works, too. The -1 index gives us the last item in the list.

print(course_info[-1])

Try it out!

Slicing works, too.

Use the slicing syntax you learned above to extract the first two items from the course_info variable.

# Your code here

Now check your answer by expanding the hidden solution cell below.

print(course_info[0:2])

Notes

If you didn’t get two items, remember that in slicing, the number on the left side of the colon represents the index we want to start with, and the number on the right side represents the index one after the index we want to end with.
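
Keep in mind that the items produced by split() are still strings, even when they look like numbers. If you need to calculate with them, one option is the built-in int() function, sketched below (course_number is a hypothetical variable name).

```python
course = 'CHEM 1002 10'
course_info = course.split()

print(type(course_info[1]))            # <class 'str'>

course_number = int(course_info[1])    # convert the string '1002' to an integer
print(course_number + 1)               # 1003
```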

Wrapping up

Congratulations! This homework covered a lot of material.

  1. We learned about various Python types, which allow us to represent (computationally) raw data in various ways.

  2. We saw that different types have different uses – or we might say, different behaviors: we can do addition and multiplication with integers and floats, we can split and slice strings, and we can also slice lists.

  3. With integers, floats, strings, and lists, we can capture complex data using the tools of the Python language. In the team exercise From Data to Code, we saw how a dataset in the JSON format translates into these Python types (along with one additional type, the dictionary, which you will meet in tomorrow’s activity).