Extra Practice for Python Camp

Agenda & Instructions

This notebook provides a series of exercises designed as extra practice for the material covered in the daily lessons and the homework. You are by no means required to work through these exercises, but doing so may prove helpful in a couple of situations:

  • If you want to review a particular concept in order to solidify your understanding.

  • If you want to build more “muscle memory” by writing code beyond the opportunities provided by the group activities and homework.

The exercises below are arranged by concept, largely following the order in which the concepts were presented in the course materials. The code in each section is self-contained, so feel free to jump around and/or pick only certain sections to do.

Answers to the challenges in each section are provided in a collapsed (Solution) cell at the bottom of the section.

How to Use this Notebook

  1. Read the documentation above each cell containing code and run the cell (Ctrl+Enter or Cmd+Return) to view the output.

  2. Follow the prompts labeled Try it out! that ask you to write your own code in the provided blank cells.

  3. (Hidden) solutions to these exercises follow the blank cells; click the toggle bar to expand the solution to compare with your approach.

  4. Some prompts include alternative exercises (Parsons Problems) that will be linked from the prompt. These alternatives may help clarify concepts (especially if you find yourself struggling to keep up with all the syntax).

  5. Optional annotations (labeled For the curious...) provide additional explanation and/or context for those who want them. Feel free to skip these sections if you like. As a beginner, it’s important to maintain a balanced cognitive load: taking in too much information all at once can impede your progress toward understanding. This balance looks different for everyone, but we have tried to keep the main content focused on a few key concepts, tools, and techniques, while providing that additional context for those who might benefit from it.

Integers, floats & strings

Li X. is a graduate student working in a bioinformatics lab, and he is learning how to process protein and DNA sequences using Python. His first task is to compute the lengths of various sequences, three examples of which are given below. Write some code to compute the length of each of these sequences.

seq1 = "LYLIFGAWAGMVGTALSLLIRAELGQPGTLLGDDQIYNVIVTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASSTVEAGAGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGVSSILGAINFITTAINMKPPTLSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIL"
seq2 = "VGTALXLLIRAELXQPGALLGDDQIYNVVVTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLMASSTVEAGAGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGISSILGAINFITTAINMKPPALSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIL"
seq3 = "WAGMVGTALSLLIRAELGQPGALLGDDQIYNVVXTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLMASSTVEAGVGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGISSILGAINFITTAINMKPPALSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIL"
#Your code here

Now Li needs to find the average length of a collection of sequences, seq1 through seq3 above, in addition to the following two sequences. Write some code to find the average length of all five sequences.

seq4 = "VGTALSLLIRAELGQPGTLLGDDQIYNVIVTAHAFVMIFFMVMPVMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASSTVEAGAGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGVSSILGAINFITTAINMKPPALSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIL"
seq5 = "LYLIFGAWAGMVGTALSLLIRAELGQPGALLGDDQVYNVVVTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASSTVEAGVGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGISSILGAINFITTAINMKPPALSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIX"
#Your code here

The lab director is pleased with Li’s progress and gives him another task. The lab is trying out some new software, and this software requires that each sequence used as input be prefixed with a number representing the length of the sequence, followed by a space. So for seq1 above, the input would look as follows (the sequence has been abbreviated for display purposes):

231 LYLIFGAWAGMVGTALSLL....

Write some code to add the sequence length (and a space) to the beginning of each of the sequences given above.

#Your code here
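As a minimal sketch of the idea, here is the same kind of operation on a short, made-up string (the names `demo_seq` and `prefixed` are hypothetical and not part of Li's data). Because `len()` returns an integer, it needs to be converted with `str()` before it can be joined to a string with `+`.

```python
# A short, made-up sequence used only for illustration
demo_seq = "MKVLA"

# len() returns an integer; str() converts it so it can be concatenated
prefixed = str(len(demo_seq)) + " " + demo_seq
print(prefixed)  # 5 MKVLA
```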

So far Li really likes working with Python; he can see how it will make his work in the lab a lot more efficient. One thing that confuses him, however, is that whole numbers in Python are represented sometimes as one-place decimals, e.g., 30.0, and sometimes without the decimal part, e.g., 30. Surely Python isn’t being arbitrary? Comparing the output of the following two lines of code, what can you tell about the kinds of situations in which the decimal part appears when dealing with whole numbers?

len(seq1)
len(seq2) / 10
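One way to explore this question is to ask Python for the type of each result with the built-in `type()` function. The sketch below uses a made-up four-character string; the names `whole` and `divided` are hypothetical.

```python
whole = len("ACGT")        # len() counts the characters in the string
divided = len("ACGT") / 2  # the / operator performs division

print(whole, type(whole))      # 4 <class 'int'>
print(divided, type(divided))  # 2.0 <class 'float'>
```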

Working with strings (1)

Alia M. is a social scientist working with data on residential housing patterns. She has a large datafile of residential properties in the United States, consisting of street addresses and some information about the type of housing at each address. For each row in the datafile, the first part consists of a string that represents the property’s street address, including city, state abbreviation, and zip code. The address is preceded by a seven-character unique identifier. One such string is given below.

address1 = "AGF5670 2123 N. 3rd St., St. Cloud, MN 56301"

Alia’s first task is to extract the unique identifier (AGF5670 in the example above) from each string. How can she use a string slice to accomplish this task? Write some code that extracts the identifier from the string stored in the address1 variable.

#Your code here

Next Alia would like to extract the five-digit zip code from the address string. (The zip code is 56301 in the example provided.) What code can she use to do that? (Assume that all zip codes in the datafile consist of five digits.)

#Your code here

Great work! This code is really going to make Alia’s life easier – no more manual data entry for her! The next challenge, as you might guess, is to extract the US postal code for the state. Assume that the addresses in this datafile all have a two-letter abbreviation (MN, for Minnesota, in the example above).

#Your code here

Finally, Alia needs to extract the street address itself, together with the city name: e.g., 2123 N. 3rd St., St. Cloud. The length of this portion of the address varies throughout the datafile, so ideally, your code would work for any string fitting the above pattern, regardless of length.

#Your code here
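Negative indices count from the end of a string, which is what lets a slice ignore how long the middle portion is. The following is a minimal sketch on a made-up record with the same general shape as Alia's strings; `demo_record` and the other names are hypothetical.

```python
# A made-up record: 7-character ID, free text, 2-letter code, 5-digit code
demo_record = "XYZ1234 10 Elm Rd., Springfield, IL 62701"

demo_id = demo_record[:7]          # the first seven characters
demo_zip = demo_record[-5:]        # the last five characters
demo_state = demo_record[-8:-6]    # the two characters before " 62701"
demo_middle = demo_record[8:-10]   # everything between the ID and ", IL 62701"

print(demo_id)      # XYZ1234
print(demo_zip)     # 62701
print(demo_state)   # IL
print(demo_middle)  # 10 Elm Rd., Springfield
```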

To make sure your code works properly on addresses of different lengths, try extracting the different parts of address2 below.

address2 = "WGA9753 412 Mockingbird Lane, Crowley, LA 70506"
#Your code here

Working with strings (and lists)

Social scientist Alia has some additional data points in her file that she wishes to extract. These data consist of two alphanumeric codes and a four-digit year. The first element indicates the type of dwelling, the second indicates whether the property is a rental or occupant-owned, and the third indicates the year in which the property was built. The following example shows the string for a single-family rental property built in 1986.

address_info1 = "SF,R,1986"

Alia is feeling pretty confident about Python string slicing, so her first approach is to write three slice expressions to extract the information between the commas in the string above. Can you write those expressions below, assigning each part of the string (without the commas) to a new variable?

#Your code here

Alia is pleased with herself until she realizes that not all of the codes in the datafile are of the same length. The code in the first position can be either “SF” (single-family), “MF” (multi-family), or “Mixed” (for properties with a mix of residential and commercial spaces). Likewise, the second code can be either “R” (rental), “O” (owned), or “n/a” (a null value, used when the information was not available).

Since two of the three elements in this string are of variable lengths, it’s not easy to write slice expressions that will extract all three elements. Fortunately, Alia has learned about the str.split() method. Write some code that will split the address_info1 string into three parts, and assign each part to a new variable.

#Your code here
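For comparison, here is a small sketch of `str.split()` on a made-up, comma-separated string (the variable names are hypothetical). Splitting returns a list, and each element can then be picked out by its index, no matter how long it is.

```python
demo_info = "red,large,2020"

# Split on the comma; the commas themselves are discarded
parts = demo_info.split(",")
print(parts)  # ['red', 'large', '2020']

# Each piece is available by position
color = parts[0]
size = parts[1]
year = parts[2]
```

Note that every element of the result is a string, including `'2020'`.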

Now verify that your approach works for strings with elements of different lengths: address_info2 and address_info3 below.

address_info2 =  "MF,n/a,1996"
#Your code here
address_info3 = "Mixed,R,2007"
#Your code here

“That split method is pretty cool!” Alia thinks. She wonders whether it would be a good idea to split the address strings as well (see the introduction to Working with Strings (1)), instead of slicing them.

Given the address1 variable below, what happens if you split it on the white space? Do you think this would be a good approach for separating out the elements of the string: the seven-character identifier, the street address and city name (of variable length), the two-letter state abbreviation, and the five-digit zip code? Why or why not?

address1 = "AGF5670 2123 N. 3rd St., St. Cloud, MN 56301"
#Your code here

Using the result of splitting address1, can you extract the identifier, state abbreviation, and zip code?

#Your code here

Bonus: Now Alia is having trouble working with the street address and city name in the result of splitting address1. A colleague mentioned the str.join() method, which does the inverse operation of str.split().

Can you use list slicing and the join method to produce a single string corresponding to the street address and city name in address1, starting from the result of splitting the latter?
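As a rough illustration of how the two pieces fit together, the sketch below splits a made-up string on white space, takes a slice of the resulting list, and joins that slice back into a single string. The names `demo_words` and `middle` are hypothetical.

```python
demo_words = "alpha beta gamma delta".split()
print(demo_words)  # ['alpha', 'beta', 'gamma', 'delta']

# join() is called on the separator string and takes a list of strings
middle = " ".join(demo_words[1:3])
print(middle)  # beta gamma
```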

Working with lists

Priya J.’s office at the university has surveyed students about the need for additional resources and support for learning programming languages. Having recently learned about Python herself, Priya would like to use it to analyze the survey results. She has managed to load the survey data from a CSV file using Python. Each row of the survey responses is a Python list.

One such list is shown below. The elements, in order from left to right, comprise the following data points:

  • Respondent’s email address

  • Department code (of the respondent’s major or field of study)

  • Respondent’s status (undergraduate or graduate student)

  • Respondent’s degree (BA, MA, PhD, etc.)

  • Respondent’s expected year of graduation

  • A string representing one or more programming languages the respondent has studied, separated by commas

  • Five numbers corresponding to Likert-scale responses to a series of questions (each response being a number between 1 and 10)

response1 = ["george.washington@gwu.edu", "Engl", "UND", "BA", 2025, "Python,R", 5, 6, 4, 7, 8]

From Python Camp, Priya knows that she can use indexing to access individual elements in a list, but she’s a little fuzzy on the details. Can you help her out? In the cells below, write some code to access the following elements:

  • The respondent’s email address

  • The respondent’s status ("UND" in the example provided)

  • The respondent’s expected year of graduation

#Your code here
#Your code here
#Your code here
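As a quick refresher, list indices start at 0, and negative indices count from the end of the list. The sketch below uses a made-up row (`demo_row` is hypothetical) that is shorter than Priya's.

```python
demo_row = ["someone@example.com", "HIST", "GRAD", "MA", 2026]

first = demo_row[0]   # the first element has index 0
third = demo_row[2]   # the third element has index 2
last = demo_row[-1]   # -1 refers to the final element

print(first, third, last)  # someone@example.com GRAD 2026
```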

Priya needs to perform some data cleanup on the responses, as well as a preliminary analysis. First, she wants to anonymize the results, replacing each email address with a unique, random identifier. The standard Python installation includes a handy library for generating such identifiers called uuid. In the following code, Priya imports the uuid1() function, which will generate a single identifier.

from uuid import uuid1

In the cell below, write some code that assigns the result of calling the uuid1() function to the first position in the response1 list, replacing the email address.

If your code works as intended, the response1 list should look something like the following:

[UUID('ed603b16-a8c7-11ee-90be-ee5daa8b5d7a'), 'Engl', 'UND', 'BA', 2025, 'Python,R', 5, 6, 4, 7, 8]
#Your code here
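Assigning to an index replaces whatever element was stored at that position. Here is a minimal sketch of that idea on a made-up list (`demo_row` is hypothetical); the identifier you see will differ each time the code runs.

```python
from uuid import uuid1

demo_row = ["someone@example.com", "HIST", "GRAD"]

# Replace the element at index 0 with a newly generated identifier
demo_row[0] = uuid1()
print(demo_row)
```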

Priya notices that the values for the possible department codes (the second element in the response1 list) were not coded consistently on the survey form. Some of them are given in uppercase (e.g., "CHEM"), while some are given in title case ("Engl"). To fix this problem, Priya decides to use the str.upper() method to convert every department code to uppercase.

Write some code to replace the second element of the response1 list, which is a string, with its uppercase version, using the .upper() method available on the string. (For example, if dept1 is a variable holding a string, dept1.upper() will return a new string in which every letter of the original has been converted to uppercase.)

#Your code here
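Because `.upper()` returns a new string rather than changing the original, the result has to be assigned back to the list position. A small sketch on a made-up list (`demo_row` is hypothetical):

```python
demo_row = ["id-1", "Hist", "GRAD"]

# Read the element, convert it, and store the result in the same position
demo_row[1] = demo_row[1].upper()
print(demo_row)  # ['id-1', 'HIST', 'GRAD']
```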

Now on to some analysis. Priya would like to compute the average of the five response scores (the last five elements of the list) and add this value to the end of the list. That way, each row of responses will contain its average score as its final element.

Create a new variable, average1, and assign it the average of the five scores in response1. Hint: Priya has learned about the sum() function, which takes as its argument a list of numbers and returns their sum. To find the average score, divide the sum of scores by the number of scores (5).

#Your code here
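One possible way to get at the scores is to slice them off the end of the list and pass the slice to `sum()`. The sketch below assumes a made-up row with only three scores; `demo_row`, `demo_scores`, and `demo_average` are hypothetical names.

```python
demo_row = ["id-1", "HIST", 4, 8, 6]

demo_scores = demo_row[-3:]          # the last three elements: [4, 8, 6]
demo_average = sum(demo_scores) / 3  # 18 / 3
print(demo_average)  # 6.0
```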

Bonus. Now Priya needs to add the value in average1 to the end of the list. She remembers learning about the list.append() method; when called with an argument on a Python list, .append() will add its argument to the end of the list. When Priya runs the following code, it produces no errors.

response1 = response1.append(average1)

But then when she runs print(response1), instead of displaying her list with the average included, she gets None. What happened to her list?

Can you tell what went wrong? How should Priya append average1 to the response1 list?

#Your code here
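To see what is going on, it can help to capture the value that `.append()` returns and print it alongside the list. A sketch with a made-up list (`demo_list` and `result` are hypothetical):

```python
demo_list = [1, 2, 3]

# append() changes demo_list in place and returns None
result = demo_list.append(4)

print(result)     # None
print(demo_list)  # [1, 2, 3, 4]
```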

Working with dictionaries

Priya has tried a different method of loading the survey responses from the CSV file, such that each row of responses is now a Python dictionary. The following cell defines one such dictionary.

r1_dict = {"email": "martha.washington@gwu.edu", 
             "department": "BISC", 
             "status": "GRAD", 
             "degree": "PhD", 
             "year": 2028, 
             "languages": "Python,R,C++", 
             "q1": 3, 
             "q2": 4, 
             "q3": 6, 
             "q4": 2, 
             "q5": 3} 

In the cell below, write some code to access the values associated with the "degree", "languages" and "q1" keys. Assign each value to a new variable.

#Your code here

The "languages" key is associated with a string representing one or more languages studied by the respondent. (If the respondent has studied no programming languages, the string “n/a” is used as a null value.) When multiple languages are present, they are separated by commas. Priya would like to convert this string to a list.

Write some code to split() the value of the "languages" key in r1_dict on the commas, and assign the result back to the same key.

#Your code here
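Assigning to an existing key replaces the value stored under that key. Here is a minimal sketch on a made-up dictionary (`demo_dict` is hypothetical):

```python
demo_dict = {"name": "sample", "tags": "x,y,z"}

# Read the string, split it, and store the resulting list under the same key
demo_dict["tags"] = demo_dict["tags"].split(",")
print(demo_dict)  # {'name': 'sample', 'tags': ['x', 'y', 'z']}
```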

Previously, Priya calculated the average of all five question scores and appended the result to the end of the list that represented the survey response. Since she’s using a dictionary now, she can add the average as its own key.

Write some code to calculate the average of the values associated with "q1" through "q5" and add that result to the r1_dict dictionary, using a key called "average_score".

#Your code here
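Assigning to a key that does not exist yet adds a new key-value pair to the dictionary. The sketch below shows this with a made-up dictionary of three numbers; the names `demo_dict` and `"mean"` are hypothetical.

```python
demo_dict = {"a": 2, "b": 4, "c": 9}

# Look up each value by key, add them, and divide by the number of values
demo_dict["mean"] = (demo_dict["a"] + demo_dict["b"] + demo_dict["c"]) / 3
print(demo_dict)  # {'a': 2, 'b': 4, 'c': 9, 'mean': 5.0}
```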

Loops & lists: Summarizing a list

In Integers, Floats, & Strings, bioinformatics student Li calculated the average length of a number of protein sequences, where each sequence was represented by a separate Python string variable (seq1, seq2, seq3, etc.). Li’s colleague pointed out that this approach will become cumbersome when dealing with large amounts of data: typically, the number of sequences to be analyzed runs to the hundreds or thousands, so it’s not feasible to create a separate variable to hold the string representing each sequence.

Instead, Li’s colleague proposes that he use a Python list to hold the sequences (a list of strings), and a for loop to do the calculation. After a little research, Li understands that he can use the loop to add up the lengths of all the sequences in the list, and he can divide the total length (of sequences) by the length of the list itself (the number of sequences).

The following code defines such a list of sequences:

sequences = ["FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF", 
            "KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM",
            "EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK",
            "MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK",
            "EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL",
            "SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR",
            "FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI",
            "SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF",
            "SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM",
            "KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK",
            "FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK"]

Note that in the list above, a single variable – sequences – is defined to hold the entire list of strings. To access each string in the list, we can either use indexing (see Working with Lists), or we can write a loop.

The loop variable in a for loop gives us access to each element in the list in order. For example, in the code below, the seq variable would first get assigned the string beginning with FQTWEEF, then the string beginning with KYRTWEEF, and so on, for every string in the list:

for seq in sequences:
    print(seq[:8])  # show the first eight characters of each sequence

For loops are a very powerful construct for dealing with a large amount of data that needs to be processed in the same way.

In this case, the task is to add up the lengths of all the sequences. Li realizes that in addition to the loop variable, he’ll need to use another variable to keep track of the running total of lengths. Such a variable is defined below.

total_length = 0

Write a loop for Li that will add the length of each string in sequences to the total_length variable.

#Your code here
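The general pattern here is sometimes called an accumulator: a variable that starts at zero before the loop and grows a little on each pass. A minimal sketch with a made-up list of short words (`demo_words` and `demo_total` are hypothetical):

```python
demo_words = ["kiwi", "fig", "banana"]

demo_total = 0  # the running total starts at zero, outside the loop
for word in demo_words:
    demo_total = demo_total + len(word)  # add this word's length

print(demo_total)  # 13
```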

Now using total_length, find the average length of sequences in the list. (Hint: you can find the number of elements in a list by taking its length.)

#Your code here

Bonus: Can you modify the loop code from above in order to find the length of the longest sequence in the sequences list? Hint: Python has a built-in max() function that, when given two arguments, will return the larger of the two: e.g., max(1, 2) returns 2.

#Your code here
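The same loop structure works for tracking a "largest so far" value instead of a running total. Here is a sketch of that variation on a made-up list (`demo_words` and `longest` are hypothetical):

```python
demo_words = ["kiwi", "fig", "banana"]

longest = 0  # the largest length seen so far
for word in demo_words:
    # keep whichever is larger: the previous best or this word's length
    longest = max(longest, len(word))

print(longest)  # 6
```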

Loops & lists: Transforming a list

Social scientist Alia would like to extract the seven-character unique identifiers from a list of addresses, where each address is a string starting with the identifier.

The following code defines such a list.

addresses = ["UP63659 93 North 9th Street, Brooklyn, NY 11211",
            "LH06225 380 Westminster St, Providence, RI 02903",
            "AJ30648 177 Main Street, Littleton, MA 03561",
            "QL46276 202 Harlow St, Bangor, ME 04401",
            "QQ72122 46 Front Street, Waterville, ME 04901",
            "PY84657 22 Sussex St, Hackensack, NJ 07601",
            "CJ55552 75 Oak Street, Patchogue, NY 11772",
            "FU76399 1 Clinton Ave, Albany, NY 12207",
            "ZK38387 7242 Route 9, Plattsburgh, NY 12901",
            "JF67659 520 5th Ave, Mckeesport, NY 15132",
            "KJ13496 122 W 3rd Street, Greensburg, NY 15601",
            "HW78470 901 University Dr, State College, PA 16801",
            "PL44372 240 W 3rd St, Williamsport, PA 17701",
            "NJ95036 41 N 4th St, Allentown, PA 18102"]

Alia knows that she can extract the identifier from each string by using a slice, such that if the variable addr holds the string "UP63659 93 North 9th Street, Brooklyn, NY 11211", then the expression addr[:7] will give the string "UP63659".

How can Alia write a for loop to perform this slice for every string in the addresses list, storing the results (the identifiers) in a separate list called identifiers?

(She knows that the first step is to define the new identifiers list, which is empty to start with.)

identifiers = []

(She also knows that she needs to use the .append() method, which is available on any Python list, to add each identifier to the identifiers list.)

#Your code here
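The pattern of building a new list from an existing one can be sketched as follows, using made-up strings and a two-character slice. All names here (`demo_codes`, `prefixes`, `code`) are hypothetical.

```python
demo_codes = ["AB-one", "CD-two", "EF-three"]

prefixes = []  # start with an empty list
for code in demo_codes:
    prefixes.append(code[:2])  # add the first two characters of each string

print(prefixes)  # ['AB', 'CD', 'EF']
```

The original list is left unchanged; only the new list grows.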

Nested data 1: Loops, lists, & dictionaries

Data analyst Priya needs to calculate the average response score for each of five questions on her survey of GW students and programming support. As described in Working with Dictionaries, Priya has loaded a datafile using Python such that each survey response is represented by a dictionary. The complete set of responses is represented by a list of dictionaries, with each dictionary having the same keys but different values. A sample of the dataset is shown below.

responses = [{"id": "27a20f28-ab40-11ee-ac72-acde48001122",  "department": "GEOG",  "status": "GRAD",  "degree": "MA",  "year": 2025,  "languages": "Python,R",  "q1": 7,  "q2": 6,  "q3": 1,  "q4": 1,  "q5": 6},
 {"id": "27a2100e-ab40-11ee-ac72-acde48001122",  "department": "GEOG",  "status": "UND",  "degree": "BA",  "year": 2026,  "languages": "Python",  "q1": 1,  "q2": 4,  "q3": 1,  "q4": 9,  "q5": 3},
 {"id": "27a21068-ab40-11ee-ac72-acde48001122",  "department": "CHEM",  "status": "GRAD",  "degree": "PhD",  "year": 2025,  "languages": "Python,R,C++",  "q1": 2,  "q2": 7,  "q3": 5,  "q4": 3,  "q5": 4},
 {"id": "27a210a4-ab40-11ee-ac72-acde48001122",  "department": "ECON",  "status": "UND",  "degree": "BA",  "year": 2024,  "languages": "R",  "q1": 7,  "q2": 4,  "q3": 8,  "q4": 5,  "q5": 2},
 {"id": "27a210e0-ab40-11ee-ac72-acde48001122",  "department": "PSYC",  "status": "UND",  "degree": "BA",  "year": 2028,  "languages": "Python",  "q1": 1,  "q2": 1,  "q3": 4,  "q4": 8,  "q5": 9}]

For each response, the numeric scores for the five questions are stored under the keys "q1" through "q5". Can you write some code to calculate the average score for each question, across all five responses in the responses list?

Hint: you’ll need to use a for loop, combined with one or more variables, defined outside the loop, to store the running totals for each question’s scores.

#Your code here
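Here is a rough sketch of the running-totals idea on a much smaller, made-up dataset with only two numeric keys. The names `demo_rows`, `x_total`, and so on are hypothetical.

```python
demo_rows = [{"x": 2, "y": 10},
             {"x": 4, "y": 20},
             {"x": 6, "y": 30}]

# One running total per key, defined before the loop starts
x_total = 0
y_total = 0
for row in demo_rows:
    x_total = x_total + row["x"]
    y_total = y_total + row["y"]

# Divide each total by the number of rows
x_average = x_total / len(demo_rows)
y_average = y_total / len(demo_rows)
print(x_average, y_average)  # 4.0 20.0
```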

Priya is interested in possible correlations between respondents’ answers to the five survey questions and the number of programming languages the respondents have studied.

The response data include, for each respondent, a string that consists of one or more programming languages, separated by commas. Priya knows that by using the str.split() method with an argument of "," (see Working with Dictionaries), she can obtain a list of programming languages for each respondent, and she reasons that by taking the length of that list, she can obtain an integer representing the number of languages reported by that respondent.

Can you write some code to add a "num_languages" key to each dictionary in the responses list, where the value of the key corresponds to the number of languages listed under the "languages" key?

#Your code here
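Since the loop variable refers to each dictionary in turn, assigning to a key on the loop variable changes the dictionary inside the list. A minimal sketch with made-up data (`demo_rows`, `"tags"`, and `"num_tags"` are hypothetical):

```python
demo_rows = [{"tags": "a,b,c"}, {"tags": "a"}]

for row in demo_rows:
    # split the string into a list, then count the elements of that list
    row["num_tags"] = len(row["tags"].split(","))

print(demo_rows)
# [{'tags': 'a,b,c', 'num_tags': 3}, {'tags': 'a', 'num_tags': 1}]
```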

Bonus: If you find that your code for computing the averages seems rather redundant, with multiple lines doing almost the same thing, can you identify any ways to reduce the redundancy? (Shorter code is not necessarily better code, but situations where code gets repetitive do tend to multiply opportunities for programmer error.)

#Your code here
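One possible approach, among several, is to keep the running totals in a dictionary and loop over the key names, so that a single line does the adding for every key. The sketch below assumes a small, made-up dataset; `demo_rows`, `demo_keys`, `totals`, and `averages` are hypothetical names.

```python
demo_rows = [{"x": 2, "y": 10},
             {"x": 4, "y": 20},
             {"x": 6, "y": 30}]
demo_keys = ["x", "y"]

totals = {"x": 0, "y": 0}  # one running total per key
for row in demo_rows:
    for key in demo_keys:
        totals[key] = totals[key] + row[key]

averages = {}
for key in demo_keys:
    averages[key] = totals[key] / len(demo_rows)

print(averages)  # {'x': 4.0, 'y': 20.0}
```

Adding another key then only requires extending `demo_keys` and `totals`, rather than writing more lines inside the loop.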

Nested data 2: Creating a list of dictionaries

In analyzing nucleotide sequences, it’s common to count the number of bases of each type – A, G, C, T – that appear in the sequence. A list of such sequences is defined below, where each sequence is a string consisting of the same four characters but in different combinations.

sequences = ["ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatattctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc",
             "ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatattctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc",
             "ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatattctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtctatgcatcgatcgacgactg",
             "tgcaccaaacatgtctaaagctggaaccaaaattactttctttgaagacaaaaactttcaaggccgccactatgacagcgattgcgactgtgcagatttccacatgtacctgagccgctg",
             "caactccatcagagtggaaggaggcacctgggctgtgtatgaaaggcccaattttgctgggtacatgtacatcctaccccggggcgagtatcctgagtaccagcactggatgggcctcaa"]

Li wants to analyze the sequences in the sequences list. For each sequence, he needs to record how many times each base occurs in the sequence. Since each nucleotide sequence consists of the same four bases, one approach would be to create a list of dictionaries, where each dictionary would contain a key for each base. For example, a single sequence might be represented by the following dictionary:

{"a": 24,
"c": 55,
"g": 33,
"t": 28}

The values in the dictionary record the number of occurrences of each base. The following is Li’s logical plan for implementation:

  1. Define a new, empty list to hold the base counts for each sequence.

  2. Loop over the list of sequences.

  3. For each sequence, define a new dictionary to hold the base counts, initializing each base to 0.

  4. Within the outer loop, loop over the letters (the bases) in the sequence itself. (Looping over a string gives each character in turn.)

  5. For each base, increment the value associated with that base’s key by 1.

  6. Add the dictionary of base counts to the list.

Can you write some code to implement Li’s plan?

#Your code here
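The shape of such a plan, a loop inside a loop with a fresh dictionary created on each pass of the outer loop, can be sketched on made-up strings that use only the letters a, b, and c. The names below are hypothetical.

```python
demo_strings = ["abba", "cab"]

demo_counts = []  # will hold one dictionary per string
for text in demo_strings:
    counts = {"a": 0, "b": 0, "c": 0}  # a fresh dictionary for this string
    for letter in text:
        counts[letter] = counts[letter] + 1  # add 1 to this letter's count
    demo_counts.append(counts)

print(demo_counts)
# [{'a': 2, 'b': 2, 'c': 0}, {'a': 1, 'b': 1, 'c': 1}]
```

This sketch assumes every character is one of the three keys; any other character would raise a KeyError.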

Nested data 3: Loops and conditionals

Using Python, Alia has transformed her dataset into a list of dictionaries, such that each dictionary contains data about a particular residential property, including its address and the year in which the dwelling was built. The dataset now looks like the following example:

address_data = [{"street_address": "93 North 9th Street",
  "city": "Brooklyn",
  "state": "NY",
  "zip": "11211",
  "year_built": 1995},
 {"street_address": "380 Westminster St",
  "city": "Providence",
  "state": "RI",
  "zip": "02903",
  "year_built": 1979},
 {"street_address": "177 Main Street",
  "city": "Littleton",
  "state": "MA",
  "zip": "03561",
  "year_built": 1992},
 {"street_address": "202 Harlow St",
  "city": "Bangor",
  "state": "ME",
  "zip": "04401",
  "year_built": 1964},
 {"street_address": "46 Front Street",
  "city": "Waterville",
  "state": "ME",
  "zip": "04901",
  "year_built": 2007},
 {"street_address": "22 Sussex St",
  "city": "Hackensack",
  "state": "NJ",
  "zip": "07601",
  "year_built": 1983},
 {"street_address": "75 Oak Street",
  "city": "Patchogue",
  "state": "NY",
  "zip": "11772",
  "year_built": 1996},
 {"street_address": "1 Clinton Ave",
  "city": "Albany",
  "state": "NY",
  "zip": "12207",
  "year_built": 1961},
 {"street_address": "7242 Route 9",
  "city": "Plattsburgh",
  "state": "NY",
  "zip": "12901",
  "year_built": 1932},
 {"street_address": "520 5th Ave",
  "city": "Mckeesport",
  "state": "NY",
  "zip": "15132",
  "year_built": 1958},
 {"street_address": "122 W 3rd Street",
  "city": "Greensburg",
  "state": "NY",
  "zip": "15601",
  "year_built": 1976},
 {"street_address": "901 University Dr",
  "city": "State College",
  "state": "PA",
  "zip": "16801",
  "year_built": 1939},
 {"street_address": "240 W 3rd St",
  "city": "Williamsport",
  "state": "PA",
  "zip": "17701",
  "year_built": 1934},
 {"street_address": "41 N 4th St",
  "city": "Allentown",
  "state": "PA",
  "zip": "18102",
  "year_built": 1981}]

Alia would like to do some statistics on the ages of the dwellings in her dataset, and she’d like to group the analysis by state. Ideally, she’d like to be able to select a state and see a list of the years in which dwellings were built. So using the example above, "PA" would correspond to 1939, 1934, and 1981.

A reasonable structure would be a dictionary in which the keys are strings corresponding to the state abbreviations, and the values are lists of years. The entry for Pennsylvania would look as follows:

{...,
 "PA": [1939, 1934, 1981],
 ...}

Her logical plan is as follows:

  1. Define an empty dictionary to hold the new data; call it years_by_state.

  2. Loop over the address_data list, each element of which should be a dictionary with a "state" key and "year_built" key.

  3. Get the value of the "state" key from the current address (the loop variable).

  4. This value becomes the key to use with the years_by_state dictionary.

  5. Get the value of the "year_built" key from the current address (the loop variable).

  6. This becomes the value to add to the years_by_state dictionary. But since we want to associate multiple years with each state abbreviation, we should append them to a list, rather than simply assigning the value to the key (which would in fact overwrite any previous value associated with that key).

Alia has written the following code, but she keeps getting a KeyError. Can you help her fix it?

years_by_state = {}
for addr in address_data:
    state_key = addr["state"]
    year_value = addr["year_built"]
    years_by_state[state_key].append(year_value)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[73], line 5
      3 state_key = addr["state"]
      4 year_value = addr["year_built"]
----> 5 years_by_state[state_key].append(year_value)

KeyError: 'NY'

Hint: Python will not allow you to use the append() method on a list that hasn’t been defined. Since each key in years_by_state is associated with a list, the first time you add a key to the dictionary, you need to define the list. To test whether a key is already in a dictionary, you can use an if statement with a Boolean expression. The following expression evaluates as True if the key "NY" is already in the dictionary years_by_state and as False otherwise:

"NY" in years_by_state
#Your code here
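The grouping pattern described in the hint can be sketched on an unrelated, made-up dataset: check whether the key is present, create an empty list if it is not, and then append. All names here (`demo_pets`, `ages_by_kind`) are hypothetical.

```python
demo_pets = [{"kind": "cat", "age": 3},
             {"kind": "dog", "age": 5},
             {"kind": "cat", "age": 7}]

ages_by_kind = {}
for pet in demo_pets:
    kind = pet["kind"]
    if kind not in ages_by_kind:
        ages_by_kind[kind] = []  # first time we see this key: start a list
    ages_by_kind[kind].append(pet["age"])

print(ages_by_kind)  # {'cat': [3, 7], 'dog': [5]}
```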

Loops & conditionals: Categorizing data

Priya is analyzing survey data wherein each respondent has indicated the programming languages they have studied. In the following extract of her datafile, each response is associated with a unique, anonymous identifier and a list of languages. If the respondent indicated no languages, the list is empty.

responses = [{"id": "c9e703e0-ad9a-11ee-ac72-acde48001122",
              "languages": ["R", "Java", "Python"]},
             {"id": "c9e7089a-ad9a-11ee-ac72-acde48001122",
                  "languages": []},
             {"id": "c9e70976-ad9a-11ee-ac72-acde48001122", 
              "languages": []},
             {"id": "c9e70a16-ad9a-11ee-ac72-acde48001122", 
              "languages": ["R"]},
             {"id": "c9e70a8e-ad9a-11ee-ac72-acde48001122", 
              "languages": ["Python", "C/C++"]},
             {"id": "c9e70b06-ad9a-11ee-ac72-acde48001122", 
              "languages": ["Other"]},
             {"id": "c9e70b74-ad9a-11ee-ac72-acde48001122", 
              "languages": ["R", "Python"]},
             {"id": "c9e70bec-ad9a-11ee-ac72-acde48001122",
              "languages": ["R", "Python"]},
             {"id": "c9e70c64-ad9a-11ee-ac72-acde48001122", 
              "languages": ["Python"]},
             {"id": "c9e70cc8-ad9a-11ee-ac72-acde48001122", 
              "languages": ["R"]}]

In order to simplify her analysis, Priya has decided to code her data as follows:

  • If the respondent indicated more than one language, the response is coded as a 3.

  • If the respondent indicated only one language, but that language is either Python or R, the response is coded as a 2.

  • If the respondent indicated no language, or they indicated a single language other than Python or R, the response is coded as a 1.

Her logical plan is as follows:

  1. Create a new list to hold the coded data.

  2. Loop over the responses list; for each response, create a new dictionary to hold the ID of the current response and its language code.

  3. Use an if-elif-else statement to evaluate the data associated with the "languages" key.

  4. Depending on which branch of the if statement runs, assign a value to a "language_code" key in the new dictionary, and append that dictionary to the new list.

Can you write some code to implement Priya’s plan?

Hint: To test whether a string is present in a list, use an expression with the keyword in. For instance, if the variable lang_list holds a list of strings, to check if the string "Python" appears in lang_list, we can write:

if "Python" in lang_list:
    print("Python is in the list")
#Your code here
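To illustrate how an if-elif-else statement sorts items into categories, here is a sketch on made-up data about colors rather than languages. The names and the coding rules below are hypothetical; note that the elif branch is only reached when the first condition is False.

```python
demo_items = [{"id": 1, "colors": ["red", "blue"]},
              {"id": 2, "colors": ["red"]},
              {"id": 3, "colors": []},
              {"id": 4, "colors": ["green"]}]

demo_coded = []
for item in demo_items:
    coded = {"id": item["id"]}  # a new dictionary for this item
    colors = item["colors"]
    if len(colors) > 1:
        coded["color_code"] = 3  # more than one color
    elif "red" in colors or "blue" in colors:
        coded["color_code"] = 2  # exactly one color, and it is red or blue
    else:
        coded["color_code"] = 1  # no colors, or a single other color
    demo_coded.append(coded)

print(demo_coded)
```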