Extra Practice for Python Camp
Agenda & Instructions
This notebook provides a series of exercises designed as extra practice for the material covered in the daily lessons and the homework. You are by no means required to work through these exercises, but doing so may prove helpful in a couple of situations:
If you want to review a particular concept in order to solidify your understanding.
If you want to build more “muscle memory” by writing code beyond the opportunities provided by the group activities and homework.
The exercises below are arranged by concept, largely following the order in which the concepts were presented in the course materials. The code in each section is self-contained, so feel free to jump around and/or pick only certain sections to do.
Answers to the challenges in each section are provided in a collapsed (Solution) cell at the bottom of the section.
How to Use this Notebook
Read the documentation above each cell containing code and run the cell (Ctrl+Enter or Cmd+Return) to view the output.
Follow the prompts labeled Try it out! that ask you to write your own code in the provided blank cells.
(Hidden) solutions to these exercises follow the blank cells; click the toggle bar to expand the solution to compare with your approach.
Some prompts include alternative exercises (Parsons Problems) that will be linked from the prompt. These alternatives may help clarify concepts (especially if you find yourself struggling to keep up with all the syntax).
Optional annotations (labeled For the curious...) provide additional explanation and/or context for those who want them. Feel free to skip these sections if you like. As a beginner, it’s important to maintain a balanced cognitive load: taking in too much information all at once can impede your progress toward understanding. This balance looks different for everyone, but we have tried to keep the main content focused on a few key concepts, tools, and techniques, while providing that additional context for those who might benefit from it.
Integers, floats & strings
Li X. is a graduate student working in a bioinformatics lab, and he is learning how to process protein and DNA sequences using Python. His first task is to compute the lengths of various sequences, three examples of which are given below. Write some code to compute the length of each of these sequences.
seq1 = "LYLIFGAWAGMVGTALSLLIRAELGQPGTLLGDDQIYNVIVTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASSTVEAGAGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGVSSILGAINFITTAINMKPPTLSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIL"
seq2 = "VGTALXLLIRAELXQPGALLGDDQIYNVVVTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLMASSTVEAGAGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGISSILGAINFITTAINMKPPALSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIL"
seq3 = "WAGMVGTALSLLIRAELGQPGALLGDDQIYNVVXTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLMASSTVEAGVGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGISSILGAINFITTAINMKPPALSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIL"
#Your code here
Now Li needs to find the average length of a collection of sequences, seq1
through seq3
above, in addition to the following two sequences. Write some code to find the average length of all five sequences.
seq4 = "VGTALSLLIRAELGQPGTLLGDDQIYNVIVTAHAFVMIFFMVMPVMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASSTVEAGAGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGVSSILGAINFITTAINMKPPALSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIL"
seq5 = "LYLIFGAWAGMVGTALSLLIRAELGQPGALLGDDQVYNVVVTAHAFVMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASSTVEAGVGTGWTVYPPLAGNLAHAGASVDLAIFSLHLAGISSILGAINFITTAINMKPPALSQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPVLYQHLFWFFGHPEVYILIX"
#Your code here
The lab director is pleased with Li’s progress and gives him another task. The lab is trying out some new software, and this software requires that each sequence used as input be prefixed with a number representing the length of the sequence, followed by a space. So for seq1
above, the input would look as follows (the sequence has been abbreviated for display purposes):
231 LYLIFGAWAGMVGTALSLL....
Write some code to add the sequence length (and a space) to the beginning of each of the sequences given above.
#Your code here
So far Li really likes working with Python; he can see how it will make his work in the lab a lot more efficient. One thing that confuses him, however, is that whole numbers in Python are represented sometimes as one-place decimals, e.g., 30.0, and sometimes without the decimal part, e.g., 30. Surely Python isn’t being arbitrary? Comparing the output of the following two lines of code, what can you tell about the kinds of situations in which the decimal part appears when dealing with whole numbers?
len(seq1)
len(seq2) / 10
Solutions
Finding the length of strings
Li’s sequences are represented as Python strings – we can tell because each sequence is surrounded by quotation marks (""
). In Python, a string has a length, which we can find by using the built-in len()
function. To find the length of the string represented by the variable seq1
, we write
len(seq1)
To find the average of all five sequences, we add up the lengths of each sequence and divide by the total number of sequences (5). Note the use of parentheses in the code below, which ensures that the addition happens before the division. (Python’s order of operations is similar to what you’d expect from a calculator.)
avg_length = (len(seq1) + len(seq2) + len(seq3) + len(seq4) + len(seq5)) / 5
Concatenating strings
To prefix the length of the sequence to the sequence itself, we need to make a new string. In Python, we can join two strings together (an operation called concatenation) by using the + operator. Note that in the context of strings, + does not perform numeric addition; it performs concatenation. In other words, 1 + 2 produces 3 (addition), but "1" + "2" produces "12" (concatenation), because the quotation marks around the digits in the second example tell Python to treat them as strings, not numbers.
As a result, the following code will produce a TypeError
:
new_seq1 = len(seq1) + seq1
Here we asked Python to “add” an integer (the result of len(seq1)
) to a string (the value of seq1
), which is an undefined operation in Python. To remedy the error, we need to make sure that the values on both sides of the plus sign are strings. We can use the built-in str()
function to convert the integer result of len()
to a string. While we’re at it, we also concatenate a single space (between quotation marks) as required in the instructions:
new_seq1 = str(len(seq1)) + " " + seq1
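As a minimal sketch of the same idea, the block below applies the conversion to a short sequence and shows the TypeError being raised when str() is left out. The variable names and the ten-letter example_seq are assumptions made for illustration; they are not part of Li's data. The last lines show an f-string, which is another way to build the same string.

```python
# Hypothetical short sequence, standing in for the long ones above
example_seq = "LYLIFGAWAG"

# str() turns the integer from len() into a string, so + concatenates
prefixed = str(len(example_seq)) + " " + example_seq
print(prefixed)  # 10 LYLIFGAWAG

# Without str(), Python raises a TypeError rather than guessing what we meant
try:
    broken = len(example_seq) + example_seq
except TypeError as err:
    print("TypeError:", err)

# An f-string builds the same result; values inside {} are converted for us
prefixed_f = f"{len(example_seq)} {example_seq}"
print(prefixed_f == prefixed)  # True
```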
Floats vs. integers
len(seq1) returns 231 (no decimal part) because the len() function is defined always to return a value of type integer. That makes sense; nothing in Python that has a length (e.g., a string, a list) is ever going to have a fractional length.
len(seq2) / 10 returns 22.0, which is mathematically a whole number, but here it is given a one-place decimal representation because division in Python (the / operator) is defined always to return a value of type float. Division returns a float even in cases where the result could be represented as an integer (as in this case). Python behaves this way in order to provide consistency: as a programmer, you can rely on the division operator always to return the same data type. Generally speaking, integers and floats in Python are interoperable, meaning that you can mix both freely in most types of calculation.
If for some reason you need to convert a float to an integer, you can use the int() function: e.g., int(len(seq2) / 10) returns 22 (no decimal part).
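The following sketch illustrates the difference with plain numbers. The value 220 is a hypothetical length chosen for the example. It also shows floor division (the // operator), which is not used in the exercises above but returns an integer when both operands are integers.

```python
length = 220  # hypothetical sequence length (an integer)

true_division = length / 10    # the / operator always returns a float
floor_division = length // 10  # // with two integers returns an integer

print(true_division)   # 22.0
print(floor_division)  # 22
print(type(true_division), type(floor_division))

# int() drops the fractional part; it does not round to the nearest integer
truncated = int(225 / 10)  # 225 / 10 is 22.5
print(truncated)  # 22
```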
Working with strings (1)
Alia M. is a social scientist working with data on residential housing patterns. She has a large datafile of residential properties in the United States, consisting of street addresses and some information about the type of housing at each address. For each row in the datafile, the first part consists of a string that represents the property’s street address, including city, state abbreviation, and zip code. The address is preceded by a seven-character unique identifier. One such string is given below.
address1 = "AGF5670 2123 N. 3rd St., St. Cloud, MN 56301"
Alia’s first task is to extract the unique identifier (AGF5670
in the example above) from each string. How can she use a string slice to accomplish this task? Write some code that extracts the identifier from the string stored in the address1
variable.
#Your code here
Next Alia would like to extract the five-digit zip code from the address string. (The zip code is 56301
in the example provided.) What code can she use to do that? (Assume that all zip codes in the datafile consist of five digits.)
#Your code here
Great work! This code is really going to make Alia’s life easier – no more manual data entry for her! The next challenge, as you might guess, is to extract the US postal code for the state. Assume that the addresses in this datafile all have a two-letter abbreviation (MN
, for Minnesota, in the example above).
#Your code here
Finally, Alia needs to extract the street address itself, together with the city name: e.g., 2123 N. 3rd St., St. Cloud
. The length of this portion of the address varies throughout the datafile, so ideally, your code would work for any string fitting the above pattern, regardless of length.
#Your code here
To make sure your code works properly on addresses of different lengths, try extracting the different parts of address2
below.
address2 = "WGA9753 412 Mockingbird Lane, Crowley, LA 70506"
#Your code here
Solutions
There are other ways to accomplish Alia’s tasks, but all of the above can be done using slicing. How can we determine that? Looking at the example given, and assuming that the rest of the datafile is consistent, we have the following pattern:
Alia wants to subdivide the string into four elements, three of which have a fixed length: the identifier, the state abbreviation, and the zip code. Only the street address and city are variable, and if we treat the latter as a single element, we can construct slices for all four elements using what we know about the lengths of the first three, together with negative indexing.
The table below shows the lengths of each element along with the indices for slicing (taking into account the white space between identifier, address, state, and zip code):
Element | Length | Slice From | Slice To | Slice Value |
---|---|---|---|---|
identifier | 7 characters | 0 | 7 | "AGF5670" |
street address & city | varies | 8 | -9 | "2123 N. 3rd St., St. Cloud," |
state abbreviation | 2 characters | -8 | -6 | "MN" |
zip code | 5 characters | -5 | - | "56301" |
1. To extract the identifier, we write address1[0:7] or address1[:7]. (These expressions are equivalent.)
2. To extract the zip code, we count back five characters from the end, starting with -1, which yields address1[-5:]. Note that we leave off the second number in the slice (after the colon) because we want to take a slice including the last character.
3. To extract the state abbreviation, we count back again from the end to the -8 position (the M) and add 2 to get our slice: address1[-8:-6].
4. To extract the remaining part of the address, we start from the second position after the end of the 7-character identifier (to account for the white space, which we don’t want in our slice), and we take up to (but not including) the 9th character from the end (the space before the state code): address1[8:-9].
We could use slightly different code for step 4 in order to exclude the comma after the city name ("St. Cloud"
), i.e., address1[8:-10]
.
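To check that these slices do not depend on the length of the address, the sketch below applies them to both example strings. The helper name slice_address is an assumption introduced for illustration, and the function relies on every string following the same layout (7-character identifier, then the address, then a 2-letter state and a 5-digit zip code).

```python
def slice_address(address):
    """Hypothetical helper: pull the four parts out of one address string."""
    identifier = address[:7]          # first 7 characters
    street_and_city = address[8:-10]  # skip the space; stop before ", ST 12345"
    state = address[-8:-6]            # two letters before the final space
    zip_code = address[-5:]           # last 5 characters
    return identifier, street_and_city, state, zip_code

address1 = "AGF5670 2123 N. 3rd St., St. Cloud, MN 56301"
address2 = "WGA9753 412 Mockingbird Lane, Crowley, LA 70506"

parts1 = slice_address(address1)
parts2 = slice_address(address2)
print(parts1)  # ('AGF5670', '2123 N. 3rd St., St. Cloud', 'MN', '56301')
print(parts2)  # ('WGA9753', '412 Mockingbird Lane, Crowley', 'LA', '70506')
```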
N.b. We could have also accomplished these tasks using Python’s str.split()
method instead of slicing. See the next section for details.
Working with strings (and lists)
Social scientist Alia has some additional data points in her file that she wishes to extract. These data consist of two alphanumeric codes and a four-digit year. The first element indicates the type of dwelling, the second indicates whether the property is a rental or occupant-owned, and the third indicates the year in which the property was built. The following example shows the string for a single-family rental property built in 1986.
address_info1 = "SF,R,1986"
Alia is feeling pretty confident about Python string slicing, so her first approach is to write three slice expressions to extract the information between the commas in the string above. Can you write those expressions below, assigning each part of the string (without the commas) to a new variable?
#Your code here
Alia is pleased with herself until she realizes that not all of the codes in the datafile are of the same length. The code in the first position can be either “SF” (single-family), “MF” (multi-family), or “Mixed” (for properties with a mix of residential and commercial spaces). Likewise, the second code can be either “R” (rental), “O” (owned), or “n/a” (a null value, used when the information was not available).
Since two of the three elements in this string are of variable lengths, it’s not easy to write slice expressions that will extract all three elements. Fortunately, Alia has learned about the str.split() method. Write some code that will split the address_info1
string into three parts, and assign each part to a new variable.
#Your code here
Now verify that your approach works for strings with elements of different lengths: address_info2
and address_info3
below.
address_info2 = "MF,n/a,1996"
#Your code here
address_info3 = "Multi,R,2007"
#Your code here
“That split method is pretty cool!” Alia thinks. She wonders whether it would be a good idea to split the address strings as well (see the introduction to Working with Strings (1)), instead of slicing them.
Given the address1 variable below, what happens if you split it on the white space? Do you think this would be a good approach for separating out the elements of the string: the 7-character identifier, the street address and city name (of variable length), the two-letter state abbreviation, and the five-digit zip code? Why or why not?
address1 = "AGF5670 2123 N. 3rd St., St. Cloud, MN 56301"
#Your code here
Using the result of splitting address1
, can you extract the identifier, state abbreviation, and zip code?
#Your code here
Bonus. Now Alia is having trouble working with the street address and city name in the result of splitting address1. A colleague mentioned the str.join() method, which does the inverse operation of str.split().
Can you use list slicing and the join
method to produce a single string corresponding to the street address and city name in address1
, starting from the result of splitting the latter?
Solutions
Since the three elements in the address_info
strings are separated by commas, we can use the str.split()
method with its optional first argument, sep
. Note that the Python documentation tells us the following:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
In other words, if we write address_info1.split()
with nothing between parentheses, the sep
argument will be None
by default, which means that the string will be split on white space, with any number of consecutive spaces being treated as a single instance (a single separator). That’s the default behavior, but we can modify this by providing a single argument to split
for the separator, either with or without the argument name:
address_info1.split(",")
and
address_info1.split(sep=",")
work equally well, giving the list ["SF", "R", "1986"]
as the result.
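The difference between the default behavior and an explicit separator is easiest to see on a string with irregular spacing. The string below is a hypothetical example made up for this sketch, not a value from Alia's datafile.

```python
# Hypothetical string with runs of several spaces between the parts
spaced = "SF   R  1986"

# No argument: runs of whitespace count as a single separator
default_parts = spaced.split()
print(default_parts)  # ['SF', 'R', '1986']

# Explicit sep=" ": every single space is a separator, so empty strings appear
space_parts = spaced.split(" ")
print(space_parts)  # ['SF', '', '', 'R', '', '1986']
```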
The same approach works also for address_info2
and address_info3
. In each case, we can access the individual elements by indexing into the list returned by split
:
result1 = address_info1.split(",")
property_type1 = result1[0]
rent_or_owned1 = result1[1]
year_built1 = result1[2]
Assuming that the strings with this information always have two and only two commas, the split
method will always return a three-element list.
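Because the list always has three elements under that assumption, the indexing step can also be written as a single assignment that unpacks the list into three variables. This is a sketch of an alternative, not the approach used in the solution above. The conversion of the year with int() is an extra step added here for illustration.

```python
address_info2 = "MF,n/a,1996"

# Unpack the three-element list returned by split() into three variables.
# Python raises a ValueError if the list does not have exactly three elements.
property_type2, rent_or_owned2, year_built2 = address_info2.split(",")

# split() always returns strings; convert the year if a number is needed
year_built2 = int(year_built2)

print(property_type2)  # MF
print(rent_or_owned2)  # n/a
print(year_built2)     # 1996
```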
We could take the same approach with the address1
string ("AGF5670 2123 N. 3rd St., St. Cloud, MN 56301"
), splitting this time on the whitespace (the default):
address_parts1 = address1.split()
Now address_parts1
will be the following list: ['AGF5670', '2123', 'N.', '3rd', 'St.,', 'St.', 'Cloud,', 'MN', '56301']
Assuming the strings in Alia’s datafile are consistently formed, we could extract the identifier, state abbreviation, and zip code as follows:
identifier1 = address_parts1[0]
state_abbreviation1 = address_parts1[-2]
zip_code1 = address_parts1[-1]
Extracting the street address and city proves more difficult, however. In our example, each part of the street address ("2123 N. 3rd St.,"
) is separated by white space, as are the two parts of the city name ("St. Cloud,"
). This situation poses two problems for using the str.split()
method:
It’s not going to be terribly useful – at least, not for Alia’s purposes – to treat the parts of the street address or city name as different elements. For instance, the “N.” and the “St.” in “N. 3rd St.” don’t really mean anything on their own, and the same goes for “St.” and “Cloud” in “St. Cloud.”
Different addresses and cities will involve different amounts of white space. Consider “3150 Elm St.” or “Roanoke.”
To extract the address/city name element from our original address string using split, we could take the following approach:
Split the original string on white space, producing a list.
address_parts1 = address1.split()
Take a slice of the resulting list (not the original string) in order to get the elements between the first (the identifier) and the second-to-last (the state abbreviation).
street_address1 = address_parts1[1:-2]
Note that street_address1 is also a list. (Slicing a string returns a string; slicing a list returns a list.) We can use this list with the str.join method, which takes as its argument a list and returns a string that consists of each element in the list separated by whatever string the method is called on. The following code creates a new string out of the street address/city name elements by gluing them together (so to speak) with white space:
street_and_city1 = " ".join(street_address1)
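Putting the split, list slice, and join steps together, a sketch along the following lines handles addresses of different lengths. The helper name split_address is an assumption made for this example, and the call to rstrip(",") (which removes the trailing comma after the city name) is an optional extra not required by the exercise.

```python
def split_address(address):
    """Hypothetical helper: split an address string into its four parts."""
    parts = address.split()             # split on white space
    identifier = parts[0]               # first element
    state = parts[-2]                   # second-to-last element
    zip_code = parts[-1]                # last element
    middle = parts[1:-2]                # list slice: street address and city
    street_and_city = " ".join(middle)  # glue the pieces back together
    return identifier, street_and_city.rstrip(","), state, zip_code

address1 = "AGF5670 2123 N. 3rd St., St. Cloud, MN 56301"
address2 = "WGA9753 412 Mockingbird Lane, Crowley, LA 70506"

split_parts1 = split_address(address1)
split_parts2 = split_address(address2)
print(split_parts1)  # ('AGF5670', '2123 N. 3rd St., St. Cloud', 'MN', '56301')
print(split_parts2)  # ('WGA9753', '412 Mockingbird Lane, Crowley', 'LA', '70506')
```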
In this case, splitting isn’t necessarily a better approach to Alia’s first problem than slicing. But the larger point is that there are almost always more ways than one of solving a problem with Python. In this case, which approach you choose might come down to your preferences as a programmer.
Working with lists
Priya J.’s office at the university has surveyed students about the need for additional resources and support for learning programming languages. Having recently learned about Python herself, Priya would like to use it to analyze the survey results. She has managed to load the survey data from a CSV file using Python. Each row of the survey responses is a Python list.
One such list is shown below. The elements, in order from left to right, comprise the following data points:
Respondent’s email address
Department code (of the respondent’s major or field of study)
Respondent’s status (undergraduate or graduate student)
Respondent’s degree (BA, MA, PhD, etc.)
Respondent’s expected year of graduation
A string representing one or more programming languages the respondent has studied, separated by commas
Five numbers corresponding to Likert-scale responses to a series of questions (each response being a number between 1 and 10)
response1 = ["george.washington@gwu.edu", "Engl", "UND", "BA", 2025, "Python,R", 5, 6, 4, 7, 8]
From Python Camp, Priya knows that she can use indexing to access individual elements in a list, but she’s a little fuzzy on the details. Can you help her out? In the cells below, write some code to access the following elements:
The respondent’s email address
The respondent’s status ("UND" in the example provided)
The respondent’s expected year of graduation
#Your code here
#Your code here
#Your code here
Priya needs to perform some data cleanup on the responses, as well as a preliminary analysis. First, she wants to anonymize the results, replacing each email address with a unique, random identifier. The standard Python installation includes a handy library for generating such identifiers called uuid. In the following code, Priya imports the uuid1()
function, which will generate a single identifier.
from uuid import uuid1
In the cell below, write some code that assigns the result of calling the uuid1() function to the first position in the response1 list, replacing the email address.
If your code works as intended the response1
list should look something like the following:
[UUID('ed603b16-a8c7-11ee-90be-ee5daa8b5d7a'), 'Engl', 'UND', 'BA', 2025, 'Python,R', 5, 6, 4, 7, 8]
#Your code here
Priya notices that the values for the possible department codes (the second element in the response1 list) were not coded consistently on the survey form. Some of them are given in title case (e.g., "Engl"), while some are given in uppercase ("CHEM"). To fix this problem, Priya decides to use the str.upper() method to convert every department code to uppercase.
Write some code to replace the second element of the response1
list, which is a string, with its uppercase version, using the .upper()
method available on the string. (For example, if dept1
is a variable holding a string, dept1.upper()
will return a new string composed of converting the original string to uppercase.)
#Your code here
Now on to some analysis. Priya would like to compute the average of the five response scores (the last five elements of the list) and add this value to the end of the list. That way, each row of responses will contain its average score as its final element.
Create a new variable, average1
, and assign it to the average of the five scores in response1
. Hint: Priya has learned about the sum() function, which takes as its argument a list of numbers and returns their sum. To find the average score, divide the sum of scores by the number of scores (5).
#Your code here
Bonus. Now Priya needs to add the value in average1
to the end of the list. She remembers learning about the list.append()
method; when called with an argument on a Python list, .append()
will add its argument to the end of the list. When Priya runs the following code, it produces no errors.
response1 = response1.append(average1)
But then when she runs print(response1)
, instead of displaying her list with the average included, she gets None
. What happened to her list?
Can you tell what went wrong? How should Priya append average1
to the response1
list?
#Your code here
Solutions
List indices
We can use the following code to access the values in the 1st, 3rd, and 5th positions in the response1
list. Remember that in Python, indexing starts at 0
, not 1
.
email1 = response1[0]
status1 = response1[2]
grad_year1 = response1[4]
Replacing an element in a list
To replace the email address with a random identifier, we can use indexing, too. The code below replaces whatever is in the 1st position with the result of calling the function uuid1():
response1[0] = uuid1()
The same logic holds for replacing the department code (position 2) with its uppercase variant. Note that we can call the str.upper()
method directly on response1[1]
because we know that the expression response1[1]
will return a value of type string (str
), and every string in Python has access to the .upper()
method.
response1[1] = response1[1].upper()
Slicing a list
A concise way to sum over the five scores in the response1
list is to access the scores with a slice. Slicing works for lists as well as strings; just as the slice of a string returns a string, the slice of a list returns a list. We can even use negative indexing. The slice syntax [-5:]
means “starting with the 5th element from the end, return the elements up to and including the last element in the list.”
average1 = sum(response1[-5:]) / 5
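As a small variation, the sketch below stores the slice in its own variable (scores1, a name introduced here for illustration) and divides by the length of that slice instead of a hard-coded 5. The result is the same, but the calculation would still be correct if the number of scores in the slice changed.

```python
response1 = ["george.washington@gwu.edu", "Engl", "UND", "BA", 2025,
             "Python,R", 5, 6, 4, 7, 8]

# Slicing a list returns a new list holding just the last five elements
scores1 = response1[-5:]
print(scores1)  # [5, 6, 4, 7, 8]

# Divide by the number of scores actually in the slice
average1 = sum(scores1) / len(scores1)
print(average1)  # 6.0
```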
Appending to a list
Priya’s confusion stems from the fact that strings and lists, while both collections in Python, behave differently when using the methods associated with each. Whenever we modify a string in Python, we get back a new string (i.e., an entirely new sequence of characters in memory).
So if the variable dept1
holds a string, in order to make dept1
uppercase, we have to assign the result of the str.upper()
method back to the variable:
dept1 = dept1.upper()
Or if we’re applying the method directly to a string in a list, we assign the result back to that element:
response1[1] = response1[1].upper()
Lists, on the other hand, are modified in place in Python. In other words, modifying a list – e.g., by adding an element to the end – will transform the original list in memory, instead of creating an entirely new list. (This is done to conserve memory; since Python lists can contain any combination of valid data types, they can get quite large, depending on the program, and it wouldn’t be optimally efficient to have multiple copies of such lists sitting around in memory.)
The upshot is that to append to the response1
list, we use the following code, which calls the .append()
method without reassigning the result to the original list. (Note the absence of the equals sign).
response1.append(average1)
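The contrast between the two behaviors can be seen in a few lines. The short list and string below are hypothetical values used only for this sketch.

```python
scores = [5, 6, 4]  # hypothetical short list

# append() changes the list in place and returns None
result = scores.append(7)
print(result)  # None
print(scores)  # [5, 6, 4, 7]

dept = "Engl"  # hypothetical department code

# upper() leaves the original string alone and returns a new string
upper_dept = dept.upper()
print(dept)        # Engl
print(upper_dept)  # ENGL
```

This is why assigning the result of .append() back to the list variable replaces the list with None.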
Working with dictionaries
Priya has tried a different method of loading the survey responses from the CSV file, such that each row of responses is now a Python dictionary. The following cell defines one such dictionary.
r1_dict = {"email": "martha.washington@gwu.edu",
"department": "BISC",
"status": "GRAD",
"degree": "PhD",
"year": 2028,
"languages": "Python,R,C++",
"q1": 3,
"q2": 4,
"q3": 6,
"q4": 2,
"q5": 3}
In the cell below, write some code to access the values associated with the "degree"
, "languages"
and "q1"
keys. Assign each value to a new variable.
#Your code here
The "languages"
key is associated with a string representing one or more languages studied by the respondent. (If the respondent has studied no programming languages, the string “n/a” is used as a null value.) When multiple languages are present, they are separated by commas. Priya would like to convert this string to a list.
Write some code to split()
the value of the "languages"
key in r1_dict
on the commas, and assign the result back to the same key.
#Your code here
Previously, Priya calculated the average of all five question scores and appended the result to the end of the list that represented the survey response. Since she’s using a dictionary now, she can add the average as its own key.
Write some code to calculate the average of the values associated with "q1"
through "q5"
and add that result to the r1_dict
dictionary, using a key called "average_score"
.
#Your code here
Solutions
Accessing values by key
With lists we access individual elements by providing the index (as an integer) between square brackets ([]
). Access to dictionary elements follows a similar pattern, except that the element between brackets is the dictionary key (usually, but not always, a string).
To access the values for the "degree"
, "languages"
and "q1"
keys, we can write the following. Note the use of quotation marks around the keys inside brackets: that is important, because the keys are strings.
degree1 = r1_dict["degree"]
languages1 = r1_dict["languages"]
q1_1 = r1_dict["q1"]
Updating values by keys
Updating works the same as access. Note that if we want to update the value associated with a key, we put the expression with the key on the left side of the equals sign.
l = r1_dict["languages"].split(",")
r1_dict["languages"] = l
In the code above, the first line retrieves the value of the "languages"
key, splits the string on the commas, and then assigns the result to the temporary variable l
. We could also write this code in one line, as follows:
r1_dict["languages"] = r1_dict["languages"].split(",")
Conceptually, this code is similar to the code for reassigning a variable, e.g. n = n + 1
.
Adding new keys and values
Adding a new key & value uses the same syntax as updating the value of an existing key. Because dictionaries do not have a syntax analogous to slicing a list, in order to calculate the average of the respondent’s survey scores, we have to access each value separately:
average1 = (r1_dict["q1"] + r1_dict["q2"] + r1_dict["q3"] + r1_dict["q4"] + r1_dict["q5"]) / 5
r1_dict["average_score"] = average1
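If typing out each key feels repetitive, one alternative is to keep the question keys in a list and add up the values with a for loop. The sketch below assumes a trimmed-down dictionary containing only the five score keys, and the names question_keys and total are introduced here for illustration.

```python
# Trimmed-down version of r1_dict with only the score keys, for brevity
r1_dict = {"q1": 3, "q2": 4, "q3": 6, "q4": 2, "q5": 3}

# The keys whose values we want to average
question_keys = ["q1", "q2", "q3", "q4", "q5"]

total = 0
for key in question_keys:
    total = total + r1_dict[key]  # look up each score by its key

# Assigning to a key that does not exist yet adds it to the dictionary
r1_dict["average_score"] = total / len(question_keys)
print(r1_dict["average_score"])  # 3.6
```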
Loops & lists: Summarizing a list
In Integers, Floats, & Strings, bioinformatics student Li calculated the average length of a number of protein sequences, where each sequence was represented by a separate Python string variable (seq1
, seq2
, seq3
, etc.). Li’s colleague pointed out that this approach will become cumbersome when dealing with large amounts of data: typically, the number of sequences to be analyzed runs to the hundreds or thousands, so it’s not feasible to create a separate variable to hold the string representing each sequence.
Instead, Li’s colleague proposes that he use a Python list to hold the sequences (a list of strings), and a for loop to do the calculation. After a little research, Li understands that he can use the loop to add up the lengths of all the sequences in the list, and he can divide the total length (of sequences) by the length of the list itself (the number of sequences).
The following code defines such a list of sequences:
sequences = ["FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF",
"KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM",
"EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK",
"MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK",
"EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL",
"SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR",
"FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI",
"SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF",
"SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM",
"KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK",
"FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK"]
Note that in the list above, a single variable – sequences – is defined to hold the entire list of strings. To access each string in the list, we can either use indexing (see Working with Lists) or write a loop.
The loop variable in a for loop gives us access to each element in the list in order. For example, in the code below, the seq
variable would first get assigned the string beginning with FQTWEEF
, then the string beginning with KYRTWEEF
, and so on, for every string in the list:
for seq in sequences:
For loops are a very powerful construct for dealing with a large amount of data that needs to be processed in the same way.
In this case, the task is to add up the lengths of all the sequences. Li realizes that in addition to the loop variable, he’ll need to use another variable to keep track of the running total of lengths. Such a variable is defined below.
total_length = 0
Write a loop for Li that will add the length of each string in sequences
to the total_length
variable.
#Your code here
Now using total_length
, find the average length of sequences in the list. (Hint: you can find the number of elements in a list by taking its length.)
#Your code here
Bonus. Can you modify the loop code from above in order to find the length of the longest sequence in the sequences list? Hint: Python has a built-in max() function that, when given two arguments, will return the larger of the two: e.g., max(1, 2) returns 2.
#Your code here
Solutions
Summing over a list
Since we have already defined total_length
by assigning it to 0
, to add up all the lengths, we just need to increment total_length
inside the body of the loop. Each time, add the length of the current list element, which we can find by applying the len()
function to the loop variable (seq
in the code below):
for seq in sequences:
    total_length = total_length + len(seq)
Note that the line incrementing total_length
is indented, since we want it to be executed once per element in sequences
.
To find the average length, we then divide total_length
by the length of sequences
itself:
average_length = total_length / len(sequences)
The latter line of code should be executed after the loop is finished, so we don’t indent it. In terms of the logic of our code, it’s important to observe that len(seq)
(inside the loop) and len(sequences)
represent two very different things: the first represents the length of each string (one after the other, in order, as the loop executes); the second represents the length of the list itself, i.e., the number of strings (sequences) within it.
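The difference between the two calls to `len()` can be seen in a small sketch. The list `sample_sequences` below is hypothetical and deliberately short so that the numbers are easy to check by hand:

```python
# Hypothetical sample data: three short strings
sample_sequences = ["SWEEF", "KNW", "FDSWDEFV"]

total_length = 0
for seq in sample_sequences:
    # len(seq) is the number of characters in the current string: 5, then 3, then 8
    total_length = total_length + len(seq)

# len(sample_sequences) is the number of strings in the list: 3
average_length = total_length / len(sample_sequences)
print(total_length)
print(average_length)
```

Here `total_length` ends up as 16 (5 + 3 + 8), and the average is 16 divided by 3.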
Finding the max length
To find the length of the longest sequence, we can use the following code:
max_length = 0
for seq in sequences:
    max_length = max(len(seq), max_length)
The pattern is quite similar to what we used above. The only difference is that instead of reassigning total_length
to itself plus the length of the current sequence, we reassign max_length
to the result of comparing the previous value of max_length
and the length of the current sequence. The following table shows the values of max_length
and len(seq)
for the first few iterations of our loop:
Iteration | Previous value of max_length | Value of len(seq) | New value of max_length |
---|---|---|---|
1st | 0 | 62 | 62 |
2nd | 62 | 107 | 107 |
3rd | 107 | 67 | 107 |
Stepping through a loop like this is a good way to get a handle on what actually happens when the loop is run.
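One way to step through a loop is to print the relevant values on every iteration. The sketch below assumes a short, hypothetical list (`sample_sequences`) and an extra, hypothetical variable `previous` that remembers the value of `max_length` before it is updated:

```python
# Hypothetical sample data with lengths 5, 8, and 3
sample_sequences = ["SWEEF", "KNWEDFEI", "FDS"]

max_length = 0
for seq in sample_sequences:
    previous = max_length                   # value before this iteration's update
    max_length = max(len(seq), max_length)  # keep whichever is larger
    # Show the previous value, the current length, and the new value
    print(previous, len(seq), max_length)
```

The three printed lines are `0 5 5`, `5 8 8`, and `8 3 8`: the maximum grows only when the current string is longer than anything seen so far.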
Loops & lists: Transforming a list#
Social scientist Alia would like to extract the seven-character unique identifiers from a list of addresses, where each address is a string starting with the identifier.
The following code defines such a list.
addresses = ["UP63659 93 North 9th Street, Brooklyn, NY 11211",
"LH06225 380 Westminster St, Providence, RI 02903",
"AJ30648 177 Main Street, Littleton, MA 03561",
"QL46276 202 Harlow St, Bangor, ME 04401",
"QQ72122 46 Front Street, Waterville, ME 04901",
"PY84657 22 Sussex St, Hackensack, NJ 07601",
"CJ55552 75 Oak Street, Patchogue, NY 11772",
"FU76399 1 Clinton Ave, Albany, NY 12207",
"ZK38387 7242 Route 9, Plattsburgh, NY 12901",
"JF67659 520 5th Ave, Mckeesport, NY 15132",
"KJ13496 122 W 3rd Street, Greensburg, NY 15601",
"HW78470 901 University Dr, State College, PA 16801",
"PL44372 240 W 3rd St, Williamsport, PA 17701",
"NJ95036 41 N 4th St, Allentown, PA 18102"]
Alia knows that she can extract the identifier from each string by using a slice, such that if the variable addr
holds the string "UP63659 93 North 9th Street, Brooklyn, NY 11211"
, then the expression addr[:7]
will give the string "UP63659"
.
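A quick sketch of the slice on a single string may help before writing the loop. The variable names `identifier` and `rest` are hypothetical and used only for illustration:

```python
addr = "UP63659 93 North 9th Street, Brooklyn, NY 11211"

identifier = addr[:7]  # characters at positions 0 through 6
rest = addr[8:]        # everything after the space at position 7

print(identifier)
print(rest)
```

Because the identifier is always seven characters long, the same slice works for every address in the list.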
How can Alia write a for loop to perform this slice for every string in the addresses
list, storing the results (the identifiers) in a separate list called identifiers
?
(She knows that the first step is to define the new identifiers
list, which is empty to start with.)
identifiers = []
(She also knows that she needs to use the .append()
method, which is available on any Python list, to add each identifier to the identifiers
list.)
#Your code here
Solution
The goal here is to transform one list of strings (those representing complete street addresses, along with unique identifiers) into a new list, which will contain strings representing only the identifiers from the original list.
This is a common pattern, similar to the patterns used in Summarizing a list, except that where previously we were computing a single value (a total or maximum amount), here we’re populating a list of values derived from the original list.
Code to accomplish this (assuming that identifiers
has first been defined as an empty list, as above) might look as follows:
for addr in addresses:
    unique_id = addr[:7]
    identifiers.append(unique_id)
Alternately, skipping the intermediate variable assignment, we can write
for addr in addresses:
    identifiers.append(addr[:7])
The crucial bit is the indentation of the line(s) after for addr in addresses:
– as elsewhere in Python, the white space designates the code that should be run inside the loop (i.e., the code to be repeated). Thus, we are taking the same slice (of the first seven characters) from every string (addr
) in addresses
, and then adding the result of each slice to the identifiers
list.
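To check the pattern on a small scale, the sketch below runs the same loop over just the first two addresses. The names `sample_addresses` and `sample_identifiers` are hypothetical stand-ins for the full lists:

```python
# First two addresses from the list above
sample_addresses = ["UP63659 93 North 9th Street, Brooklyn, NY 11211",
                    "LH06225 380 Westminster St, Providence, RI 02903"]

sample_identifiers = []  # the new list starts out empty
for addr in sample_addresses:
    sample_identifiers.append(addr[:7])

print(sample_identifiers)  # ['UP63659', 'LH06225']
# The new list has one identifier per address; the original list is unchanged
print(len(sample_identifiers) == len(sample_addresses))
```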
Nested data 1: Loops, lists, & dictionaries#
Data analyst Priya needs to calculate the average response score for each of five questions on her survey of GW students and programming support. As described in Working with Dictionaries, Priya has loaded a datafile using Python such that each survey response is represented by a dictionary. The complete set of responses is represented by a list of dictionaries, with each dictionary having the same keys but different values. A sample of the dataset is shown below.
responses = [{"id": "27a20f28-ab40-11ee-ac72-acde48001122", "department": "GEOG", "status": "GRAD", "degree": "MA", "year": 2025, "languages": "Python,R", "q1": 7, "q2": 6, "q3": 1, "q4": 1, "q5": 6},
{"id": "27a2100e-ab40-11ee-ac72-acde48001122", "department": "GEOG", "status": "UND", "degree": "BA", "year": 2026, "languages": "Python", "q1": 1, "q2": 4, "q3": 1, "q4": 9, "q5": 3},
{"id": "27a21068-ab40-11ee-ac72-acde48001122", "department": "CHEM", "status": "GRAD", "degree": "PhD", "year": 2025, "languages": "Python,R,C++", "q1": 2, "q2": 7, "q3": 5, "q4": 3, "q5": 4},
{"id": "27a210a4-ab40-11ee-ac72-acde48001122", "department": "ECON", "status": "UND", "degree": "BA", "year": 2024, "languages": "R", "q1": 7, "q2": 4, "q3": 8, "q4": 5, "q5": 2},
{"id": "27a210e0-ab40-11ee-ac72-acde48001122", "department": "PSYC", "status": "UND", "degree": "BA", "year": 2028, "languages": "Python", "q1": 1, "q2": 1, "q3": 4, "q4": 8, "q5": 9}]
For each response, the numeric scores for the five questions are stored under the keys "q1"
through "q5"
. Can you write some code to calculate the average score for each question, across all five responses in the responses
list?
Hint: you’ll need to use a for loop, combined with one or more variables, defined outside the loop, to store the running totals for each question’s scores.
#Your code here
Priya is interested in possible correlations between respondents’ answers to the five survey questions and the number of programming languages the respondents have studied.
The response data include, for each respondent, a string that consists of one or more programming languages, separated by commas. Priya knows that by using the str.split()
method with an argument of ","
(see Working with Dictionaries), she can obtain a list of programming languages for each respondent, and she reasons that by taking the length of that list, she can obtain an integer representing the number of languages reported by that respondent.
Can you write some code to add a "num_languages"
key to each dictionary in the responses
list, where the value of the key corresponds to the number of languages listed under the "languages"
key?
#Your code here
Bonus: If you find that your code for computing the averages seems rather redundant, with multiple lines doing almost the same thing, can you identify any ways to reduce the redundancy? (Shorter code is not necessarily better code, but situations where code gets repetitive do tend to multiply opportunities for programmer error.)
#Your code here
Solutions
Accessing values in a list of dictionaries
To compute the average score for each of the five survey questions, a logical plan might look like the following:
Define five variables to hold the total scores for the five questions.
Loop through the list of responses.
For each response, access the value in each of the question keys ("q1" through "q5") and add it to the appropriate variable.
Outside the loop, divide each question’s total by the number of responses.
Here is some code to implement this plan:
q1 = 0
q2 = 0
q3 = 0
q4 = 0
q5 = 0
for resp in responses:
    q1 = q1 + resp["q1"]
    q2 = q2 + resp["q2"]
    q3 = q3 + resp["q3"]
    q4 = q4 + resp["q4"]
    q5 = q5 + resp["q5"]
q1_avg = q1 / len(responses)
q2_avg = q2 / len(responses)
q3_avg = q3 / len(responses)
q4_avg = q4 / len(responses)
q5_avg = q5 / len(responses)
Whenever you’re working with for loops, it’s a good idea to pay close attention to the indentation. The five lines indented under the line for resp in responses:
will be executed once for each response in the list. This pattern allows us to tally up the total for each question. The lines beginning with q1_avg
, etc., are not indented; these lines will each run only once, after we have finished adding up the totals. (Feel free to use as many blank lines as you like in order to make your code readable; in most situations, blank lines are ignored by the Python interpreter.)
Adding values to dictionaries in a list
To add the number of languages to the data for each respondent, we also need to use a loop. The logical plan for this task is a bit more straightforward: for each response, (1) we need to access the value stored under the "languages"
key, which we expect to be a Python string; (2) then we need to split that string on the commas, (3) take the length of the resulting list, (4) and store the result under a "num_languages"
key in that response dictionary:
for resp in responses:
    languages = resp["languages"]
    resp["num_languages"] = len(languages.split(","))
The code provided does steps 2, 3, and 4 in the same line, but you could also code those as separate lines. The crucial parts are that we’re retrieving the value from the "languages"
key (resp["languages"]
is on the right side of an equals sign), doing something to that value, and saving the result under the "num_languages"
key (resp["num_languages"]
is on the left side of an equals sign).
And don’t forget the quotation marks around "languages"
and "num_languages"
when using them as dictionary keys – the dictionary keys are strings, and here we’re providing their literal string values.
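The sketch below spells out steps 2 through 4 on separate lines, using a hypothetical two-response sample (`sample_responses`) that keeps only the "languages" key. The intermediate variable `language_list` is also an illustration:

```python
# Hypothetical sample: only the "languages" key is kept from each response
sample_responses = [{"languages": "Python,R,C++"},
                    {"languages": "R"}]

for resp in sample_responses:
    language_list = resp["languages"].split(",")  # step 2: split on the commas
    resp["num_languages"] = len(language_list)    # steps 3 and 4: count and store

print(sample_responses)
```

The first response gets a "num_languages" value of 3 and the second a value of 1. One caveat worth knowing: splitting an empty string on "," returns a list containing one empty string, so a respondent whose "languages" value was "" would be counted as having one language, not zero.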
Bonus
To reduce repetition when finding the average scores, we can use a dictionary to store the running totals instead of separate variables.
For instance, we could define a total_scores
dictionary, with a key for each question, and we could initialize them all to 0:
total_scores = {"q1": 0,
"q2": 0,
"q3": 0,
"q4": 0,
"q5": 0}
Then, inside the for loop that gives us each response, to avoid having to type total_scores["q1"]
, total_scores["q2"]
, etc., we can take advantage of the fact that looping over a dictionary in Python gives us the dictionary keys. Then the rest of the logical plan would look like this:
Loop over the list of responses.
For each response, loop over the keys in the total_scores dictionary. (We do this inside the first loop.)
For each key from total_scores, increment its value in total_scores with the value of the same key in the current response dictionary.
To find the averages, loop over the total_scores dictionary again (after the loop over the responses is done), dividing the value for each key by the length of responses.
The following code implements our plan:
for resp in responses:
    for q in total_scores:
        total_scores[q] = total_scores[q] + resp[q]
for q in total_scores:
    total_scores[q] = total_scores[q] / len(responses)
Note that this code is less repetitive than our previous version, but it’s also potentially harder to understand. In using this code, it’s important to keep in mind that the loop variable resp
and the loop variable q
refer to very different things:
resp will hold each of the dictionaries in the list responses
q will hold each of the keys in total_scores, which are strings: "q1", "q2", etc.
This code illustrates using a variable as a key to access a value in a dictionary. So resp[q]
is equivalent to resp["q1"]
, resp["q2"]
, and so on.
Feel free to stick with the initial, more repetitive implementation. There’s nothing wrong with code that takes up a few more lines if it’s easier to reason about or understand.
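For anyone who wants to trace the nested-loop version by hand, here is a small variation. It assumes a hypothetical sample with only two questions and two responses, and it stores the averages in a separate, hypothetical dictionary (`average_scores`) rather than overwriting the totals, so that both remain available afterward:

```python
# Hypothetical sample: two responses, two questions each
sample_responses = [{"q1": 7, "q2": 6},
                    {"q1": 1, "q2": 4}]

total_scores = {"q1": 0, "q2": 0}
for resp in sample_responses:
    for q in total_scores:
        # q is a string such as "q1"; it works as a key in both dictionaries
        total_scores[q] = total_scores[q] + resp[q]

average_scores = {}
for q in total_scores:
    average_scores[q] = total_scores[q] / len(sample_responses)

print(total_scores)
print(average_scores)
```

The totals come out to 8 for "q1" and 10 for "q2", and the averages to 4.0 and 5.0.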
Nested data 2: Creating a list of dictionaries#
In analyzing nucleotide sequences, it’s common to count the number of bases of each type – A, G, C, T – that appear in the sequence. A list of such sequences is defined below, where each sequence is a string consisting of the same four characters but in different combinations.
sequences = ["ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatattctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc",
"ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatattctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc",
"ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatattctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtctatgcatcgatcgacgactg",
"tgcaccaaacatgtctaaagctggaaccaaaattactttctttgaagacaaaaactttcaaggccgccactatgacagcgattgcgactgtgcagatttccacatgtacctgagccgctg",
"caactccatcagagtggaaggaggcacctgggctgtgtatgaaaggcccaattttgctgggtacatgtacatcctaccccggggcgagtatcctgagtaccagcactggatgggcctcaa"]
Li wants to analyze the sequences in the sequences list. For each sequence, he needs to record how many times each base occurs in the sequence. Since each sequence consists of the same four bases, one approach would be to create a list of dictionaries, where each dictionary would contain a key for each base. For example, a single sequence might be represented by the following dictionary:
{"a": 24,
"c": 55,
"g": 33,
"t": 28}
The values in the dictionary record the number of occurrences of each base. The following is Li’s logical plan for implementation:
Define a new, empty list to hold the base counts for each sequence.
Loop over the list of sequences.
For each sequence, define a new dictionary to hold the base counts, initializing each base to 0.
Within the outer loop, loop over the letters (the bases) in the sequence itself. (Looping over a string gives each character in turn.)
For each base, increment the key corresponding to that base by 1.
Add the dictionary of base counts to the list.
Can you write some code to implement Li’s plan?
#Your code here
Solution
The following code has been annotated with comments corresponding to the steps from the plan described above.
base_counts = []  #Step 1
for seq in sequences:  #2
    base_count_dict = {"a": 0, "c": 0, "g": 0, "t": 0}  #3
    for base in seq:  #4
        base_count_dict[base] += 1  #5
    base_counts.append(base_count_dict)  #6
As usual, the indentation levels are of particular importance. There are three levels in the code above. The following table makes explicit what is happening at each level.
Level | Step | Action |
---|---|---|
1 | 1 | Defines the empty list to hold all the dictionaries |
1 | 2 | For loop: over the list of sequences |
2 | 3 | Defines a new dictionary to hold the base counts for the current sequence in the loop |
2 | 4 | For loop: over the characters in the current sequence string |
3 | 5 | Using the loop variable from the inner loop, which holds the current character, increments the value in the dictionary associated with that key |
2 | 6 | Adds the dictionary to the list of base counts (using the list .append() method) |
The other thing to notice is that we use the base
loop variable as the key in the base_count_dict
dictionary, because this variable will hold each character from the sequence in succession, and we have defined base_count_dict
to have a key for each character. The keys are one-character strings: "a"
, "c"
, "g"
, and "t"
.
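Running the same code on very short strings makes the result easy to verify by eye. The list `sample_sequences` below is hypothetical:

```python
# Hypothetical sample data: two four-letter sequences
sample_sequences = ["ggta", "ccat"]

base_counts = []
for seq in sample_sequences:
    # A fresh dictionary is created for every sequence
    base_count_dict = {"a": 0, "c": 0, "g": 0, "t": 0}
    for base in seq:
        base_count_dict[base] += 1
    base_counts.append(base_count_dict)

print(base_counts)
```

The first dictionary records one "a", no "c", two "g", and one "t"; the second records one "a", two "c", no "g", and one "t". Note that this approach assumes every character in a sequence is one of the four lowercase keys: any other character (an uppercase "A", say, or an "n") would raise a KeyError on the increment line.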
Nested data 3: Loops and conditionals#
Using Python, Alia has transformed her dataset into a list of dictionaries, such that each dictionary contains data about a particular residential property, including its address and the year in which the dwelling was built. The dataset now looks like the following example:
address_data = [{"street_address": "93 North 9th Street",
"city": "Brooklyn",
"state": "NY",
"zip": "11211",
"year_built": 1995},
{"street_address": "380 Westminster St",
"city": "Providence",
"state": "RI",
"zip": "02903",
"year_built": 1979},
{"street_address": "177 Main Street",
"city": "Littleton",
"state": "MA",
"zip": "03561",
"year_built": 1992},
{"street_address": "202 Harlow St",
"city": "Bangor",
"state": "ME",
"zip": "04401",
"year_built": 1964},
{"street_address": "46 Front Street",
"city": "Waterville",
"state": "ME",
"zip": "04901",
"year_built": 2007},
{"street_address": "22 Sussex St",
"city": "Hackensack",
"state": "NJ",
"zip": "07601",
"year_built": 1983},
{"street_address": "75 Oak Street",
"city": "Patchogue",
"state": "NY",
"zip": "11772",
"year_built": 1996},
{"street_address": "1 Clinton Ave",
"city": "Albany",
"state": "NY",
"zip": "12207",
"year_built": 1961},
{"street_address": "7242 Route 9",
"city": "Plattsburgh",
"state": "NY",
"zip": "12901",
"year_built": 1932},
{"street_address": "520 5th Ave",
"city": "Mckeesport",
"state": "NY",
"zip": "15132",
"year_built": 1958},
{"street_address": "122 W 3rd Street",
"city": "Greensburg",
"state": "NY",
"zip": "15601",
"year_built": 1976},
{"street_address": "901 University Dr",
"city": "State College",
"state": "PA",
"zip": "16801",
"year_built": 1939},
{"street_address": "240 W 3rd St",
"city": "Williamsport",
"state": "PA",
"zip": "17701",
"year_built": 1934},
{"street_address": "41 N 4th St",
"city": "Allentown",
"state": "PA",
"zip": "18102",
"year_built": 1981}]
Alia would like to do some statistics on the ages of the dwellings in her dataset, and she’d like to group the analysis by state. Ideally, she would like to be able to select a state and see a list of the years in which dwellings were built. So using the example above, "PA"
would correspond to 1939, 1934, and 1981.
A reasonable structure would be a dictionary in which the keys are strings corresponding to the state abbreviations, and the values are lists of years. The entry for Pennsylvania would look as follows:
{...,
"PA": [1939, 1934, 1981],
...}
Her logical plan is as follows:
Define an empty dictionary to hold the new data; call it years_by_state.
Loop over the address_data list, each element of which should be a dictionary with a "state" key and a "year_built" key.
Get the value of the "state" key from the current address (the loop variable).
This value becomes the key to use with the years_by_state dictionary.
Get the value of the "year_built" key from the current address (the loop variable).
This becomes the value to add to the years_by_state dictionary. But since we want to associate multiple years with each state abbreviation, we should append them to a list, rather than simply assigning the value to the key (which would in fact overwrite any previous value associated with that key).
Alia has written the following code, but she keeps getting a KeyError
. Can you help her fix it?
years_by_state = {}
for addr in address_data:
    state_key = addr["state"]
    year_value = addr["year_built"]
    years_by_state[state_key].append(year_value)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[73], line 5
3 state_key = addr["state"]
4 year_value = addr["year_built"]
----> 5 years_by_state[state_key].append(year_value)
KeyError: 'NY'
Hint: Python will not allow you to use the append()
method on a list that hasn’t been defined. Since each key in years_by_state
is associated with a list, the first time you add a key to the dictionary, you need to define the list. To test whether a key is already in a dictionary, you can use an if statement with a Boolean expression. The following expression evaluates as True
if the key "NY"
is already in the dictionary years_by_state
and as False
otherwise:
"NY" in years_by_state
#Your code here
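Before writing the full loop, it may help to try the membership test from the hint on a tiny dictionary. The dictionary `sample_years` below is hypothetical:

```python
# Hypothetical dictionary with a single key
sample_years = {"NY": [1995]}

print("NY" in sample_years)  # True
print("PA" in sample_years)  # False
print(1995 in sample_years)  # False: in checks the keys, not the values
```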
Solution
A straightforward solution involves modifying Alia’s code with an if...else
block. The logic is as follows:
If the dictionary value of addr["state"] is already present in years_by_state as a dictionary key, then
Assume that this key is associated with a list,
And use the list.append() method to add the value of addr["year_built"] to that list.
Otherwise, add the value of addr["state"] as a key whose value is a list with a single element, the value of addr["year_built"].
years_by_state = {}
for addr in address_data:
    state_key = addr["state"]
    year_value = addr["year_built"]
    if state_key in years_by_state:
        years_by_state[state_key].append(year_value)  #4a
    else:
        years_by_state[state_key] = [year_value]  #4b
Note that the expression years_by_state[state_key].append(...)
works because of the order in which Python evaluates expressions:
The value of state_key, which should be a string (a two-letter state abbreviation), per the definition of the address_data variable and the operation of the loop, is retrieved from memory.
This string is used as a key to retrieve a dictionary value from the years_by_state variable. This value will be a list, per the logic of the code.
The latter value, being a list, has an append() method by default, and this method is used to add a new element to the list.
The logic might be a little clearer if we reverse the conditions in the if...else
block. (Note the use of not
in the fifth line.)
years_by_state = {}
for addr in address_data:
    state_key = addr["state"]
    year_value = addr["year_built"]
    if not state_key in years_by_state:
        years_by_state[state_key] = [year_value]  #4b
    else:
        years_by_state[state_key].append(year_value)  #4a
The code above produces the same result; the only difference is the order in which we, the programmers, read the conditions. But this order makes it clear that each key in the years_by_state
dictionary is created to hold a list of one element, and that this list is subsequently appended to.
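A further variation, sketched here as one possible alternative rather than a recommended approach, is to create an empty list the first time a state appears and then always append. The names `sample_data` and `sample_years_by_state` are hypothetical, and the sample keeps only the two keys the loop needs:

```python
# Hypothetical sample: three records with only the relevant keys
sample_data = [{"state": "NY", "year_built": 1995},
               {"state": "PA", "year_built": 1939},
               {"state": "NY", "year_built": 1996}]

sample_years_by_state = {}
for addr in sample_data:
    state_key = addr["state"]
    year_value = addr["year_built"]
    if state_key not in sample_years_by_state:
        sample_years_by_state[state_key] = []  # first time we see this state
    sample_years_by_state[state_key].append(year_value)

print(sample_years_by_state)
```

With this sample, "NY" ends up associated with the list [1995, 1996] and "PA" with [1939]. The trade-off is that there is no else branch to read, at the cost of the append happening on a separate line from the check.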
Loops & conditionals: Categorizing data#
Priya is analyzing survey data wherein each respondent has indicated the programming languages they have studied. In the following extract of her datafile, each response is associated with a unique, anonymous identifier and a list of languages. If the respondent indicated no languages, the list is empty.
responses = [{"id": "c9e703e0-ad9a-11ee-ac72-acde48001122",
"languages": ["R", "Java", "Python"]},
{"id": "c9e7089a-ad9a-11ee-ac72-acde48001122",
"languages": []},
{"id": "c9e70976-ad9a-11ee-ac72-acde48001122",
"languages": []},
{"id": "c9e70a16-ad9a-11ee-ac72-acde48001122",
"languages": ["R"]},
{"id": "c9e70a8e-ad9a-11ee-ac72-acde48001122",
"languages": ["Python", "C/C++"]},
{"id": "c9e70b06-ad9a-11ee-ac72-acde48001122",
"languages": ["Other"]},
{"id": "c9e70b74-ad9a-11ee-ac72-acde48001122",
"languages": ["R", "Python"]},
{"id": "c9e70bec-ad9a-11ee-ac72-acde48001122",
"languages": ["R", "Python"]},
{"id": "c9e70c64-ad9a-11ee-ac72-acde48001122",
"languages": ["Python"]},
{"id": "c9e70cc8-ad9a-11ee-ac72-acde48001122",
"languages": ["R"]}]
In order to simplify her analysis, Priya has decided to code her data as follows:
If the respondent indicated more than one language, the response is coded as a 3.
If the respondent indicated only one language, but that language is either Python or R, the response is coded as a 2.
If the respondent indicated no language, or they indicated a single language other than Python or R, the response is coded as a 1.
Her logical plan is as follows:
Create a new list to hold the coded data.
Loop over the responses list; for each response, create a new dictionary to hold the ID of the current response and its language code.
Use an if-elif-else statement to evaluate the data associated with the "languages" key.
Based on which branch of the if statement is True, assign a value to a "language_code" key in the current response dictionary, and append the latter to the new list.
Can you write some code to implement Priya’s plan?
Hint: To test whether a string is present in a list, use an expression with the keyword in
. For instance, if the variable lang_list
holds a list of strings, to check if the string "Python"
appears in lang_list
, we can write:
if "Python" in lang_list:
#Your code here
Solution
The following is one potential solution. The commented numerals in the code refer to the annotations below.
coded_responses = []
for resp in responses:
    coded_resp = {}  #1
    coded_resp["id"] = resp["id"]  #2
    lang_list = resp["languages"]
    if len(lang_list) > 1:  #3
        lang_code = 3
    elif "Python" in lang_list or "R" in lang_list:  #4
        lang_code = 2
    else:
        lang_code = 1
    coded_resp["language_code"] = lang_code  #5
    coded_responses.append(coded_resp)  #6
1. Since we want the data structure for the coded responses to have the same structure as the original responses list, we define a new, empty dictionary to hold each coded response. (We technically could reuse the dictionary from responses, which is held by the loop variable resp, but this approach keeps things cleaner, and it preserves our original data as is, should we need it later.)
2. We need to copy the value of the "id" key over to the new dictionary.
3. Here we’re testing our conditions against the list held by the "languages" key in each dictionary from responses. The order of our conditions in this case does matter, because only the branch under the first condition that evaluates to True will be executed. According to Priya’s specifications, if the response includes more than one language, regardless of language, it should be coded as a 3. In other words, a response with a value for lang_list of ["Python", "R"] would receive a 3, not a 2. In order to implement this logic in Python, we need to test for this condition first.
4. This condition is evaluated only if lang_list has fewer than 2 elements. Note the use of the or keyword to check for both conditions. Note also that we cannot write elif "Python" or "R" in lang_list. (Technically, we can write that, but it won’t have the desired effect: elif "Python" ... will always evaluate to True, no matter what comes after, because a non-empty string is always treated as True by an if statement.) The correct, though unfortunately verbose, way to code this statement is to repeat the in lang_list expression for each string we’re testing.
5. Because lang_code is assigned something in every branch of the if-elif-else statement, we can assume that it will have the correct value at this point. And be careful about indentation: if this line were indented under the else: line, it would get executed only for cases where lang_code is 1, which is not at all what we want!
6. It’s easy to forget to append our dictionary to the new list. But if we omit this step, coded_resp will get a new dictionary on every iteration of the loop, and each dictionary will replace the last without being saved anywhere.
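The pitfall described in the fourth annotation can be seen directly in a short sketch. The variable names `wrong` and `right` are hypothetical, and the sample list is chosen so that neither "Python" nor "R" is present:

```python
lang_list = ["Other"]

# Python reads this as: "Python" or ("R" in lang_list)
wrong = "Python" or "R" in lang_list

# Each string gets its own membership test
right = "Python" in lang_list or "R" in lang_list

print(wrong)  # Python
print(right)  # False
```

The first expression never looks inside the list at all: because "Python" is a non-empty string, the or expression stops there and produces that string, which an if or elif statement treats as True. The second expression performs both membership tests and correctly produces False for this list.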