Homework Assignment #3 — Mutation Testing

In this assignment you will write a tool to carry out simple mutation testing on Python programs. Given a program, your tool must generate mutants of that program that "kill" (cause to fail) test cases. You should generate mutants that are strong enough to kill a test case but not so strong as to kill every test case (cf. mutation adequacy score).

You may work with a partner for this assignment. If you do you must use the same partner for all sub-components of this assignment. Use the Gradescope feature to select your partner. Only one partner needs to submit the report on Gradescope, but if you both do, nothing fatal happens.

You are still allowed to use online (or "on-this-webpage") resources and the like as long as you cite them in your writeup.

Cross-Platform Compatibility and Python

It is your responsibility to submit Python code that works on the grading server. This is true even if (indeed, especially if) the grading server uses a different version of Python than your development environment.

This assignment uses Python because the focus is on the mutation algorithm and not the compiler front-end. In theory, it would be just as easy to do this assignment on C++ code using LLVM (e.g., via clang) or the like. In practice that would just add out-of-scope overhead without reinforcing course concepts. If you are interested in such topics, consider the Compilers and Programming Languages electives.

HW3a — Mutation Testing (Python)

You mutation testing tool will take the form of a Python program called mutate.py. Your mutate.py must define a mutate() method that accepts a single argument (an abstract syntax tree) and returns a single value (a mutated abstract syntax tree).

We provide driver code that calls your mutate.py multiple times to create multiple mutants. You are responsible for writing code that interfaces correctly with this driver.

The driver code will invoke your mutate() method and store the results in the current directory as 0.py, 1.py, 2.py, etc. Each mutant is a copy of the original program with one or more mutation operators (see below) applied. Any other output (e.g., logging to standard output) that your mutate.py produces is ignored.

The data structure used to represent Python source files is the abstract syntax tree (AST).

You should use the ast module as well as the astor module.

Many students report that the so-called "missing Python AST docs" are quite helpful here.

Other students and instructors have found this quick Python AST Tutorial to be quite helpful (thanks S. Ajami) for the basic concepts.

Your program can (and should) use random to decide where and how to apply mutation(s). However, the driver code is responsible for setting the random seed. This is done to ensure that your local testing and the grading server are as close as possible. Your code cannot set the random seed.

The ast.NodeTransformer portion of the interface is likely to be quite relevant for this assignment.

Mutation Operators

At minimum, you must implement and support the following three mutation activities:

Negate any single comparison operation type (e.g., >= becomes <).
- (You should strongly consider handling more than one comparison type. You have free choice for what you do, if anything, for other "corner-case" booleans such as is or True.)
Swap binary operators + and -, as well as * and //.
- (You have free choice of which AST node classes are relevant here, such as how or whether you deal with FloorDiv vs. regular division. Any non-empty choice is full credit. You should strongly consider handling more than binary operators as well.)
- (You can swap operators "as you like". The most obvious approach is to swap + with -, and so on, but you can try something more creative if you like.)
Delete an assignment or function call statement (i.e., cause it to have no effect).
- (You have free choice of which AST node classes are relevant here. For example, you might decide that ``assignment'' means Assign, AnnAssign and AugAssign. You may also choose to deal with assignments, function calls, or both. Any non-empty choice is full credit.)
- (Other edits that have the "effect" of deleting the assignment or function call but are not "officially" deletion are totally fine and are full credit.)
- (Similarly, whether you interpret function call as func() or x = func() or both is up to you.)
- (Be careful not to invalid Python code!)

Notes:

Note that you should implement these mutations, but they do not all have to be equally likely — or even occur at all. For example, you could have code that would correctly swap * to //, but you decide to make it run 0% of the time because you think not doing it will be better for mutation testing. That's fine and full credit: we can see your source code to tell that you completed the assignment, but you can still get high marks on the autograder by not producing overly-dangerous mutants.
Students are often anxious about adhering to this part of the specification. The quality of your mutation test generator will be assessed automatically by the autograder, regardless of which operators you use. Separately, whether or not you implemented the right operators will be assessed by a human — a live person who is not out to get you. Any reasonable choice you make here is fine, just document it. Please be reassured. The pedagogical goal is to get you experience with some different AST node types and mutations, not to make sure that you found exactly the AnnAssign vs. AugAssign node types and memorized the difference between them.

Note that implementing those three actions may not be enough to get full points on the autograder, but you do have to implement those three and describe them in your report to get full credit on the report. You may also need to implement more operators or heuristics to get full credit on the autograder.

Note also that you do not have to use all three with the same frequency (or even use them at all — you could set their probabilities to zero). You do, however, have to implement them as a minimal baseline.

Mutation Testing

For this assignment we will make use of a simple target subject program called subject.py. It is intentionally short and simple so that you can read and comprehend it directly (cf. white-box testing). The focus of this assignment is on mutation analysis, not on large programs per se (there are many other places in the course where we use large programs).

Recall that one goal of mutation testing is to assess the adequacy of a test suite. We would thus like to produce mutants that give different adequacy scores to different test suites. A more complete test suite should get a higher score.

To get started, download hw3.zip, which contains three Python files. You will need to install two packages:

pip install func_timeout --break-system-packages
sudo apt install python3-astor

We invoke the program by running the driver.py code. You may see something like this:

$ python3 driver.py
using your mutate.py to create 10 mutants
        ...
---
Your mutants selectively killed 2 of the 17 tests!
+ f08(0,0)
+ f10([0, 3, 5, 6, 7, 8, 9],0)

You can view the mutants (e.g., 0.py) in the current directory. They are overwritten each time.

The goal is to complete mutate.py so that it tends to generate selective mutants: those that kill exactly one test. You can (and should!) read all of the provided code, including the tests.

The autograder is identical to the files you have, but it generates 50 mutants instead of ten. Once you feel comfortable, you can can edit driver.py to generate 50 mutants locally. This allows you to test ideas out locally without going to the autograder.

The most common student questions for this assignment are along the lines of "I'm not sure how to generate selective mutants that are neither too strong nor too weak. My mutation analysis cannot every kill test X alone. What should I do?" That is the key puzzle of this assignment. We explicitly will not tell you more that what you see on this page.

Using the Library — Visitor Patterns and Starter Code

A very common source of student confusion relates to how the AST library is structured. It uses a visitor pattern, a way of structuring and interfacing code. We will cover such "Design Patterns" later in class. For now, a visitor pattern provides abstraction and modularity by hiding from you the exact fields and way in which a data structure is traversed or transformed. (This allows the library writer to change the internal representation of the data structure later, such as to move from Python 2 to Python 3 or whatever, without all of the client code needing to update.)

The mutate.py starter code provided shows two simple mutations implemented using a visitor pattern in two passes.

In the first pass, we visit every node in the tree and count the number of relevant nodes. In the second pass, we pick a random number between 1 and that node count, and then visit every node again. When we get to the chosen node, we edit it.

Students can also find other relevant example code online, such as wrap_integers.py, which may help with understanding the visitor pattern software design pattern.

Hints, Advice and Doing Well

Many students report that they find it easy to get a few points on this assignment but hard to get a high score. That is, it is hard to write a mutate.py that produces mutants that are strong enough, but not too strong.

If you find that your program is not producing enough high-quality mutants to assess the adequacy of a test suite:

Try making a high-quality test suite locally and evaluating against it. You can change driver.py and/or subject.py locally. After the previous homeworks, you know a bunch of good ways to make a high quality test suite. You could do all of this yourself. The advantage of doing it yourself is that you can try things out "locally" as often as you want without hitting the submission limit.
Take a look at the feedback from the driver. If many of your mutants are "too strong", you are likely using mutation operators that are "too powerful" and are creating mutants that are killed by every test suite. Otherwise, you are likely using mutation operators that are too weak (don't change relevant things, don't change enough, etc.).
Make sure you actually have a chance of mutating every relevant node and using every category of node or mutation. The starter code has some logic for this, but you may want to revisit it.
Consider carefully the number of mutation operators you apply to create a mutant. The starter code only ever applies one. You probably want to consider higher-order mutations — but you want to be very sparing with them!
You may want to implement additional mutation operators. See Section II.b of Jia and Harman's An Analysis and Survey of the Development of Mutation Testing for ideas. (This is actually the best hint we provide, but students often overlook it because it seems like "work" to skim the paper.)

You may want to avoid generating "useless" or "stillborn" mutants, such as those that do not parse, always exit with an error, are equivalent to a previously-generated mutant, etc. See Sections III and IV of the Jia and Harman survey for ideas. Here are some common student pitfalls. You should check locally to see if you are making any of these.

Not running pylint! After creating your mutants, you can just run pylint -E [0-9]*.py. Change your mutation operations to minimize the number of linting errors reported. (Run pylint on its own, not inside some other shell script that is piping output somewhere.)
Creating Python programs with bad indentation. Be careful about deletions. Since Python cares about indentation and the like, just removing something entirely can mess up the rest of the program. Perhaps instead of removing something you may want to replace it with a statement that has no effect, such as 1 ?
Creating Python programs that reference variables before those variables are defined. Hack: what if you kept track of the first time each variable was mentioned and never deleted the first instance and also never copied a reference before its first mention?
Flipping a coin at each node to decide whether or not to mutate it. This is problematic, because you'll end up with some mutants with 0 edits and some with 100 edits. Instead, try fixing a budget in advance (e.g., "this time we will do 2 mutations") and then choosing random places to apply them.
Spending too much time enumerating many possible mutations or mutation locations first. You will likely hit the timeout on the autograder — find a faster way to make a choice at random.

HW3b — Written Report

You must also write a short three-paragraph PDF report reflecting on your experiences creating a mutation testing tool for this assignment. Your report must include your University email address(es). Consider addressing topics such as:

Which mutation operators did you implement?
How does your program decide which operators to apply?
What was harder than you expected or different than you expected?
How would you compare and contrast mutation analysis with statement coverage as a test suite adequacy metric.

Rubric:

5 points — a three-paragraph report reflecting on your activities creating a mutation testing tool for this assignment, including a discussion of mutation operators and your experiences and expectations
4 points — a reasonable report, but lacking a solid description of one or more aspects or being rote rather than insightful or significantly exceeding the length limit
3 points — a brief report, detailing only half of the required information
2 points — a report drawing only the bare minimum of contrasts comparisons, and explanations
1 point — a terse or uninformative report, perhaps describing only one activity or operator

-1 point (etc.) — submitted source code does not implement the require operators (even loosely)
-1 point — English prose or grammatical errors
-1 point — Submission appears to be AI-generated without meaningful revision, at the grader’s discretion. Note: Written Report and Contribution Reflection will be assessed by separate grading teams.

The grading staff will select a small number of excerpts from particularly high-quality or instructive reports and share them with the class. If your report is selected you will receive extra credit.

Sanity Checking Your Submission

The autograder uses the same driver.py and subject.py that you are given. You can do as many tests as you like locally!

HW3c — Collaboration Reflection

You must also complete a short written reflection about how you collaborated with others and how you made decisions about correctness in this assignment. Download the reflection template, edit it with your answers, and submit the completed file via the autograder.io server.

Important: The reflection report is graded independently from the written report, and by a separate grading team. If you report using AI tools in the reflection report, this will not affect your HW grades. You will only lose points in your written report if your report appears to be AI-generated without meaningful revision.

Submission

Submit a single mutate.py file to the autograder. Submit a single PDF report (HW3b) via Gradescope. You must also submit a separate contribution reflection (HW3c) via the autograder. In your PDF report, you must include your name and UM email ID (as well as your partner's name and email ID, if applicable). The reflection must be submitted individually.

FAQ and Troubleshooting

In this section we detail previous student issues and resolutions:

Question: It's generally working, but I'm not getting the score I want on the autograder.

Answer: One student suggests: "exploring out and covering more mutation operators rather than trying to get really granular on how to mutate each operator will lead to success a lot faster!"
Question: When I try to create mutants, even locally, the to_source function complains.

Answer: If you create a Python AST that is missing critical information (like an important tree child, or location information) or otherwise create something totally invalid, the library won't even let you print it out. For example, if you had "x = y + z" and you remove the "z" node entirely without doing anything else, the library might not even print out the result at all because it is not valid Python. Even something as simple as saying ast.GtE instead of ast.GtE() can cause this.
Question: My mutate.py somewhat works, but I see this sort of error:
```
Traceback (most recent call last):
  File "0.py", line 42
    else:
       ^
IndentationError: expected an indented block
```
Answer: You are creating mutants that are not valid Python programs. In this example, the student is removing an entire statement right after an else: — but then Python gets confused because it is expectint a tabbed-over statement there. So the mutant is not a valid Python program. Check out the hints on how to deal with this sort of issue.
Question: My submission does not seem to create any mutants on the autograder and/or it can't seem to find them:
```
ls: cannot access '[0-9]*.py]': No such file or directory
```
Answer: This almost always means that your mutate.py has an error (such as importing a library not found on the autograder, or raising an exception) that means it is not running to completion.

This issue is, in some sense, the most common student problem. There is no silver bullet here: test locally and make use of the sanity check submission.
Question: It feels like the autograder is non-deterministic. I feel like I am submitting the same mutate.py twice and getting different results.

Discussion: There are many possible issues here that all result in the same observed symptom.

Answer 1: Your mutate.py may be non-deterministic. (This is rare, since student are careful about it.)

Answer 2: Your mutate.py may be creating mutants that are, themselves, non-deterministic. This is much trickier, and happens surprisingly often.

Answer 3: There may be differences between the autograder setup and your local setup. For example, the autograder uses export PYTHONHASHSEED=0 to try to determinize the order in which hashmaps are iterated.
Discussion: Ultimately, it may not be worth your time to track this down. Remember that we use your best submission, not your last one. Start early.
Question: I am getting erors like TypeError: 'type' object is not subscriptable. Help!

Answer: Double-check the spelling on the AST pieces you are creating or manipulating. For example, use ast.GtE() not the incorrect ast.GtE (note the parentheses, which are required).
Question: When I run locally, I get
```
ImportError: No module named 'astor'
```
Answer: Double-check the installation instructions above.
Question: I am trying to delete things, but I get this error:
```
assert node is None, node
AssertionError: 
```
Answer: You have ast.Pass but want ast.Pass() instead (note the parentheses). However, you may actually not want to "delete" things like this at all (look around for other hints).
Question: When I run my mutate.py, I get this error when trying to print out my mutants:
```
"AttributeError: 'Assign' object has no attribute 'value' "
```
Answer: This error occurs when the AST library tries to pretty-print your abstract syntax tree data structure back out to (ASCII) Python source. The AST library expects each node object to have certain fields (properties) filled out. Sometimes a node operation that can look totally reasonable to you (e.g., making a new node, setting it as the child of some other node) can result in a node object not having all of the fields set. If you are changing the types of nodes directly, I would recommend that you instead use their friendly constructors — that may well end up fixing this problem for you. (This is explicitly a bit of a challenge; dig around first and get used to the AST library before asking for help here.)
Question: How can I get the parts of an AST node? Suppose I want to figure out the identified on the left-hand side of an assignment:
```
Assign(targets=[Name(id='x', ctx=Store())], value=Num(n=9))
```
Answer: Check out the node documentation. In this particular example, some students report success with node.targets[0].id to extract the id from an instance of the the Assign class.

Similarly, suppose you want to change the right-hand side of the assignment to "None", but leave the left-hand side alone. Have your node transformer do something like:
```
return ast.copy_location(ast.Assign(targets=node.targets, value=ast.NameConstant(value=None)), node)
```
Question: I'm getting errors like
```
AttributeError: module 'ast' has no attribute 'Constant' 
```
or
```
AttributeError: 'Classdef' object has no attribute 'starargs'
 
```
Answer: You'll want to make sure you're using the same versions of Python and the AST library as the autograder.
Question: Some online documentation makes a big deal about generic_visit, but I don't really understand it. Can you elaborate?

Answer: To some degree, using it or not is your choice. Suppose you want to change every assignment from "x = y" to "x = y + 1".

In a language like C, you can write "a = b = 3". (Don't ever do this.) The interpretation of this statement is that you first assign b=3, and then that result (3 in this case) is also assigned into a.

If you do not do the "generic visit" thing and instead replace the top-level assignment only, you'll end up with something like this:
```
a = (b = 3) + 1
```
[ After this program, b=3 and a=4. ]

If you do the "generic visit" thing and recursively transform the child nodes as well, you'll end up with something like this:
```
 
a = (b = 3 + 1) + 1
```
[ After this program, b=4 and a=5. ]

Which one do you want? It's your choice. If you have a theory about which one is a better mutation operator, do that. If you don't care, it's safe to default to the full recursive handling of all sub-nodes.