In this assignment you will use two static analysis tools to automatically detect potential defects.
The first static analysis tool is Facebook's Infer, which focuses on memory errors, leaks, race conditions, and API issues. Infer is open source.
The second tool is Cppcheck, a static analysis tool specifically for C/C++ programs. It can detect a wide range of bugs, including those related to bad practice, correctness, and security. Cppcheck is open source.
You may work with a partner for this assignment. If you do you must use the same partner for all sub-components of this assignment. Use Gradescope's partner selection feature. Only one partner needs to submit the report on Gradescope, but if you both do, nothing fatal happens.
You should use the setup from HW0 to run Infer.
As an optional alternative, many users report that Facebook's Infer tool does not run on the Windows Subsystem for Linux (WSL) or similar shortcuts for using Ubuntu- or Linux-like interfaces. Headless Virtual Box configurations (instructions) are reported to work very well. Officially, however, the HW0 setup is the supported configuration for the class.
It is your responsibility to download, compile, run and analyze the subject program and associated tools (or use the precompiled one: we recommend using the precompiled version since it is known to work with the HW0 setup). Getting the code and tools to work in some manner is part of the assignment. You can post on the forum for help and compare notes bemoaning various architectures (e.g., windows vs. mac vs. linux, etc.). Ultimately, however, it is your responsibility to read the documentation for these programs and tools and use some elbow grease to make them work.
We will make use of the lighttpd webserver (pronounced "lighty"), version 1.4.17, as one subject program for this homework. A local mirror copy of lighttpd-1.4.17.tar.gz is available, but you can also get it from the original website. It is about 55,000 lines of code in about 90 files. While somewhat small for this class, some analysis tool licenses have LOC limits or scalability issues, so it was chosen as an indicative compromise.
While not as large or popular as Apache, at various points lighttpd has been used by YouTube, xkcd and Wikimedia. Much like Apache, old versions of it have a number of known security vulnerabilities.
The Common Vulnerabilities and Exposures system is one approach for tracking security vulnerabilities. A CVE is basically a formal description, prepared by security experts, of a software bug that has security implications.
There are at least ten CVEs associated with lighttpd 1.4.17 tracked in various lists (such as cvedetails or mitre). For example, CVE-2014-2324 has the description "Multiple directory traversal vulnerabilities in (1) mod_evhost and (2) mod_simple_vhost in lighttpd before 1.4.35 allow remote attackers to read arbitrary files via a .. (dot dot) in the host name, related to request_check_hostname." You can dig into the information listed in, or linked from, a CVE (or just look at subsequent versions of the program where the bug is fixed!) to track down details. Continuing the above example, mod_evhost refers to source file mod_evhost.c, mod_simple_vhost refers to file mod_simple_vhost.c, and request_check_hostname is in file request.c. You will need such information when evaluating whether or not a tool finds these security bugs.
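For example, once the lighttpd source is unpacked, a quick search can confirm where a CVE's named components live. The sketch below assumes the usual src/ layout of the lighttpd tarball; adjust the paths if your copy differs:

$ cd lighttpd-1.4.17
# Locate the source files named in CVE-2014-2324
$ ls src/mod_evhost.c src/mod_simple_vhost.c
# Find the function mentioned in the CVE description
$ grep -rn "request_check_hostname" src/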
ActiveMQ is a message broker, written in Java, which supports multiple industry standards. We will make use of it, version 5.15.0, as another subject program for this homework. A local mirror copy of activemq-version.tar.gz is available, which you can also obtain from the original website.
There are at least ten CVEs associated with activemq-5.15.0. For example, CVE-2023-46604 has the description "The Java OpenWire protocol marshaller is vulnerable to Remote Code Execution. This vulnerability may allow a remote attacker with network access to either a Java-based OpenWire broker or client to run arbitrary shell commands by manipulating serialized class types in the OpenWire protocol to cause either the client or the broker (respectively) to instantiate any class on the classpath. Users are recommended to upgrade both brokers and clients to version 5.15.16, 5.16.7, 5.17.6, or 5.18.3 which fixes this issue." You can find details about CVEs, similar to lighttpd above.
We will make use of the JSON-C library as one subject program for this homework. A local precompiled copy of json-c-with-build.zip is available, but you can also get it from the github repo (via git clone). It is about 15,000 lines of code in about 70 files. (As above, we focus on slightly smaller projects for this assignment.)
JSON-C is a C library for working with JSON (JavaScript Object Notation) data. It lets C programs easily create, parse, and manipulate JSON structures like objects, arrays, strings, and numbers. It’s widely used in system software and embedded environments where C is common but JSON is the preferred data format for configuration, logging, or APIs. For this assignment, you will need to precompile JSON-C before running the static analysis tools. We provide a local mirror copy: here. If you'd prefer, you may also download and compile JSON-C yourself for full credit.
Once you have downloaded and unzipped JSON-C, you will need to compile it in your instance:
$ sudo apt update
$ sudo apt install -y cmake make gcc
# We will be putting JSON-C precompiled files in a build directory inside JSON-C
$ mkdir build
$ cd build
$ cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ..
You will know you did it right if you get output that looks something like this:
-- Performing Test BSYMBOLIC_WORKS - Success
-- Performing Test VERSION_SCRIPT_WORKS
-- Performing Test VERSION_SCRIPT_WORKS - Success
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
Warning: doxygen not found, the 'doc' target will not be included
CMake Deprecation Warning at apps/CMakeLists.txt:2 (cmake_minimum_required):
  Compatibility with CMake < 3.5 will be removed from a future version of CMake.

  Update the VERSION argument value or use a ... suffix to tell CMake that
  the project does not need compatibility with older versions.

-- Wrote /home/ubuntu/json-c/build/apps_config.h
-- Configuring done (7.8s)
-- Generating done (0.1s)
-- Build files have been written to: /home/ubuntu/json-c/build
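Before moving on, you can also confirm that the compilation database was generated; compile_commands.json is the file Infer will read later. If it is missing, re-run the cmake command above.

# Still inside json-c/build/
$ ls -lh compile_commands.json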
You should now be ready to run Infer and Cppcheck on the project.
The Infer tool is a static analyzer — it detects bugs in programs without running them. The primary website is fbinfer.com.
Unfortunately, some versions of Infer can be obnoxious to build and install, despite their handy installation guide. Also, many users report that Infer does not run on Windows Subsystem for Linux (WSL) or similar setups; a headless Virtual Box configuration (instructions) is recommended.
Instead (but see above about "your responsibility"), a precompiled, runs-on-the-HW0-setup version of Infer is available locally here. Once you have transferred and unpacked it, the main binary can be found at infer-linux-x86_64-v1.2.0/bin/infer. You can use either the pre-compiled one or compile it yourself for full credit (any version at all of Infer is full credit).
Once you have Infer built or downloaded, applying it to lighttpd should be as simple as:
$ cd lighttpd-1.4.17
$ sh configure
$ /path/to/infer/bin/infer run -- make
You may see an "ASCII Art" progress bar while Infer is running. Infer should ultimately produce output similar to (but everything is fine if you get very different numbers):
Found 100 issues
  (console output truncated to 5, see '/home/ubuntu/hw4/lighttpd-1.4.17/infer-out/report.txt' for the full list)

Issue Type(ISSUED_TYPE_ID): #
  Memory Leak(MEMORY_LEAK_C): 47
  Dead Store(DEAD_STORE): 26
  Uninitialized Value(PULSE_UNINITIALIZED_VALUE): 25
  Null Dereference(NULLPTR_DEREFERENCE): 2
(Before you worry about getting different numbers, double-check the prose above: it is fine to get different numbers. Similarly, it is common for this tool to only report a few "types" of defects: if you only see a few "types" of defects, you are running the tool correctly, even if Cppcheck reports more "types" of defects.) You will have to read through the output carefully and analyze the reported defects. Some will be true positives (i.e., real bugs in the code) and some will be false positives (i.e., spurious warnings that do not correspond to real bugs).
Once you have downloaded and unzipped JSON-C to your instance, running Infer on JSON-C is similarly direct.
$ sudo apt install libjson-c-dev
$ cd json-c/build/
$ /path/to/infer/bin/infer run --compilation-database compile_commands.json
...
Found 22 issues

Issue Type(ISSUED_TYPE_ID): #
  Null Dereference(NULLPTR_DEREFERENCE): 9
  Use After Free(USE_AFTER_FREE): 6
  Memory Leak(MEMORY_LEAK_C): 5
  Uninitialized Value(PULSE_UNINITIALIZED_VALUE): 1
  Dead Store(DEAD_STORE): 1
You can find Infer's output in the infer-out folder.
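For example, you can list the generated files and page through the full report (the exact set of files inside infer-out may vary with the Infer version):

$ ls infer-out/
$ less infer-out/report.txt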
Cppcheck, similar to Infer, is also a static program analyzer, but it is specialized to C/C++. That is, while it can detect a wide range of bugs in C/C++ code, it (as of now) cannot analyze programs written in other languages. Cppcheck focuses on detecting undefined behavior, memory errors, and other bugs while minimizing false positives. The official Cppcheck website is https://cppcheck.sourceforge.io and a local copy of the source code can be found here.
Cppcheck is designed as a local analysis tool for individual developers or integration into automated build systems. Its analysis is typically performed on-demand by developers working with the code, producing reports in formats such as plain text.
For this part of the assignment, you will run Cppcheck yourself on the lighttpd codebase and review its output directly. You will use the results for your report. You can install Cppcheck directly on your EC2 instance by running the following:
$ sudo apt-get install cppcheck
Once you have Cppcheck installed, applying it to lighttpd should be as simple as:
$ cd lighttpd-1.4.17/
# This next command may take an hour or two!
$ cppcheck --enable=all --inconclusive --force --std=c11 --suppress=missingIncludeSystem . 2> cppcheck-lighttpd.log
This process may take a while! It will generate a logfile called cppcheck-lighttpd.log in your current directory. We can check the size and length of the logfile.
$ cd lighttpd-1.4.17/
# Check log file size
$ ls -lh cppcheck-lighttpd.log
-rw-rw-r-- 1 ubuntu ubuntu 135K Jul 22 19:17 cppcheck-lighttpd.log
# Do not worry if your file is larger or smaller than 135K.
# Check number of lines in log file
$ wc -l cppcheck-lighttpd.log
2516 cppcheck-lighttpd.log
# Do not worry if your file is larger or smaller than 2516 lines.
You should open this file to examine the warnings and potential issues reported by Cppcheck.
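If you would like a quick overview before reading the whole log, one sketch is to count how often each Cppcheck severity word appears. The exact log format varies between Cppcheck versions, so treat these counts as approximate:

$ grep -oE "\b(error|warning|style|performance|portability|information)\b" cppcheck-lighttpd.log | sort | uniq -c | sort -rn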
Once you have downloaded and unzipped the JSON-C project to your instance, running Cppcheck on JSON-C is similarly direct.
$ cd json-c/
$ cppcheck --enable=all --inconclusive --std=c99 --language=c --force --suppress=missingIncludeSystem . 2> cppcheck-jsonc.log
This will generate a logfile called cppcheck-jsonc.log in your current directory. We can check the size and length of the logfile.
$ cd json-c/
$ ls -lh cppcheck-jsonc.log
-rw-rw-r-- 1 ubuntu ubuntu 65K Jul 24 19:10 cppcheck-jsonc.log
$ wc -l cppcheck-jsonc.log
962 cppcheck-jsonc.log
# Do not worry if your file is larger or smaller.
You should open this file and review the warnings and potential issues reported by Cppcheck.
While ChatGPT can be applied to a program's source code as a static analysis, it does not make use of traditional program semantics information (e.g., abstract syntax trees, dataflow analyses, etc.). Instead, it uses natural language processing (NLP) techniques to analyze code based on learned patterns and examples. Developers might use ChatGPT to identify potential bugs, clarify unfamiliar code, or get suggestions for improvements. You can access ChatGPT at https://chat.openai.com. In this course, you will use the free version.
Note that you can use any other generative AI assistant you prefer, as long as it is free (i.e., costs no money). This is a course requirement associated with equity between students. If you'd like to use another tool, just name it in your report and use it instead wherever this document references ChatGPT.
For this assignment, you will use ChatGPT to analyze potential issues in one source file from each of two projects: lighttpd and json-c. You will then compare and contrast ChatGPT’s output with the results reported by two traditional static analysis tools: Infer and Cppcheck. To ground this comparison, choose a source file that is flagged by Infer or Cppcheck and locate a specific bug reported by one of the tools. Provide that source code (or a relevant excerpt) to ChatGPT and ask whether it detects any bugs.
Your goal is to evaluate how ChatGPT’s response compares to the static analysis report. You will use this for your final report.
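One convenient way to grab an excerpt for ChatGPT is to print the lines surrounding the warning's reported location. The file name and line range below are placeholders only; substitute whatever Infer or Cppcheck actually flagged in your run:

# Example only: use the file and line numbers from your own tool report
$ sed -n '80,140p' src/mod_evhost.c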
Below are some additional subject programs that you may choose to use in this homework. Note that these programs are written in different languages. You may choose one in C/C++ for Cppcheck.
Note that the report requires you to choose an additional program (from the list above) and analyze it.
You must write a detailed PDF report reflecting on your experiences with these static analysis tools. Your report must include your University email address(es). In particular, all of the following are required:
The grading staff will select a small number of excerpts from particularly high-quality or instructive reports and share them with the class. If your report is selected you will receive extra credit.
Students are often anxious about a particular length requirement for this report. Unfortunately, some students include large screenshots and others do not, so raw length counts are not as useful as one might hope. Instead, I will say that in HW4 (and HW6, upcoming) we often see varying levels of "insight" or "critical thinking" from students. I know that's the sort of wishy-washy phrasing that students hate to hear ("How can I show insight?"). But some of the questions (e.g., "what does cost mean in this report?") highlight places where some students give one direct answer and some students consider many nuances. Often considering many nuances is a better fit (but note that if you make things too long you lose points for not being concise or readable -- yes, this is tough).
Let us consider an example from the previous homework. Suppose we had asked you whether mutation testing worked or not. Some students might be tempted to respond with something like "Yes, mutation testing worked because it put test suite A ahead of test suite B, and we know A is better than B because it has more statement coverage." That's a decent answer ... but it overlooks the fact that statement coverage is not actually the ground truth. (It is somewhat akin to saying "yes, we know the laser range finder is good because it agrees with my old bent ruler".) Students who give that direct answer get most of the credit, but students who explain that nuance, perhaps finding some other ways to indicate whether mutation testing worked or not, and what that even means, would get the most credit (and will also have longer reports). Students are often concerned about length, but from a grading perspective, the real factor is the insight provided.
You must also complete a short written reflection about how you collaborated with others and how you made decisions about correctness in this assignment. Download the reflection template, edit it with your answers, and submit the completed file via the autograder.io server.
Important: The reflection report is graded independently from the written report, and by a separate grading team. If you report using AI tools in the reflection report, this will not affect your HW grades. You will only lose points in your written report if your report appears to be AI-generated without meaningful revision.
Submit a single PDF report (HW4a) via Gradescope. You must also submit a separate contribution reflection (HW4b) via the autograder. Your PDF report must include your name and UM email ID (as well as your partner's name and email ID, if applicable). The reflection must be submitted individually.
There is no explicit format (e.g., for headings or citations) required. For example, you may either use an essay structure or a point-by-point list of question answers.
In this section we detail previous student issues and resolutions:
Question: When I try to run infer on lighttpd, it dies when trying to build the first file with an error like:
External Error: *** capture command failed:
*** make
*** exited with code 2
Run the command again with `--keep-going` to try and ignore this error.
Answer: Some students have reported that being careful to run all of the commands, such as with this exact sequence, works:
tar xzf infer-*tar.gz
tar xzf lighttpd-1.4.17.tar.gz
cd lighttpd-1.4.17
sh configure
../infer-*/infer/bin/infer run -- make
Question: When I try to run infer, I get some output but then Fatal error: out of memory. What can I do?
Answer: You may need to assign your EC2 setup more memory (see HW0 for setup). You may also need to choose a different subject program. Some students have reported this when analyzing cpython — perhaps a different program would work for you.
Question: When I try to run infer on libpng, it dies when trying to build the first file with an error like:
External Error: *** capture command failed:
*** make
*** exited with code 2
Run the command again with `--keep-going` to try and ignore this error.
Answer: One student reported that being careful to install all of the required build utilities, such as with this exact sequence, resolved the issue:
sudo apt install make
sudo apt install python-is-python3
Question: When I try to run infer on a program (e.g., lighttpd), it seems to produce no reports or output when I run infer run -- make. Instead, if I look very carefully at the output, hidden near the bottom is a warning like:
** Error running the reporting script:
Answer: You must have your directories set up so that infer/bin/infer is "next to" other files like infer/lib/python/report.py. Infer uses those extra scripts to actually generate human-readable reports. If you tried to copy the infer binary somewhere else, it won't work. Make sure you have all of the components of infer in consistent locations.
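As a quick sanity check (the directory name below follows the precompiled archive described above; yours may differ if you built Infer yourself):

$ ls /path/to/infer-linux-x86_64-v1.2.0/
# bin/ and lib/ should both appear here; keep them together rather than copying bin/infer elsewhere
$ ls /path/to/infer-linux-x86_64-v1.2.0/bin/infer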
Question: I'm not certain why "false positives" and "false negatives" are relevant for comparing the tools. I'm also not certain how we tell if something is a false positive or a false negative. Can you elaborate?
Answer: We can elaborate a bit, but I will note that this aspect of the assignment is assessing your mastery of course concepts. That is, why false positives and false negatives might be important, and how to distinguish between them, are critical software engineering concepts and might come up on the exam as well. You may want to double-check your notes on these, including on the readings. Now for more detail:
Suppose you are able to determine the false positive rate of one tool — or approximate it. For example, suppose you find that Tool #1 produces twice as many false positives as Tool #2. Well, then you might combine that with some of the reading for the class. For example, the FindBugs reading notes "Our ranking and false positive suppression mechanisms are crucial to keeping the displayed warnings relevant and valuable, so that users don’t start ignoring the more recent, important warnings" (among other comments on false alarms), while the Coverity reading notes "False positives do matter. In our experience, more than 30% easily cause problems. People ignore the tool. True bugs get lost in the false. A vicious cycle starts where ..." among other comments on false alarms. You might also check out the Parnin and Orso reading, and so on.
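As a concrete, illustrative calculation (the numbers here are invented): if you manually triage a random sample of 20 Infer reports and judge 8 of them to be spurious, you would estimate a false positive rate of 8/20 = 40% for Infer on that program; repeating the same sampling for Cppcheck would give you a defensible basis for a comparison like the one above.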
Something similar could be considered for false negatives. To give a prose example rather than a reading list this time, a report might include a claim like: "Many developers will dislike a tool that claims to find Race Conditions but actually misses 99% of them. If the tool has that many false negatives, developers will feel they cannot gain confidence in the quality of the software and will instead turn to other techniques, such as testing, that increase confidence in quality assurance." I'm not saying that is a good or a bad argument, but it is an example of the sort of analytic text or line of reasoning that might be applicable here.
Students often wonder: "How do I know if the tool is missing a bug?" Unfortunately, that's a real challenge. There are at least two ways students usually approach that problem, and both require labor or effort. Similarly, determining if a report is a false alarm usually requires reading it and comprehending the code nearby.
I can't really say much more here without giving away too much of what we expect from you on this part of the assignment, but I can reiterate that soundness and completeness (false positives and false negatives) are significant concepts in EECS 481 and that you should include them, coupled with your knowledge of the human element of such tools, in your assessment of the tools.