Measuring Code Similarity

Deadline: Sunday 12 Dec 2021 23:59:59.

Submit via command line:

curl http://jyywiki.cn/upload \
  -F course=ISER2021 \
  -F module=PA1 \
  -F token={{your token}} \
  -F stuid={{student id}} \
  -F stuname={{name (chinese)}} \
  -F file=@{{path to your submission}}

ISER2021-PA1 submission results

1. Background

You may have copied someone else's code during your undergraduate studies. Our study indicates that more than 80% of students were suspected of plagiarism in one particular programming assignment at a 985 university, and 53% were verbatim copies with negligible changes! How could you prevent this from happening (with some software engineering research tricks)? In this experiment, let's have fun being evil and identify possible assignment plagiarism by measuring code similarity.

2. The Assignment

Suppose we want to run a similarity check on programming assignments submitted to an Online Judge, in order to find as many suspected copies as possible. For this experiment, we assume that each program on the Online Judge is a single-file C++ program that can be compiled by gcc or clang with the options -std=c++17 (C++17 standard) and -pedantic (reject non-standard programs).

You can use any major programming language you are familiar with (C/C++/Java/Python/JavaScript/Scala/Rust/Go/...) to implement a command line tool, codesim, that runs under Linux and compares the similarity of two C++ programs. codesim is run from the command line with the filenames of the two programs as arguments, and it outputs a single line containing a floating-point percentage that represents the similarity of the two programs. The percentage does not need to have an exact meaning; it should simply be as high as possible for pairs in which code was actually copied.

$ codesim --help
usage: codesim [-v|--verbose] [-h|--help] code1 code2
$ codesim foo.cc bar.cpp
99.3

We expect you to write a user-friendly command line tool that

follows the basic conventions of a command line tool. For example, your tool may be called by other scripts, so please do not print any additional information (such as logs) to standard output. A better practice is to provide a -v or --verbose option that prints more information in verbose mode, which can also help you debug.
does not create extra temporary files in the current directory. Linux provides the mktemp family of functions, and every major programming language has an API for this.
does not generate redundant output. Your tool may be used in conjunction with other tools, so make sure that stdout contains only the single line with the similarity percentage. In case of an error, you may omit that line, but please follow command line conventions: write the error message to stderr and exit with a non-zero return code.
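The conventions above can be sketched as a small Python skeleton. This is only an illustration under stated assumptions: the similarity function is a placeholder (Jaccard overlap of non-blank lines), not a recommended metric, and the error-message format is our own choice. Note that argparse generates -h/--help automatically and reports usage errors on stderr with a non-zero exit code.

```python
import argparse
import sys


def similarity(path1: str, path2: str) -> float:
    """Placeholder metric (an assumption): Jaccard overlap of non-blank lines, in percent."""
    def line_set(path):
        with open(path, encoding="utf-8", errors="replace") as f:
            return {ln.strip() for ln in f if ln.strip()}

    a, b = line_set(path1), line_set(path2)
    if not a and not b:
        return 100.0
    return 100.0 * len(a & b) / len(a | b)


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="codesim")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print diagnostics to stderr")
    parser.add_argument("code1")
    parser.add_argument("code2")
    args = parser.parse_args(argv)

    try:
        score = similarity(args.code1, args.code2)
    except OSError as exc:
        # Errors go to stderr, nothing goes to stdout, and the exit code is non-zero.
        print(f"codesim: {exc}", file=sys.stderr)
        return 1

    if args.verbose:
        # Diagnostics also go to stderr so that stdout stays machine-readable.
        print(f"compared {args.code1} with {args.code2}", file=sys.stderr)

    print(f"{score:.1f}")  # the only line ever written to stdout
    return 0

# A script entry point would end with: sys.exit(main())
```

If the tool needs intermediate files, Python's tempfile.TemporaryDirectory() creates them outside the current directory and removes them when the with-block exits.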

We expect you to design or choose an appropriate algorithm (possibly one provided by an existing tool) to implement codesim. clang will make your life easier when parsing (modern) C++ code; alternatively, you can compile the source code and analyze the assembly or binary. Modern programming languages are exceptionally complex, and a simple hand-written implementation is rarely realistic, so try to avoid reinventing the wheel and build your algorithm on top of existing projects.
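As one illustration of the "compile and analyze the assembly" route, the sketch below asks the compiler for assembly text and reduces it to a sequence of instruction mnemonics. The function names are hypothetical, the choice of -O2 is our own, and the filtering assumes x86-64 GNU assembler syntax (labels end with ":", directives start with ".", comments start with "#"); other targets would need different rules.

```python
import subprocess


def compile_to_assembly(path: str, compiler: str = "g++") -> str:
    """Return the assembly emitted for one C++17 source file (written to stdout via -o -)."""
    cmd = [compiler, "-std=c++17", "-pedantic", "-O2", "-S",
           "-x", "c++", path, "-o", "-"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def instruction_mnemonics(assembly: str) -> list[str]:
    """Keep only instruction mnemonics; drop labels, directives, and comments."""
    ops = []
    for line in assembly.splitlines():
        line = line.split("#", 1)[0].strip()  # strip trailing comments
        if not line or line.startswith(".") or line.endswith(":"):
            continue
        ops.append(line.split()[0])
    return ops
```

The resulting mnemonic sequences are insensitive to renamed variables and reformatted source, and they can be fed to any sequence-similarity measure.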

3. The Algorithm

Code similarity and plagiarism detection is not a new problem. We encourage you to review some of the existing literature: copy-paste behavior in software (even in the open source community) is a major cause of the spread of low-quality code and of reduced software maintainability. There are even dedicated methods for detecting code plagiarism, such as the famous MOSS. We encourage you to digest the methods in the existing literature and come up with your own ideas, rather than following their original implementations.
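To make the idea concrete, here is a minimal sketch of k-gram fingerprinting with winnowing, the technique MOSS is based on. It is a simplified illustration, not MOSS itself: the tokenizer is a crude regular expression, the keyword list is deliberately tiny, the parameters k and window are arbitrary, and fingerprints are compared as plain sets without positions. All names here are hypothetical.

```python
import re
import zlib

# Tiny illustrative keyword list (an assumption; real C++ has many more keywords).
KEYWORDS = {"int", "double", "char", "bool", "void", "if", "else",
            "for", "while", "return", "class", "struct"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.S)  # crude: ignores string literals


def normalize(source: str) -> list[str]:
    """Strip comments, then map identifiers to ID and numbers to NUM."""
    tokens = []
    for tok in TOKEN_RE.findall(COMMENT_RE.sub(" ", source)):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def fingerprints(tokens: list[str], k: int = 5, window: int = 4) -> set[int]:
    """Hash every k-gram of tokens and keep the minimum hash of each sliding window."""
    hashes = [zlib.crc32(" ".join(tokens[i:i + k]).encode())
              for i in range(len(tokens) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def code_similarity(src1: str, src2: str) -> float:
    """Jaccard overlap of the two fingerprint sets, as a percentage."""
    t1, t2 = normalize(src1), normalize(src2)
    f1, f2 = fingerprints(t1), fingerprints(t2)
    if not f1 or not f2:  # inputs too short to fingerprint
        return 100.0 if t1 == t2 else 0.0
    return 100.0 * len(f1 & f2) / len(f1 | f2)
```

Because identifiers and literals are normalized before hashing, renaming variables or appending a comment does not change the score, while structurally different programs share few fingerprints.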

4. Submission

Upload the following as a single archive (zip or tar).

Source code for the tool. Make sure that only source code is included and that the libraries your code depends on are readily available. Do not put files that can be generated from the source code (dependencies, binaries, etc.) into the archive; they may cause it to exceed the size limit.
Short compilation instructions, including how dependent libraries are obtained. (Preferably use an existing dependency management system.)
A report in PDF format, written in English, briefly describing your algorithm for code similarity checking and the key implementation techniques. Please describe your algorithm concisely and precisely. The report should be no more than two A4 pages.

This is an open-ended experiment, and there is no absolute, objective criterion for judging it. We will use your tool to compare a set of programs pairwise, rank the pairs by similarity, and check the ranking against our known ground truth.
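To get a feeling for how your tool behaves under this kind of evaluation, you might rank all pairs in a small test set yourself. The sketch below is a hypothetical harness, not the actual grading script: rank_pairs is our own name, and the default score uses difflib.SequenceMatcher from the Python standard library purely as a stand-in for your own similarity function.

```python
from difflib import SequenceMatcher
from itertools import combinations


def rank_pairs(programs: dict[str, str], score=None) -> list[tuple[float, str, str]]:
    """Score every unordered pair of programs and sort with the most similar pair first."""
    if score is None:
        # Stand-in metric (an assumption): character-level similarity ratio in percent.
        def score(a: str, b: str) -> float:
            return 100.0 * SequenceMatcher(None, a, b).ratio()

    pairs = [(score(programs[x], programs[y]), x, y)
             for x, y in combinations(sorted(programs), 2)]
    return sorted(pairs, reverse=True)
```

Pairs that you know to be copies should appear near the top of the returned list; if they do not, the metric probably needs work.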