Plagiarism Prevention and Detection

Demonstration of JPlag

JPlag is a plagiarism detection tool that detects similarities among source-code files. It was developed by Guido Malpohl in 1996 and currently supports Java, C#, C, C++, Scheme, and natural language text. JPlag is free to use, but users are required to create an account. It uses a variation of the Running Karp-Rabin Greedy String Tiling algorithm developed by Wise, with added optimizations that improve its run-time efficiency.
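To make the idea of tiling concrete, the following is a minimal sketch of plain Greedy String Tiling over two token sequences. It is an illustration only: it omits the Karp-Rabin hashing and JPlag's own optimizations, and the function name `greedy_string_tiling` and the parameter `min_match_length` are hypothetical names chosen for this example, not part of JPlag.

```python
def greedy_string_tiling(a, b, min_match_length=3):
    """Return tiles (start_a, start_b, length) of common, non-overlapping token runs.

    Simplified illustration: brute-force search, no hashing.
    """
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match_length
        candidates = []
        # Find all maximal matches of unmarked tokens of the current longest length.
        for i in range(len(a)):
            if marked_a[i]:
                continue
            for j in range(len(b)):
                if marked_b[j]:
                    continue
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    candidates = [(i, j, k)]
                    max_len = k
                elif k == max_len:
                    candidates.append((i, j, k))
        if not candidates:
            break
        # Turn the matches into tiles, skipping any that overlap a tile
        # already placed in this round.
        for i, j, k in candidates:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for d in range(k):
                marked_a[i + d] = True
                marked_b[j + d] = True
            tiles.append((i, j, k))
    return tiles


tokens_a = "A B C D E F".split()
tokens_b = "X A B C Y D E F".split()
tiles = greedy_string_tiling(tokens_a, tokens_b, min_match_length=3)
```

In this example the two sequences share the runs `A B C` and `D E F`, so two tiles of length 3 are found even though an extra token separates them in the second sequence. Raising `min_match_length` to 4 would yield no tiles, which shows how a minimum match length acts as a sensitivity setting.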

Process of detecting plagiarism using JPlag

JPlag compares source-code files submitted in folders or as single files. If each student’s work is stored in a separate folder, JPlag returns a similarity score between the folders that contain the suspicious files. If each student’s work is stored as a single file, JPlag returns a similarity score between the suspicious files themselves.

The user begins the comparison process by selecting the folder that contains the student work to be compared (see Figure 1). In this example, each student’s work is stored as a separate folder in the CS123Ass1 subdirectory.

Figure 1: JPlag Submission Selection

Pressing the ‘Submit’ button begins the comparison between the files (see Figure 2).

Figure 2: JPlag Detection Process

The results are displayed in the form of a histogram, as shown in Figure 3. One of the submissions did not parse, so JPlag could not include it in the comparison. This is the main drawback of JPlag: files that do not parse are excluded from the comparison.

Search Results
Title: CS123
Directory: C:\CS123\CS123Ass1
Programs: 011 - 062 - 126 - 128 - 165 - 273 - 318 - 373 - 426 - 481 - 802 - 812 - 886 - 912 - 955 - 995
Language: Java1.2 Parser
Submissions: 16 (1 has not been parsed successfully)
Invalid submissions (see log file) : 527
Matches displayed: 20 (Treshold: 50.0%) (average similarity)
20 (Treshold: 52.0%) (maximum similarity)
Date: 2007-07-04
Minimum Match Length (sensitivity): 7
Suffixes: .java, .jav, .JAVA, .JAV

Distribution:

90% - 100% 1 ##
80% - 90% 0 .
70% - 80% 5 ###########
60% - 70% 6 #############
50% - 60% 8 #################
40% - 50% 14 ##############################
30% - 40% 18 #######################################
20% - 30% 34 ###########################################################################
10% - 20% 22 ################################################
0% - 10% 12 ##########################

Matches sorted by average similarity:

812 -> 165 (100.0%), 912 (79.4%), 426 (74.5%), 273 (67.2%), 011 (58.4%), 126 (51.4%)
165 -> 912 (79.4%), 426 (74.5%), 273 (67.2%), 011 (58.4%), 126 (51.4%)
011 -> 955 (76.4%), 273 (64.2%), 426 (59.8%), 912 (56.7%)
426 -> 912 (69.7%), 273 (69.1%)
273 -> 912 (63.1%), 955 (50.7%)
128 -> 126 (50.0%)

Figure 3: JPlag results
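The result page distinguishes between average and maximum similarity and groups the pair scores into 10% bands. The sketch below shows one way such figures could be computed from token counts. The two formulas are an assumption based on how these measures are commonly described for JPlag (matched tokens relative to both submissions, or to the shorter one); this report does not state them, and the function names are hypothetical.

```python
def average_similarity(matched, len_a, len_b):
    """Percentage of matched tokens relative to both submissions (assumed formula)."""
    return 100.0 * 2 * matched / (len_a + len_b)


def maximum_similarity(matched, len_a, len_b):
    """Percentage of matched tokens relative to the shorter submission (assumed formula)."""
    return 100.0 * matched / min(len_a, len_b)


def distribution(similarities, bar_width=75):
    """Count scores per 10% band and render text bars, highest band first."""
    counts = [0] * 10
    for s in similarities:
        # A score of exactly 100% belongs to the top band (90% - 100%).
        counts[min(int(s // 10), 9)] += 1
    peak = max(counts) or 1
    lines = []
    for band in range(9, -1, -1):
        # Empty bands are shown as a single dot.
        bar = "#" * round(counts[band] * bar_width / peak) or "."
        lines.append(f"{band * 10}% - {band * 10 + 10}% {counts[band]} {bar}")
    return counts, lines


counts, lines = distribution([100.0, 79.4, 74.5, 5.0])
print("\n".join(lines))
```

Under these assumed formulas, a small submission copied wholesale into a much larger one would score low on average similarity but high on maximum similarity, which is one reason a tool might report both.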

Now assume that we want to check for similarities between folders 812 and 165 (the folder names are student ID numbers). We simply click on 165 in the 812 grouping, and the comparison is displayed (see Figure 4).

Figure 4: JPlag suspicious file comparison screen

The total number of matched tokens is shown in the small table at the top right, and the user can click on an entry to view the similar code fragments corresponding to those tokens. Each similar code fragment is shown in a different colour, and the black arrows are clickable so that the user can jump to the matching fragment in the other file. This is particularly useful when the similar fragments do not occupy the same position in the two files, because clicking on the arrow aligns them.

If each student’s work is stored as a single file, then JPlag will return a similarity score between the suspicious files detected. The results are shown below in Figure 5.

Search Results
Title: NEW
Directory: C:\CS123\JPLAG
Programs: 1.java - 10.java - 11.java - 12.java - 13.java - 14.java - 15.java - 16.java - 17.java - 18.java - 19.java - 2.java - 20.java - 21.java - 22.java - 23.java - 24.java - 25.java - 26.java - 27.java - 28.java - 29.java - 3.java - 30.java - 31.java - 32.java - 33.java - 34.java - 35.java - 36.java - 37.java - 38.java - 39.java - 4.java - 42.java - 43.java - 44.java - 45.java - 46.java - 47.java - 48.java - 49.java - 5.java - 50.java - 51.java - 6.java - 7.java - 8.java - 9.java
Language: Java1.2 Parser
Submissions: 49 (2 have not been parsed successfully)
Invalid submissions (see log file) : 40.java - 41.java
Matches displayed: 20 (Treshold: 92.3%) (average similarity)
20 (Treshold: 94.0%) (maximum similarity)
Date: 2007-07-04
Minimum Match Length (sensitivity): 7
Suffixes: .java, .jav, .JAVA, .JAV

Distribution:

90% - 100% 27 ##
80% - 90% 8 #
70% - 80% 6 #
60% - 70% 13 #
50% - 60% 14 #
40% - 50% 24 ##
30% - 40% 50 #####
20% - 30% 142 ###############
10% - 20% 207 ######################
0% - 10% 685 ###########################################################################

Matches sorted by average similarity:

29.java -> 20.java (100.0%), 35.java (100.0%), 23.java (100.0%)
28.java -> 22.java (100.0%), 37.java (98.0%), 34.java (96.2%), 25.java (96.1%), 19.java (92.3%)
42.java -> 21.java (100.0%)
35.java -> 20.java (100.0%), 23.java (100.0%)
30.java -> 24.java (100.0%)
23.java -> 20.java (100.0%)
39.java -> 27.java (99.5%)
22.java -> 37.java (98.0%), 34.java (96.2%), 25.java (96.1%)
17.java -> 14.java (96.1%)
37.java -> 34.java (95.5%), 25.java (94.1%)

Figure 5: JPlag results histogram

The user can view the suspicious file pairs from the results list. For example, in the first grouping of results in Figure 5, clicking on file 20 opens a new window displaying files 29 and 20 with the suspicious source-code fragments clearly highlighted. An extract of this comparison is shown in Figure 6.

Figure 6: JPlag suspicious file comparison screen

JPlag ran very quickly and returned only the suspicious pairs of files. The tool lets the user view the detected files in full and marks the suspicious source-code fragments clearly by colour. It also provides easy navigation between suspicious fragments in files that contain similar code rearranged into different positions.

One of the drawbacks of JPlag is that it cannot handle files that do not parse. Because of this, JPlag missed the suspicious file pair 37 and 40. A nice feature of JPlag is that it displays groupings of the suspicious files found, which is very useful for catching suspected plagiarism among groups of students. A minor drawback is the display of results. Consider, for example, the 2nd, 8th and 10th groupings in Figure 5, which show the similar files detected for files 28, 22 and 37; these groupings are also shown in Figure 7 below. Groupings 22 and 37 are rather repetitive, since the files listed in them are also listed in grouping 28. In larger datasets, where more suspicious groupings are detected, these extra groupings may make the results list needlessly long.

28.java -> 22.java (100.0%), 37.java (98.0%), 34.java (96.2%), 25.java (96.1%), 19.java (92.3%)
22.java -> 37.java (98.0%), 34.java (96.2%), 25.java (96.1%)
37.java -> 34.java (95.5%), 25.java (94.1%)

Figure 7: Extract from Figure 5
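One way to shorten such a list would be to post-process the reported pairs and merge overlapping groupings into a single cluster. The sketch below treats each reported pair as an edge in a graph and returns the connected components. This is a hypothetical post-processing step, not a JPlag feature, and the function name `cluster_matches` is chosen for illustration. The pairs are taken from the groupings in Figure 7, plus the pair 42.java and 21.java from Figure 5.

```python
from collections import defaultdict


def cluster_matches(pairs):
    """Group files into clusters: files linked by any chain of reported pairs."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first search collects every file reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


pairs = [
    ("28.java", "22.java"), ("28.java", "37.java"), ("28.java", "34.java"),
    ("28.java", "25.java"), ("28.java", "19.java"),
    ("22.java", "37.java"), ("22.java", "34.java"), ("22.java", "25.java"),
    ("37.java", "34.java"), ("37.java", "25.java"),
    ("42.java", "21.java"),
]
clusters = cluster_matches(pairs)
```

Here the three groupings for files 28, 22 and 37 collapse into one cluster of six files, and the pair 42 and 21 forms a second cluster. The trade-off is that a cluster may contain two files that were never reported as a direct match, so the individual pair scores would still need to be consulted.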

Overall, JPlag was very straightforward to use and allowed for easy comparison and viewing of the suspicious files. The user interface displaying the results was easy to use, with suspicious code fragments clearly highlighted, allowing the academic to compare the files in question and reach a decision.