Plagiarism Prevention and Detection

Demonstration of CodeMatch

CodeMatch is a commercial source-code plagiarism detection tool distributed as part of the CodeSuite software by Zeidman Consulting. Its vendor claims that it uses sophisticated algorithms for detecting plagiarism in computer source code. We ran a free evaluation version of CodeMatch on a Java corpus.

Process of detecting plagiarism using CodeMatch

First, the user selects the folders that contain the source code to be compared, together with the desired settings. Because all of our files were located in a single folder whose subfolders contained the source-code files, we selected the same folder twice (see Figure 1).

Figure 1: CodeMatch initial screen

When the user presses the ‘COMPARE’ button, the comparison between the submitted files begins (see Figure 2).

Figure 2: CodeMatch - file comparison status

Results are then saved as a CodeMatch database file; to view them, the user must convert the database into an HTML file. For each submitted file, CodeMatch returns the most similar files, up to the number set by the threshold. For example, for each of the 51 files we submitted we received a list of the 8 most similar files detected by CodeMatch (we set the threshold value to 8; see Figure 1). Each list includes the file itself, with a score of 100. Figure 3 below shows an extract of the results from the HTML file, showing the first four of the 51 groups of files returned.

37.java

Score Compared To File
100 37.java
93 10.java
91 40.java
91 28.java
88 22.java
87 7.java
86 31.java
85 25.java

38.java

Score Compared To File
100 38.java
69 23.java
68 29.java
67 35.java
66 32.java
55 20.java
53 41.java
52 17.java


39.java

Score Compared To File
100 39.java
84 21.java
83 3.java
80 27.java
78 42.java
73 6.java
68 48.java
62 15.java


46.java

Score Compared To File
100 46.java
92 10.java
88 7.java
79 43.java
75 22.java
74 25.java
72 37.java
70 1.java

Figure 3: CodeMatch extract from results
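The selection step described above, keeping only the N highest-scoring files for each submitted file, can be sketched as follows. This is an illustration only: `TopMatches` and `topK` are hypothetical names, the scores are those reported for 38.java in Figure 3, and nothing here reflects how CodeMatch computes or stores its scores internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of CodeMatch.
class TopMatches {

    // Returns the names of the k highest-scoring files, highest score first.
    static List<String> topK(Map<String, Integer> scores, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(scores.entrySet());
        // Sort by score, descending.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Scores reported for 38.java in Figure 3, entered in arbitrary order.
        Map<String, Integer> scores = new LinkedHashMap<>();
        scores.put("20.java", 55);
        scores.put("23.java", 69);
        scores.put("38.java", 100);
        scores.put("17.java", 52);
        scores.put("29.java", 68);
        scores.put("35.java", 67);
        scores.put("41.java", 53);
        scores.put("32.java", 66);

        // A threshold of 3 keeps only the three best matches.
        System.out.println(topK(scores, 3)); // [38.java, 23.java, 29.java]
    }
}
```

Because the list is cut off by rank rather than by score, every file receives a full list of matches regardless of how low the scores at the bottom of that list are.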

CodeMatch has no integrated facility for viewing entire source-code files, so we had to use another programming tool to view and visually compare them. After examining the top 8 most similar files returned for file 37, we found that files 37 and 10 are not a suspicious pair, although they do share some similarities due to the nature of the programming assignment. CodeMatch nevertheless flagged file 10 as suspicious and gave it a score of 93. The similarities that CodeMatch found between files 37 and 10 are shown in Figure 4 below. They are reported as short, isolated lines of code, which makes it difficult for the user to get a clear view of the similar code fragments detected and of the overall similarity between the two files.

Comparing file1 (37.java) to file2 (10.java), we have the following

Matching statements:

File1 Line#        File2 Line#    Statement
1                  1              package student
3                  3              import wrabble.*
5                  5              public class Board implements IBoard
14                 12             public Board()
17                 14             setBoardSize(15)
21                 18             public void clearBoard()
23, 76, 141, 169   20, 81, 99     for(int i =0
23, 76, 141, 169   20, 81, 99     i++)
25, 78             21, 82, 100    for(int j =0
25, 78             21, 82, 100    j++)
31                 27             public Tile getTile(int x,int y)
44                 48             public Multiplier getMultiplier(int x,int y)
50                 54             public void setMultiplier(int x,int y,Multiplier multiplier)
57                 60             public int getBoardSize()
63                 66             public void setBoardSize(int size)
70                 73             public String[] getWords()
76                 81, 99         i <getBoardSize()
78                 82, 100        j <getBoardSize()
80                 102            Tile tile = getTile(i,j)
81                 85, 103        if(tile != null)


Matching comments:
*** NONE ***

Matching instruction sequences:

File1 Line# File2 Line# Number of matching statements
9 8 11


Matching identifiers:

15 Board clearBoard getBoardSize getLetter getMultiplier getTile getWords
IBoard length Multiplier setBoardSize setMultiplier setTile size str
String student Tile wrabble        


Partially matching identifiers:

File1 Identifiers
tileboard              
File2 Identifiers
aBoard aMBoard            

Figure 4: CodeMatch results
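To illustrate why two independently written solutions to the same assignment can share many identifiers, the sketch below extracts identifiers from two short fragments and intersects them. It is not CodeMatch's matching algorithm. The class name `SharedIdentifiers`, the abbreviated keyword list and the method bodies are assumptions made for the example; only the class declaration, the `getTile` signature and the field names `tileboard` and `aBoard` come from Figure 4.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; this is not how CodeMatch compares files.
class SharedIdentifiers {

    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // A deliberately small subset of Java keywords, enough for the sample below.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "package", "import", "public", "private", "class", "implements",
            "void", "int", "for", "if", "null", "new", "return"));

    // Collects every non-keyword identifier that occurs in the source text.
    static Set<String> identifiers(String source) {
        Set<String> result = new TreeSet<>();
        Matcher m = IDENTIFIER.matcher(source);
        while (m.find()) {
            if (!KEYWORDS.contains(m.group())) {
                result.add(m.group());
            }
        }
        return result;
    }

    // Returns the identifiers that occur in both source texts.
    static Set<String> shared(String a, String b) {
        Set<String> common = identifiers(a);
        common.retainAll(identifiers(b));
        return common;
    }

    public static void main(String[] args) {
        // Assumed fragments: both implement the interface imposed by the assignment,
        // but each uses its own field name.
        String file1 = "public class Board implements IBoard {\n"
                + "  private Tile[][] tileboard;\n"
                + "  public Tile getTile(int x,int y) { return tileboard[x][y]; }\n"
                + "}";
        String file2 = "public class Board implements IBoard {\n"
                + "  private Tile[][] aBoard;\n"
                + "  public Tile getTile(int x,int y) { return aBoard[x][y]; }\n"
                + "}";

        System.out.println(shared(file1, file2)); // [Board, IBoard, Tile, getTile, x, y]
    }
}
```

Every shared name in this example is dictated by the assignment's interface; the only names the two students chose freely, the fields, differ.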

Looking at the results for all 51 files, CodeMatch assigned high scores to many files that are not similar. This is because the similarity values are relative rather than absolute. As a result, an academic cannot be selective about which groups of files to view, because the similarity values of the top 8 files returned for each file are all relatively high. We created a chart (see Figure 5) showing the average similarity score for each of the 51 groups of results returned.

Figure 5: Average scores for each of the 51 files returned

Statistics of average values

Mean 78.82
Median 80.00
Mode 90.13
Range 38.75
Minimum 55.50
Maximum 94.25
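The per-file averages can be reproduced from the score lists. The sketch below assumes that each average is taken over all eight listed scores, including the file's self-match of 100. Under that assumption the four groups in Figure 3 give the values shown in the comments, and the 90.125 obtained for 37.java agrees with the 90.13 reported in the statistics. The class and method names are hypothetical.

```java
// Hypothetical illustration of how the per-file averages could be computed.
class AverageScores {

    // Arithmetic mean of the scores in one result group.
    static double average(int[] scores) {
        int sum = 0;
        for (int s : scores) {
            sum += s;
        }
        return (double) sum / scores.length;
    }

    public static void main(String[] args) {
        // Score lists taken from Figure 3 (self-match of 100 included).
        int[] file37 = {100, 93, 91, 91, 88, 87, 86, 85};
        int[] file38 = {100, 69, 68, 67, 66, 55, 53, 52};
        int[] file39 = {100, 84, 83, 80, 78, 73, 68, 62};
        int[] file46 = {100, 92, 88, 79, 75, 74, 72, 70};

        System.out.println(average(file37)); // 90.125
        System.out.println(average(file38)); // 66.25
        System.out.println(average(file39)); // 78.5
        System.out.println(average(file46)); // 81.25

        // Range = Maximum - Minimum, using the values from the statistics table.
        System.out.println(94.25 - 55.50); // 38.75
    }
}
```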

Finally, in student assignments where similarities (such as identifier names, method names and similar lines of code) are likely to arise from the nature of the programming assignment rather than from plagiarism, CodeMatch appears to be rather unsuitable for identifying the suspicious files.