Plagiarism Prevention and Detection

Demonstration of CodeMatch

CodeMatch is a commercial source-code plagiarism detection tool produced as part of the CodeSuite software by Zeidman Consulting. Its vendor claims that it uses sophisticated algorithms for detecting plagiarism in computer source code. We used a free evaluation version of the software to run CodeMatch on a Java corpus.

Process of detecting plagiarism using CodeMatch

Initially, the user selects the folders that contain the source code for comparison and the desired settings. Because all the files to be compared were located in subfolders of a single folder, that folder was selected twice (see Figure 1).
Figure 1: CodeMatch initial screen

When the user presses the ‘COMPARE’ button, the comparison between the submitted files begins (see Figure 2).
Figure 2: CodeMatch - file comparison status

Results are then saved as a CodeMatch database file, and to view them the user must convert the database into an HTML file. For each submitted file, CodeMatch returns the most similar files, with the number returned determined by the threshold selected. For example, for each of the 51 files we submitted we received a list of the 8 most similar files detected by CodeMatch (we set the threshold value to 8, see Figure 1). Figure 3 below shows an extract of the results from the HTML file, showing the first four of the 51 groups of files returned.

Figure 3: CodeMatch extract from results (entries surviving from the extract: 37.java, 38.java)

CodeMatch does not have an integrated facility for viewing entire source-code files, so we had to use another programming tool to view and visually compare the files. After examining the 8 most similar files returned for file 37, we found that files 37 and 10 are not a suspicious pair, although they do share some similarities due to the nature of the programming assignment. CodeMatch nevertheless flagged file 10 as suspicious and gave it a score of 93. The similarities found by CodeMatch for files 37 and 10 are shown in Figure 4 below. They are presented as short individual lines of code, which makes it rather difficult for the user to gain a clear view of the similar code fragments detected and of the overall similarity between the files in question.

Comparing file1 (37.java) to file2 (10.java), we have the following Matching statements:
Figure 4: CodeMatch results

Looking at the results for all 51 files, CodeMatch has given high values to many non-similar files. This is because the similarity values are relative rather than absolute. As a result, an academic looking at the full set of results cannot be selective about which groups of files to inspect, because the similarity values of the top 8 files returned for each file are all relatively high. We have created a chart (see Figure 5) to show the average similarity scores for each of the 51 results returned.
Figure 5: Average scores for each of the 51 files returned

Statistics of average values
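The averaging behind a chart like Figure 5 can be sketched as follows. This is a minimal illustration, not CodeMatch functionality. The `results` mapping, the file names, the scores, and the function name `average_scores` are all hypothetical stand-ins for values that would be read from the HTML report.

```python
from statistics import mean


def average_scores(results):
    """Return the mean similarity score of the matches reported for each file.

    `results` maps a file name to a list of (other_file, score) pairs,
    for example the top matches listed for that file in a report.
    Files with no matches are skipped.
    """
    return {
        name: mean(score for _, score in matches)
        for name, matches in results.items()
        if matches
    }


# Hypothetical scores for illustration only; they are not taken from the report.
results = {
    "a.java": [("b.java", 93), ("c.java", 90), ("d.java", 87)],
    "b.java": [("a.java", 90), ("c.java", 88), ("d.java", 86)],
}

averages = average_scores(results)
for name, avg in sorted(averages.items()):
    print(f"{name}: {avg:.1f}")
```

When every average is similarly high, as in this made-up data, the averages give no basis for deciding which groups to inspect first, which is the difficulty described above.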
Finally, in student assignments where similarities (such as identifier names, method names, and similar lines of code) are likely to arise from the nature of the programming task rather than from plagiarism, CodeMatch appears to be rather unsuitable for pointing out the suspicious files.
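To illustrate why assignment-dictated names can inflate a similarity score, here is a small sketch using a naive identifier-overlap (Jaccard) measure. This is an assumption made for illustration only: it is not CodeMatch's algorithm, which is not described here, and the two Java fragments are hypothetical.

```python
import re


def identifiers(source):
    """Return the set of identifier-like tokens (including keywords) in a string."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two hypothetical solutions to the same assignment. The class, field and
# method names are fixed by the task, but the method bodies differ.
file_a = "class Account { int balance; void deposit(int amount) { balance += amount; } }"
file_b = (
    "class Account { int balance; void deposit(int amount) "
    "{ if (amount > 0) balance = balance + amount; } }"
)

score = jaccard(identifiers(file_a), identifiers(file_b))
print(f"identifier overlap: {score:.3f}")
```

The two fragments share every name that the task prescribes, so this naive measure rates them as highly similar even though the method bodies were written differently.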