Thursday, February 23, 2006

A note on how plagiarism.org may work

This is a continuation of the entry immediately below.

Based on the description given on the plagiarism.org web page and on the examples shown in the demo I attended, my guess about how plagiarism.org's comparison algorithm works is as follows.

Since early in the history of computing a program called diff has been available on Unix. That program compares two files and, making the best match between them, displays the minimal changes (insertions and deletions) that would transform one to the other, i.e., the differences between them. This is the same algorithm that one uses when one compares one file to another in, for example, Microsoft Word.

But plagiarism.org can't possibly run diff between a submitted paper and every site on the internet. So to cut down on the number of file-to-file comparisons they must make, they construct what they call a fingerprint of the paper. They don't say what the fingerprint consists of, but I would guess it is a list of the words in the document that are relatively infrequent. Then they look for other documents (or web pages) that include those words. They could do a Google search for that. Given the result of that search, they then run diff between each of the found documents and the submitted paper.

No comments: