Copied and pasted code is usually bad

But it can be hard to find, especially in a large project. So we wrote a utility - CPD - to find it for us. It's been through three major incarnations:

  • First we wrote it using a variant of Michael Wise's Greedy String Tiling algorithm (our variant is described here)
  • Then it was completely rewritten by Brian Ewins using the Burrows-Wheeler transform
  • Finally, it was rewritten by Steve Hawkins to use the Karp-Rabin string matching algorithm.

Each rewrite made it much faster, and now it can process the JDK 1.4 java.* packages in about 4 seconds (on my workstation, at least).

Here's a screenshot of CPD after running on the JDK java.lang package.

Note that CPD works with Java, C, C++, and PHP code.

If you have Java Web Start, you can run CPD by clicking here.

Here are the duplicates CPD found in the JDK 1.4 source code.

Here are the duplicates CPD found in the APACHE_2_0_BRANCH branch of Apache (just the httpd-2.0/server/ directory).

Andy Glover wrote an Ant task for CPD; here's how to use it:


<target name="cpd">
    <taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask" />
    <cpd minimumTokenCount="100" outputFile="/home/tom/cpd.txt">
        <fileset dir="/home/tom/tmp/ant">
            <include name="**/*.java"/>
        </fileset>
    </cpd>
</target>

       

There's a ignoreLiterals="true" option which makes CPD ignore literal value differences when evaluating a duplicate block. This means that foo=42; and foo=43; will be seen as equivalent. You may want to run PMD with this option off to start with and then switch it on to see what it turns up. There's also a ignoreIdentifiers="true" option that does the same thing with identifiers; i.e., variable names, methods names, and so forth. The same guidelines apply.

Also, you can get verbose output from this task by running ant with the -v flag; i.e., ant -v -f mybuildfile.xml cpd.

There's also a JavaSpaces version available for splitting the CPD effort across a farm of machines. I usually post news on that here and the releases are here. This project is pretty much dead, though, since the current code is fast enough to just run it on one machine.

Suggestions? Comments? Post them here. Thanks!