But it can be hard to find, especially in a large project. So we wrote a utility - CPD - to find it for us. It's been through three major incarnations:
Each rewrite made it much faster, and now it can process the JDK 1.4 java.* packages in about 4 seconds (on my workstation, at least).
Here's a screenshot of CPD after running on the JDK java.lang package.
Note that CPD works with Java, C, C++, and PHP code.
If you have Java Web Start, you can run CPD by clicking here.
Here are the duplicates CPD found in the JDK 1.4 source code.
Here are the duplicates CPD found in the APACHE_2_0_BRANCH branch of Apache (just the httpd-2.0/server/ directory).
Andy Glover wrote an Ant task for CPD; here's how to use it:
<target name="cpd">
    <taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask" />
    <cpd minimumTokenCount="100" outputFile="/home/tom/cpd.txt">
        <fileset dir="/home/tom/tmp/ant">
            <include name="**/*.java"/>
        </fileset>
    </cpd>
</target>
There's an ignoreLiterals="true" option which makes CPD ignore literal value differences when evaluating a duplicate block. This means that foo=42; and foo=43; will be seen as equivalent. You may want to run CPD with this option off to start with and then switch it on to see what it turns up. There's also an ignoreIdentifiers="true" option that does the same thing with identifiers; i.e., variable names, method names, and so forth. The same guidelines apply.
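As a rough sketch, the two options can be added as attributes on the cpd element of the target shown earlier. The paths, the target name "cpd-loose" and the output file name are placeholders for illustration, and it is assumed here that both options can be switched on in the same run:

```xml
<!-- Sketch only: same task definition as the earlier example. -->
<target name="cpd-loose">
    <taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask" />
    <!-- ignoreLiterals: foo=42; and foo=43; count as the same code. -->
    <!-- ignoreIdentifiers: differences in variable and method names are ignored too. -->
    <cpd minimumTokenCount="100"
         outputFile="/home/tom/cpd-loose.txt"
         ignoreLiterals="true"
         ignoreIdentifiers="true">
        <fileset dir="/home/tom/tmp/ant">
            <include name="**/*.java"/>
        </fileset>
    </cpd>
</target>
```

Keeping this as a separate target next to the stricter one makes it easy to compare the two reports and see how many extra matches the looser settings turn up.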
Also, you can get verbose output from this task by running ant with the -v flag; i.e., ant -v -f mybuildfile.xml cpd.
There's also a JavaSpaces version available for splitting the CPD effort across a farm of machines. I usually post news on that here and the releases are here. This project is pretty much dead, though, since the current code is fast enough to just run it on one machine.
Suggestions? Comments? Post them here. Thanks!