Goodness knows the world doesn’t need another duplicate file finder… but here one is!
Download find_duplicates-1.0.tar.gz
This should run on Python 2.6 and up. Usage is python find_duplicates.py DIRECTORY. If you just want to copy-paste or ogle the source, here it is:
#! /bin/env python """This program finds files under a given directory which share the same contents. This program works by forming a list of one group containing the filepaths of the files the user want to inspect for duplicates. This list is examined and pruned multiple times; each time the groups are partitioned into subgroups using progressively more accurate (and slower) duplicate-checking strategies until the remaining groups represent sets of duplicate files. """ import collections import hashlib import os def make_initial_group(base_dir): """Recurse into the directory at base_dir, returning all the file's filepaths as a list within a list.""" groups = [[]] for dirpath, _, filenames in os.walk(base_dir): for filename in filenames: filepath = os.path.join(dirpath, filename) if not os.path.islink(filepath): groups[0].append(filepath) return groups def regroup_using_strategy(groups, strategy_func): """Refines the groups by using the output of strategy_func(filepath) as a key for regrouping the files. (which must be equal for duplicate files). The keys returned by strategy_func(filepath) for duplicate files must be equal. Equivalently, keys must only differ for files that are different. An example strategy_func returns the filesize for the file located at filepath. """ for group in groups: files_by_key = collections.defaultdict(list) for filepath in group: try: key = strategy_func(filepath) except IOError: continue files_by_key[key].append(filepath) for group in files_by_key.itervalues(): yield group def prune_groups(groups): """Remove groups whose size is too small to contain duplicates.""" return (group for group in groups if len(list(group)) >= 2) def print_groups(groups): """Print the filepaths in each group, separating each group by a blank line. """ for group in groups: for filepath in group: print filepath print def hash_file(filepath): """Compute and return the hash for the file located at filepath.""" with open(filepath, "rb") as file_: return hashlib.md5(file_.read()).digest() if __name__ == "__main__": import sys if len(sys.argv) != 2: sys.__stderr__.write("{0}: Usage: {0} directory\n" .format(sys.argv[0])) sys.exit(-1) base_dir = sys.argv[1] groups = make_initial_group(base_dir) groups = prune_groups(regroup_using_strategy(groups, os.path.getsize)) groups = prune_groups(regroup_using_strategy(groups, hash_file)) print_groups(groups)
This program runs quickly—it is in league with other popular tools of it’s kind, but it is still somewhat naive. I have plans for a faster algorithm which does not use hash algorithms and mitigates risk of false positives due to hash collisions.