Archive for the ‘Algorithms’ Category

A(nother) duplicate file finder

Thursday, June 3rd, 2010

Goodness knows the world doesn’t need another duplicate file finder… but here one is!

Download find_duplicates-1.0.tar.gz

This should run on Python 2.6 and up. Usage is python find_duplicates.py DIRECTORY. If you just want to copy-paste or ogle the source, here it is:

#! /bin/env python
"""This program finds files under a given directory which share the same
contents.
 
This program works by forming a list of one group containing the filepaths of
the files the user want to inspect for duplicates. This list is examined and
pruned multiple times; each time the groups are partitioned into subgroups
using progressively more accurate (and slower) duplicate-checking strategies
until the remaining groups represent sets of duplicate files.
"""
import collections
import hashlib
import os
 
def make_initial_group(base_dir):
    """Recurse into the directory at base_dir, returning all the file's
    filepaths as a list within a list."""
    groups = [[]]
    for dirpath, _, filenames in os.walk(base_dir):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            if not os.path.islink(filepath):
                groups[0].append(filepath)
    return groups
 
def regroup_using_strategy(groups, strategy_func):
    """Refines the groups by using the output of strategy_func(filepath) as a
    key for regrouping the files. (which must be equal for duplicate files). 
 
    The keys returned by strategy_func(filepath) for duplicate files must be
    equal. Equivalently, keys must only differ for files that are different. An
    example strategy_func returns the filesize for the file located at
    filepath.
    """
    for group in groups:
        files_by_key = collections.defaultdict(list)
        for filepath in group:
            try:
                key = strategy_func(filepath)
            except IOError:
                continue
            files_by_key[key].append(filepath)
        for group in files_by_key.itervalues():
            yield group
 
def prune_groups(groups):
    """Remove groups whose size is too small to contain duplicates."""
    return (group for group in groups if len(list(group)) >= 2)
 
def print_groups(groups):
    """Print the filepaths in each group, separating each group by a blank
    line.
    """
    for group in groups:
        for filepath in group:
            print filepath
        print
 
def hash_file(filepath):
    """Compute and return the hash for the file located at filepath."""
    with open(filepath, "rb") as file_:
        return hashlib.md5(file_.read()).digest()
 
 
if __name__ == "__main__":
    import sys
    if len(sys.argv) != 2:
        sys.__stderr__.write("{0}: Usage: {0} directory\n"
                .format(sys.argv[0]))
        sys.exit(-1)
    base_dir = sys.argv[1]
    groups = make_initial_group(base_dir)
    groups = prune_groups(regroup_using_strategy(groups, os.path.getsize))
    groups = prune_groups(regroup_using_strategy(groups, hash_file))
    print_groups(groups)

This program runs quickly—it is in league with other popular tools of it’s kind, but it is still somewhat naive. I have plans for a faster algorithm which does not use hash algorithms and mitigates risk of false positives due to hash collisions.

Riemann Sums with Python

Sunday, May 9th, 2010

I use this when I don’t feel like symbolically integrating (or am unable to).

def riemann_sum(function, interval, n, rule="middle"):
    start = float(interval[0])
    end = float(interval[1])
    dx = (end - start)/n
    if rule == "middle":
        x_values = (start + i*dx + dx/2 for i in range(n))
    elif rule == "left":
        x_values = (start + i*dx for i in range(n))
    elif rule == "right":
        x_values = (start + (i+1)*dx for i in range(n))
    return dx * sum(function(x) for x in x_values)

Here’s a usage example:

from math import sqrt
def func(x):
    return sqrt(x+2)
print "Interval: (1, 4), Sections: 16"
print "Left: {0}".format(riemann_sum(func, (1, 4), 16, rule="left"))
print "Midpoint: {0}".format(riemann_sum(func, (1, 4), 16))
print "Right: {0}".format(riemann_sum(func, (1, 4), 16, rule="right"))

Multiplication in Python ― without actually multiplying!

Sunday, January 31st, 2010

This is just something interesting I wrote last year. It will multiply two integers using an algorithm which is sometimes called “russian peasant multiplication”.

def peasant_multiply(a, b):
    '''Multiply the floor of 'a' and 'b' using only addition, bitshifts, and the binary AND operator.'''
    a = int(a)
    b = int(b)
    product = 0
    while a > 0:
        if a & 1:
            product = product + b
        a = a >> 1
        b = b << 1
    return product