Archive for the ‘Python’ Category

A(nother) duplicate file finder

Thursday, June 3rd, 2010

Goodness knows the world doesn’t need another duplicate file finder… but here one is!

Download find_duplicates-1.0.tar.gz

This should run on Python 2.6 and up. Usage is python find_duplicates.py DIRECTORY. If you just want to copy-paste or ogle the source, here it is:

#! /bin/env python
"""This program finds files under a given directory which share the same
contents.
 
This program works by forming a list of one group containing the filepaths of
the files the user want to inspect for duplicates. This list is examined and
pruned multiple times; each time the groups are partitioned into subgroups
using progressively more accurate (and slower) duplicate-checking strategies
until the remaining groups represent sets of duplicate files.
"""
import collections
import hashlib
import os
 
def make_initial_group(base_dir):
    """Recurse into the directory at base_dir, returning all the file's
    filepaths as a list within a list."""
    groups = [[]]
    for dirpath, _, filenames in os.walk(base_dir):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            if not os.path.islink(filepath):
                groups[0].append(filepath)
    return groups
 
def regroup_using_strategy(groups, strategy_func):
    """Refines the groups by using the output of strategy_func(filepath) as a
    key for regrouping the files. (which must be equal for duplicate files). 
 
    The keys returned by strategy_func(filepath) for duplicate files must be
    equal. Equivalently, keys must only differ for files that are different. An
    example strategy_func returns the filesize for the file located at
    filepath.
    """
    for group in groups:
        files_by_key = collections.defaultdict(list)
        for filepath in group:
            try:
                key = strategy_func(filepath)
            except IOError:
                continue
            files_by_key[key].append(filepath)
        for group in files_by_key.itervalues():
            yield group
 
def prune_groups(groups):
    """Remove groups whose size is too small to contain duplicates."""
    return (group for group in groups if len(list(group)) >= 2)
 
def print_groups(groups):
    """Print the filepaths in each group, separating each group by a blank
    line.
    """
    for group in groups:
        for filepath in group:
            print filepath
        print
 
def hash_file(filepath):
    """Compute and return the hash for the file located at filepath."""
    with open(filepath, "rb") as file_:
        return hashlib.md5(file_.read()).digest()
 
 
if __name__ == "__main__":
    import sys
    if len(sys.argv) != 2:
        sys.__stderr__.write("{0}: Usage: {0} directory\n"
                .format(sys.argv[0]))
        sys.exit(-1)
    base_dir = sys.argv[1]
    groups = make_initial_group(base_dir)
    groups = prune_groups(regroup_using_strategy(groups, os.path.getsize))
    groups = prune_groups(regroup_using_strategy(groups, hash_file))
    print_groups(groups)

This program runs quickly—it is in league with other popular tools of it’s kind, but it is still somewhat naive. I have plans for a faster algorithm which does not use hash algorithms and mitigates risk of false positives due to hash collisions.

Riemann Sums with Python

Sunday, May 9th, 2010

I use this when I don’t feel like symbolically integrating (or am unable to).

def riemann_sum(function, interval, n, rule="middle"):
    start = float(interval[0])
    end = float(interval[1])
    dx = (end - start)/n
    if rule == "middle":
        x_values = (start + i*dx + dx/2 for i in range(n))
    elif rule == "left":
        x_values = (start + i*dx for i in range(n))
    elif rule == "right":
        x_values = (start + (i+1)*dx for i in range(n))
    return dx * sum(function(x) for x in x_values)

Here’s a usage example:

from math import sqrt
def func(x):
    return sqrt(x+2)
print "Interval: (1, 4), Sections: 16"
print "Left: {0}".format(riemann_sum(func, (1, 4), 16, rule="left"))
print "Midpoint: {0}".format(riemann_sum(func, (1, 4), 16))
print "Right: {0}".format(riemann_sum(func, (1, 4), 16, rule="right"))

Multiplication in Python ― without actually multiplying!

Sunday, January 31st, 2010

This is just something interesting I wrote last year. It will multiply two integers using an algorithm which is sometimes called “russian peasant multiplication”.

def peasant_multiply(a, b):
    '''Multiply the floor of 'a' and 'b' using only addition, bitshifts, and the binary AND operator.'''
    a = int(a)
    b = int(b)
    product = 0
    while a > 0:
        if a & 1:
            product = product + b
        a = a >> 1
        b = b << 1
    return product

DiceGen ― Automatic generation of Diceware- passphrases

Monday, August 10th, 2009

Diceware is a system for generating secure, word-based passphrases by using dice rolls as a source of randomness. There comes a point however, when you realize that enemy intelligence isn’t actually trying to break your encryption, and all you want is to make a strong and memorable passphrase without the fuss of rolling around small cubes.

That’s where DiceGen comes in. DiceGen is a command line program that does all the work for you. Run dicegen.py and it spits out a five word Diceware passphrase.

DiceGen is flexible: by default it uses the original Diceware wordlist to make passphrases, but it can also take simple, one-word-per-line (newline delimited) wordlists―not just Diceware wordlists. To do this, use --word-list-format=simple and --word-list-file=FILE to tell dicegen to use the wordlist at file FILE.

Download

Instructions

DiceGen is very easy to use. To generate five passphrases with ten words each, run python dicegen.py -n5 -w10. Running python dicegen.py --help gives you some more detailed usage information:

Usage: dicegen [-n] [-w]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -n NUM, --number=NUM  number of passphrases to generate [default: 1]
  -w NUM, --words=NUM   number of words to use in passphrase [default: 5]
  --no-spaces           do not add spaces between words
  --word-list-file=FILE
                        location of a complete Diceware wordlist [default: /path/to/diceware.wordlist.asc]
  --word-list-format=FORMAT
                        how the wordlist is formatted [possible values:
                        diceware, simple] [default: diceware]

Caveots

DiceGen uses python’s random library, which uses the Mersenne twister generator. This generator is not cryptographically secure. Don’t use DiceGen if you happen to be enemies with the NSA or really, really smart people.