Archive for the ‘Computers’ Category

A(nother) duplicate file finder

Thursday, June 3rd, 2010

Goodness knows the world doesn’t need another duplicate file finder… but here one is!

Download find_duplicates-1.0.tar.gz

This should run on Python 2.6 and up. Usage is python find_duplicates.py DIRECTORY. If you just want to copy-paste or ogle the source, here it is:

#!/usr/bin/env python
"""This program finds files under a given directory which share the same
contents.
 
This program works by forming a list of one group containing the filepaths of
the files the user wants to inspect for duplicates. This list is examined and
pruned multiple times; each time the groups are partitioned into subgroups
using progressively more accurate (and slower) duplicate-checking strategies
until the remaining groups represent sets of duplicate files.
"""
import collections
import hashlib
import os
 
def make_initial_group(base_dir):
    """Recurse into the directory at base_dir, returning the filepaths of all
    the files found (symbolic links excluded) as a list within a list."""
    groups = [[]]
    for dirpath, _, filenames in os.walk(base_dir):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            if not os.path.islink(filepath):
                groups[0].append(filepath)
    return groups
 
def regroup_using_strategy(groups, strategy_func):
    """Refines the groups by using the output of strategy_func(filepath) as a
    key for regrouping the files.
 
    The keys returned by strategy_func(filepath) for duplicate files must be
    equal. Equivalently, keys must only differ for files that are different. An
    example strategy_func returns the filesize for the file located at
    filepath.
    """
    for group in groups:
        files_by_key = collections.defaultdict(list)
        for filepath in group:
            try:
                key = strategy_func(filepath)
            except (IOError, OSError):
                # Skip unreadable files; os.path.getsize raises OSError.
                continue
            files_by_key[key].append(filepath)
        for subgroup in files_by_key.itervalues():
            yield subgroup
 
def prune_groups(groups):
    """Remove groups whose size is too small to contain duplicates."""
    return (group for group in groups if len(list(group)) >= 2)
 
def print_groups(groups):
    """Print the filepaths in each group, separating each group by a blank
    line.
    """
    for group in groups:
        for filepath in group:
            print filepath
        print
 
def hash_file(filepath):
    """Compute and return the hash for the file located at filepath."""
    with open(filepath, "rb") as file_:
        return hashlib.md5(file_.read()).digest()
 
 
if __name__ == "__main__":
    import sys
    if len(sys.argv) != 2:
        sys.stderr.write("{0}: Usage: {0} directory\n"
                .format(sys.argv[0]))
        sys.exit(2)
    base_dir = sys.argv[1]
    groups = make_initial_group(base_dir)
    groups = prune_groups(regroup_using_strategy(groups, os.path.getsize))
    groups = prune_groups(regroup_using_strategy(groups, hash_file))
    print_groups(groups)

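This program runs quickly; it is in the same league as other popular tools of its kind, but it is still somewhat naive. It reads each candidate file fully into memory to hash it, and it trusts that matching hashes mean matching contents. I have plans for a faster algorithm which does not use hash algorithms and so avoids the risk of false positives due to hash collisions.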
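In the meantime, here is a rough sketch of how the strategy idea extends. It is not the faster algorithm mentioned above, just one more cheap pass slotted between the size check and the full hash. It is written for Python 3 rather than 2.6, and `regroup`, `hash_head`, `hash_file_chunked`, and `find_duplicates` are names made up for this example; the 4096-byte prefix and the 1 MiB chunk size are arbitrary choices.

```python
import collections
import hashlib
import os


def regroup(groups, strategy_func):
    """Split every group by strategy_func(filepath); drop groups of one."""
    for group in groups:
        files_by_key = collections.defaultdict(list)
        for filepath in group:
            try:
                key = strategy_func(filepath)
            except OSError:  # unreadable file: leave it out
                continue
            files_by_key[key].append(filepath)
        for subgroup in files_by_key.values():
            if len(subgroup) >= 2:
                yield subgroup


def hash_head(filepath, size=4096):
    """Hash only the first `size` bytes: cheap, and equal for duplicates."""
    with open(filepath, "rb") as file_:
        return hashlib.md5(file_.read(size)).digest()


def hash_file_chunked(filepath, chunk_size=1 << 20):
    """Hash the whole file without loading it into memory at once."""
    digest = hashlib.md5()
    with open(filepath, "rb") as file_:
        for chunk in iter(lambda: file_.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def find_duplicates(filepaths):
    """Run the strategies from cheapest to most expensive."""
    groups = [list(filepaths)]
    for strategy in (os.path.getsize, hash_head, hash_file_chunked):
        groups = list(regroup(groups, strategy))
    return groups
```

Each strategy only has to promise equal keys for identical files, so a prefix hash is a legal strategy even though it cannot confirm a match by itself. The final full-file hash still does that job, and it still carries the same (tiny) collision risk as the original.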

Riemann Sums with Python

Sunday, May 9th, 2010

I use this when I don’t feel like symbolically integrating (or am unable to).

def riemann_sum(function, interval, n, rule="middle"):
    start = float(interval[0])
    end = float(interval[1])
    dx = (end - start)/n
    if rule == "middle":
        x_values = (start + i*dx + dx/2 for i in range(n))
    elif rule == "left":
        x_values = (start + i*dx for i in range(n))
    elif rule == "right":
        x_values = (start + (i+1)*dx for i in range(n))
    else:
        raise ValueError("rule must be 'left', 'middle', or 'right'")
    return dx * sum(function(x) for x in x_values)

Here’s a usage example:

from math import sqrt
def func(x):
    return sqrt(x+2)
print "Interval: (1, 4), Sections: 16"
print "Left: {0}".format(riemann_sum(func, (1, 4), 16, rule="left"))
print "Midpoint: {0}".format(riemann_sum(func, (1, 4), 16))
print "Right: {0}".format(riemann_sum(func, (1, 4), 16, rule="right"))
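For this particular function the integral can also be done by hand, which makes for a handy sanity check. The sketch below is a compact Python 3 variant of the same idea (the `offsets` table is my own shorthand here, not part of the function above) and compares each rule against the exact value as the number of sections grows.

```python
from math import sqrt


def riemann_sum(function, interval, n, rule="middle"):
    """Compact Python 3 variant: sample each section at a relative offset."""
    start, end = float(interval[0]), float(interval[1])
    dx = (end - start) / n
    offsets = {"left": 0.0, "middle": 0.5, "right": 1.0}
    if rule not in offsets:
        raise ValueError("unknown rule: {0!r}".format(rule))
    return dx * sum(function(start + (i + offsets[rule]) * dx)
                    for i in range(n))


def func(x):
    return sqrt(x + 2)


# An antiderivative of sqrt(x + 2) is (2/3) * (x + 2) ** 1.5,
# so the integral over (1, 4) is (2/3) * (6 ** 1.5 - 3 ** 1.5).
exact = (2.0 / 3.0) * (6 ** 1.5 - 3 ** 1.5)

for n in (16, 64, 256):
    errors = ["{0}: {1:+.6f}".format(rule,
                                     riemann_sum(func, (1, 4), n, rule) - exact)
              for rule in ("left", "middle", "right")]
    print(n, errors)
```

Because sqrt(x+2) is increasing on the interval, the left rule undershoots and the right rule overshoots, while the midpoint rule lands much closer for the same number of sections.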

Multiplication in Python ― without actually multiplying!

Sunday, January 31st, 2010

This is just something interesting I wrote last year. It multiplies two integers using an algorithm which is sometimes called “Russian peasant multiplication”.

def peasant_multiply(a, b):
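    '''Multiply 'a' and 'b', each truncated to an integer, using only addition, negation, bitshifts, and the binary AND operator.'''
    a = int(a)
    b = int(b)
    # The loop needs a non-negative 'a'; flipping both signs keeps the product the same.
    if a < 0:
        a, b = -a, -b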
    product = 0
    while a > 0:
        if a & 1:
            product = product + b
        a = a >> 1
        b = b << 1
    return product
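To see why this works, it helps to print the rows the loop visits. The helper below is a small Python 3 sketch written just for illustration (`peasant_steps` is a made-up name, and it assumes a non-negative `a`): each row halves `a` and doubles `b`, and a row's `b` is added only when `a` is odd.

```python
def peasant_steps(a, b):
    """Return the (a, b, used) rows visited while multiplying a by b.

    Assumes a is a non-negative integer; `used` marks rows whose b is added.
    """
    rows = []
    while a > 0:
        rows.append((a, b, bool(a & 1)))
        a >>= 1
        b <<= 1
    return rows


rows = peasant_steps(13, 7)
for a, b, used in rows:
    print("{0:>3} {1:>3} {2}".format(a, b, "add" if used else "skip"))
print("product:", sum(b for _, b, used in rows if used))
```

The rows marked "add" line up with the 1 bits of 13 (binary 1101), so the sum is 7·1 + 7·4 + 7·8.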

DiceGen ― Automatic generation of Diceware passphrases

Monday, August 10th, 2009

Diceware is a system for generating secure, word-based passphrases by using dice rolls as a source of randomness. There comes a point, however, when you realize that enemy intelligence isn’t actually trying to break your encryption, and all you want is to make a strong and memorable passphrase without the fuss of rolling around small cubes.

That’s where DiceGen comes in. DiceGen is a command-line program that does all the work for you. Run dicegen.py and it spits out a five-word Diceware passphrase.

DiceGen is flexible: by default it uses the original Diceware wordlist to make passphrases, but it can also read simple wordlists with one word per line. To use one, pass --word-list-format=simple and --word-list-file=FILE, where FILE is the path to the wordlist.
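To make the difference between the two formats concrete, here is a rough Python 3 sketch of how each could be read. This is not DiceGen’s actual code: `parse_simple`, `parse_diceware`, and `PARSERS` are hypothetical names, and the sketch assumes that a Diceware-format line pairs a five-digit dice roll with a word and that anything else in the file (such as signature lines) can be skipped.

```python
def parse_simple(lines):
    """One word per line; surrounding whitespace and blank lines are ignored."""
    return [line.strip() for line in lines if line.strip()]


def parse_diceware(lines):
    """Keep the word from lines that look like '<five dice digits> <word>'.

    Lines of any other shape are skipped (assumption: they are not entries).
    """
    words = []
    for line in lines:
        fields = line.split()
        if (len(fields) == 2 and len(fields[0]) == 5
                and all(digit in "123456" for digit in fields[0])):
            words.append(fields[1])
    return words


# Hypothetical mapping from a --word-list-format value to a parser.
PARSERS = {"simple": parse_simple, "diceware": parse_diceware}

# Made-up sample lines in the assumed Diceware layout.
sample = ["some header line", "", "11111\talpha", "11112\tbravo"]
print(PARSERS["diceware"](sample))
print(PARSERS["simple"](["alpha\n", "\n", "  bravo  \n"]))
```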

Download

Instructions

DiceGen is very easy to use. To generate five passphrases with ten words each, run python dicegen.py -n5 -w10. Running python dicegen.py --help gives you some more detailed usage information:

Usage: dicegen [-n] [-w]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -n NUM, --number=NUM  number of passphrases to generate [default: 1]
  -w NUM, --words=NUM   number of words to use in passphrase [default: 5]
  --no-spaces           do not add spaces between words
  --word-list-file=FILE
                        location of a complete Diceware wordlist [default: /path/to/diceware.wordlist.asc]
  --word-list-format=FORMAT
                        how the wordlist is formatted [possible values:
                        diceware, simple] [default: diceware]

Caveats

DiceGen uses Python’s random module, which is built on the Mersenne Twister generator. This generator is not cryptographically secure. Don’t use DiceGen if you happen to be enemies with the NSA or really, really smart people.
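If that caveat bothers you, the standard library also has `random.SystemRandom`, which draws from the operating system’s randomness source (`os.urandom`) instead of the Mersenne Twister. The sketch below shows the idea in Python 3; it is not how DiceGen itself picks words, and `make_passphrase` and the tiny wordlist are made up for the example.

```python
import random


def make_passphrase(words, num_words=5, separator=" "):
    """Join num_words words drawn uniformly, with replacement, from words."""
    rng = random.SystemRandom()  # backed by os.urandom, not the Mersenne Twister
    return separator.join(rng.choice(words) for _ in range(num_words))


# Hypothetical stand-in for a real wordlist.
words = ["apple", "brick", "cloud", "delta", "ember", "flint"]
print(make_passphrase(words))
print(make_passphrase(words, num_words=10, separator=""))
```

A real wordlist should be far larger than this one; the strength of the passphrase comes from the number of words to choose from as much as from the generator.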

File carving OpenDocument text files (.odt)

Friday, July 4th, 2008

I recently destroyed (by accident, of course) someone’s file system. Big-time bummer. I needed to recover some files that were made with OpenOffice. The file carving route made the most sense, so I used Bless to inspect a collection of OpenOffice files I had around and made these definitions to work with Foremost and Scalpel. They worked like a charm! Here they are for your benefit:

#---------------------------------------------------------------------
# OPENOFFICE FILES
#---------------------------------------------------------------------
	odt	y	10000000	PK????????????????????????????mimetypeapplication/vnd.oasis.opendocument.textPK	META-INF/manifest.xmlPK????????????????????
	ods	y	10000000	PK????????????????????????????mimetypeapplication/vnd.oasis.opendocument.spreadsheetPK	META-INF/manifest.xmlPK????????????????????
	odp	y	10000000	PK????????????????????????????mimetypeapplication/vnd.oasis.opendocument.presentationPK	META-INF/manifest.xmlPK????????????????????
	odg	y	10000000	PK????????????????????????????mimetypeapplication/vnd.oasis.opendocument.graphicsPK	META-INF/manifest.xmlPK????????????????????
	odc	y	10000000	PK????????????????????????????mimetypeapplication/vnd.oasis.opendocument.chartPK	META-INF/manifest.xmlPK????????????????????
	odf	y	10000000	PK????????????????????????????mimetypeapplication/vnd.oasis.opendocument.formulaPK	META-INF/manifest.xmlPK????????????????????
	odi	y	10000000	PK????????????????????????????mimetypeapplication/vnd.oasis.opendocument.imagePK	META-INF/manifest.xmlPK????????????????????
	odm	y	10000000	PK????????????????????????????mimetypeapplication/vnd.oasis.opendocument.text-masterPK	META-INF/manifest.xmlPK????????????????????
	sxw	y	10000000	PK????????????????????????????mimetypeapplication/vnd.sun.xml.writerPK	META-INF/manifest.xmlPK????????????????????
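As I understand the configuration format, each line gives the extension, a case-sensitivity flag, a maximum size in bytes, a header pattern, and a footer pattern, with ? matching any single byte. The header works because these documents are ZIP archives: a ZIP local file header has a 30-byte fixed part starting with PK, followed by the entry name, and here the first entry is an uncompressed file called mimetype. The Python 3 sketch below checks the same signature on a blob of bytes; `starts_like_odf` and `make_sample` are made-up helpers, and the sample archive is a stand-in built with `zipfile`, not a real document.

```python
import io
import zipfile

ODT_MIME = b"application/vnd.oasis.opendocument.text"
LOCAL_HEADER_SIZE = 30  # fixed part of a ZIP local file header


def starts_like_odf(data, mime=ODT_MIME):
    """True if data begins with a stored ZIP entry 'mimetype' holding mime.

    The trailing b"PK" is the start of the next ZIP entry; it keeps
    '...opendocument.text' from also matching '...opendocument.text-master'.
    """
    expected = b"mimetype" + mime + b"PK"
    body = data[LOCAL_HEADER_SIZE:LOCAL_HEADER_SIZE + len(expected)]
    return data.startswith(b"PK") and body == expected


def make_sample(mime):
    """Build a tiny in-memory ZIP whose first entry is a stored 'mimetype'."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr("mimetype", mime)
        archive.writestr("content.xml", b"<doc/>")
    return buffer.getvalue()


print(starts_like_odf(make_sample(ODT_MIME)))
print(starts_like_odf(make_sample(ODT_MIME + b"-master")))
```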

XML Character Entities Cheat Sheet

Sunday, September 23rd, 2007

That key on your keyboard that you probably use for quotations and apostrophes isn’t actually a quotation or an apostrophe key! It’s a remnant from the typewriter era where typewriters had only a small subset of characters that were normally available in a printer’s type case. With computers and the era of desktop publishing, we’ve gone far beyond typewriters and can now be our own typesetters—you just have to know when and how to use the right characters. And trust me, your pages will look a lot spicier and much more professional when you get these down.

When writing your XML/XHTML documents you’ll want to type in these codes to get the character you want to display properly. These are called XML decimal character entities (more formally, decimal numeric character references). It’s quite a mouthful, but I’ve found that they have greater support in the wild when compared to the named entities you might be familiar with, such as &ldquo;. Keep in mind that a lot of the especially tricky characters outlined below have important rules. Don’t be intimidated, just do it! Using the proper characters will spice up your documents!

Here’s a helpful little reference that I use all the time!

[—] em dash (&#8212;)
• breaks in thought
• to enclose a clause like with parentheses
• open ranges
• century vague years
• two em dashes for missing letters
• three em dashes for missing words

[–] en dash (&#8211;)
• closed numerical ranges
• indicating a connection
• showing joint authors
• compounding a hyphenation with an adjective

[‒] figure dash (&#8210;)
• linking numbers together which are not a range
• telephone numbers

[−] minus sign (&#8722;)
• subtraction

[‐] hyphen (&#8208;)
• joining compound words

[“] left double quotation mark (&#8220;)
• exact quotations

[”] right double quotation mark (&#8221;)
• exact quotations

[‘] left single quotation mark (&#8216;)

[’] right single quotation mark (&#8217;)
• preferred character for use as an apostrophe

[…] ellipsis (&#8230;)
• before periods for one or more missing words
• after periods for one or more missing sentences
• with no periods for trailing thought

A lot of this information came from a great A List Apart article titled “The Trouble With EM ’n EN (and Other Shady Characters)” by Peter Sheerin as well as from a Wikipedia page on Dashes.
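If you would rather not type the codes by hand, Python can produce them for you. The sketch below (Python 3; `to_decimal_refs` is a made-up helper name) leans on the built-in `xmlcharrefreplace` error handler, which swaps every character that cannot be encoded as ASCII for its decimal code.

```python
def to_decimal_refs(text):
    """Replace every non-ASCII character with its &#NNNN; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# The sample uses escapes so this listing stays ASCII-only: 201c and 201d
# are the double quotation marks, 2019 the apostrophe, 2014 the em dash,
# 2013 the en dash, and 2026 the ellipsis.
sample = "\u201cDon\u2019t panic\u201d \u2014 pages 3\u20135\u2026"
print(to_decimal_refs(sample))
```

Plain ASCII text passes through untouched, so it is safe to run over a whole document. Note that it does not escape &, <, or >; those still need the usual XML escaping.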