Thursday, November 24, 2011

Easy media index.

Recently I started going through my old backup CDs. On these discs are folders of photos that I've taken on my digital camera, or copies of photos from other people's cameras. Some of the folders have place names, others have date ranges. It's quite disorganised and there are lots of duplicates.

To make sure I keep every unique photo without storing duplicates, I devised a simple system to organise them.

Firstly I created a directory under /var/ called media. In this directory I do some setup:
# python
>>> import pickle
>>> pickle.dump([], open("index.pickle", "w"))
>>> ^D
# mkdir by_hash
# cd by_hash
# mkdir 0 1 2 3 4 5 6 7 8 9 a b c d e f
Now I'm going to write a helper to copy files in and update the index. First we import a few modules we will use, then we assign some paths to variables so we can reuse them later.
#!/usr/bin/python
import sys
import os.path
import hashlib
import pickle

# Some paths we will use: the index file, and a template for
# by_hash/<first hex digit>/<full checksum> destinations.
index_location = "/var/media/index.pickle"
media_root = "/var/media/by_hash/%s/%s"
The name of the file to index comes from the first command-line argument. We calculate the file's length and a SHA-1 checksum of its contents.
# Open the file in binary mode and read its contents
file_argument = os.path.abspath(sys.argv[1])
filecontent = open(file_argument, 'rb').read()
filelength = len(filecontent)
checksum = hashlib.sha1(filecontent).hexdigest()
file_entry = (file_argument, filelength, checksum)
Now we load the index and check whether a file with this length and checksum is already indexed.
# Check index for any files with the same length and checksum.
index = pickle.load(open(index_location))
indexed_files = set([(entry_length, entry_checksum) 
                     for entry_name, entry_length, entry_checksum
                     in index])
indexed = (filelength, checksum) in indexed_files
If the file isn't indexed we copy it into the correct by_hash subdirectory and record it in the index.
# If no file with this length and checksum exists, add it.
if not indexed:
    open(media_root % (checksum[0], checksum), 'wb').write(filecontent)
    # Update the index, (reload it first, as writing a large file may take some time.)
    index = pickle.load(open(index_location))
    index.append(file_entry)
    pickle.dump(index, open(index_location, "w"))
All finished. A short and sweet way to get one copy of each unique photo.
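
To index a whole disc's worth of photos, a small driver can walk the copied folders and call the helper on every file. This is only a sketch: it assumes the helper above was saved as /var/media/add_to_index.py and made executable, and the backup path is illustrative.
#!/usr/bin/python
# Hypothetical driver: walk a copied backup folder and pass every
# file to the indexing helper. Paths here are illustrative.
import os
import subprocess

backup_root = "/mnt/backup_cd"

for dirpath, dirnames, filenames in os.walk(backup_root):
    for name in filenames:
        # The helper itself skips anything already indexed.
        subprocess.call(["/var/media/add_to_index.py",
                         os.path.join(dirpath, name)])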

If it is important to capture the originating path of each copy, the index update code can be moved out of the if-block; then there will be an entry for every file added.
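For example, the end of the helper might then read:
# Only write the file content when it hasn't been seen before...
if not indexed:
    open(media_root % (checksum[0], checksum), 'wb').write(filecontent)

# ...but always record the originating path in the index.
index = pickle.load(open(index_location))
index.append(file_entry)
pickle.dump(index, open(index_location, "w"))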

This simple design allows for a lot of flexibility. (For example, I have each of the by-hash folders symbolically linked to folders on different disk partitions to make better use of my available space.)
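Something like the following replaces one bucket with a symbolic link, done before any files land in it, (the /mnt/disk2 mount point is just an example):
# python
>>> import os
>>> os.rmdir("/var/media/by_hash/a")
>>> os.symlink("/mnt/disk2/media_a", "/var/media/by_hash/a")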

This index can now be queried directly using Python or accessed by other applications. The next post will detail creating a simple web page which will list the index data.
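
For instance, a quick interactive session can report what has been indexed so far, (a minimal sketch; the fields follow the file_entry tuples defined above):
# python
>>> import pickle
>>> index = pickle.load(open("/var/media/index.pickle"))
>>> print "indexed files:", len(index)
>>> for name, length, checksum in index:
...     print checksum, length, name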
