To make sure I get every unique photo without storing duplicates, I devised a simple system to organise the photos.
First I created a directory under /var/ called media. In this directory I do some setup:
```
# python
>>> import pickle
>>> pickle.dump([], open("index.pickle", "w"))
>>> ^D
# mkdir by_hash
# cd by_hash
# mkdir 0 1 2 3 4 5 6 7 8 9 a b c d e f
```

Now I'm going to write a helper to copy files in and update the index. First we import a few modules we will use, then assign some paths to variables so we can reuse them later.
```python
#!/usr/bin/python
import sys
import os.path
import hashlib
import pickle

# Some paths and directories we use
index_location = "/var/media/index.pickle"
media_root = "/var/media/by_hash/%s/%s"
```

The name of the file to index comes from the first command-line argument. We calculate the length and a checksum for the file.
```python
# Open the file and read its contents (binary mode, since photos are binary data)
file_argument = os.path.abspath(sys.argv[1])
filecontent = open(file_argument, "rb").read()
filelength = len(filecontent)
checksum = hashlib.sha1(filecontent).hexdigest()
file_entry = (file_argument, filelength, checksum)
```

Now we load the index and check whether a file with this length and checksum is already indexed.
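As a quick illustration of why the checksum identifies file content: hashing the same bytes always yields the same digest, regardless of the file's name or location, while different bytes yield a different digest.

```python
import hashlib

# Two "files" with identical content produce identical digests...
a = hashlib.sha1(b"hello").hexdigest()
b = hashlib.sha1(b"hello").hexdigest()
# ...while different content produces a different digest.
c = hashlib.sha1(b"hello!").hexdigest()

print(a == b)  # True
print(a == c)  # False
print(a)       # aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
```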
```python
# Check the index for any files with the same length and checksum.
index = pickle.load(open(index_location))
indexed_files = set([(entry_length, entry_checksum)
                     for entry_name, entry_length, entry_checksum in index])
indexed = (filelength, checksum) in indexed_files
```

If the file isn't indexed we copy it into the correct by_hash subdirectory and add it to the index.
```python
# If no file with this length and checksum exists, add it.
if not indexed:
    open(media_root % (checksum[0], checksum), "wb").write(filecontent)
    # Update the index (reload it, as writing a large file may take some time)
    index = pickle.load(open(index_location))
    index.append(file_entry)
    pickle.dump(index, open(index_location, "w"))
```

All finished. A short and sweet way to keep one copy of each unique photo.
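Putting the pieces together, here is a sketch of the whole flow as a single function, exercised against a temporary directory rather than /var/media. The function name and the demo paths are my own for illustration, not part of the original script.

```python
import hashlib
import os
import pickle
import tempfile

def index_file(path, index_location, media_root):
    """Copy path into the by_hash store unless identical content is already indexed."""
    content = open(path, "rb").read()
    checksum = hashlib.sha1(content).hexdigest()
    entry = (os.path.abspath(path), len(content), checksum)
    index = pickle.load(open(index_location, "rb"))
    if (len(content), checksum) not in set((l, c) for _, l, c in index):
        open(media_root % (checksum[0], checksum), "wb").write(content)
        index.append(entry)
        pickle.dump(index, open(index_location, "wb"))

# Demo in a temporary directory standing in for /var/media (paths hypothetical)
root = tempfile.mkdtemp()
index_location = os.path.join(root, "index.pickle")
pickle.dump([], open(index_location, "wb"))
for bucket in "0123456789abcdef":
    os.mkdir(os.path.join(root, bucket))
media_root = os.path.join(root, "%s", "%s")

photo = os.path.join(root, "photo.jpg")
open(photo, "wb").write(b"fake jpeg bytes")
index_file(photo, index_location, media_root)
index_file(photo, index_location, media_root)  # duplicate: index unchanged

print(len(pickle.load(open(index_location, "rb"))))  # 1
```

Running the helper twice on the same file leaves only one entry in the index and one copy in the store, which is exactly the deduplication behaviour described above.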
If it is important to capture the originating path of each copy, the index update code can be moved out of the if-block, so that an entry is recorded for every file added.
This simple design allows for a lot of flexibility. (For example, I have each of the by_hash folders symbolically linked to folders on different disk partitions to make better use of my available space.)
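The symlink trick can be sketched like this, with temporary directories standing in for the real mount points (all paths here are hypothetical):

```shell
# Stand-ins for /var/media/by_hash and a folder on another partition
media=$(mktemp -d)/by_hash
other=$(mktemp -d)
mkdir -p "$media" "$other/by_hash_0"

# Replace bucket "0" with a symlink onto the other partition
ln -s "$other/by_hash_0" "$media/0"

# Anything written under by_hash/0 now lands on the other partition
echo data > "$media/0/testfile"
ls "$other/by_hash_0"   # prints: testfile
```

Because the helper script only ever writes through the by_hash/&lt;digit&gt; paths, it never needs to know where each bucket physically lives.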
This index can now be queried directly using Python or accessed by other applications. The next post will detail creating a simple web page which lists the index data.
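By way of example, the index can be read back and filtered with ordinary Python. A tiny stand-in index is built here in a temporary directory (the real one lives at /var/media/index.pickle, and the entries below are hypothetical), using the same (name, length, checksum) tuples the helper stores.

```python
import os
import pickle
import tempfile

# Build a tiny stand-in index with hypothetical entries
index_location = os.path.join(tempfile.mkdtemp(), "index.pickle")
entries = [
    ("/photos/a.jpg", 1024, "da39a3ee5e6b4b0d3255bfef95601890afd80709"),
    ("/photos/b.jpg", 2048, "356a192b7913b04c54574d18c28d46e6395428ab"),
]
pickle.dump(entries, open(index_location, "wb"))

# Query: list the indexed files larger than 1 KiB
index = pickle.load(open(index_location, "rb"))
big = [name for name, length, checksum in index if length > 1024]
print(big)  # ['/photos/b.jpg']
```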