Finding and deleting duplicate files
Okay, so you have a huge pile of mp3s, somehow managed to copy them around repeatedly, and now only want one copy of each? (Hey, I do this all the time copying them from machine to machine!)
The best way to check that two files really are "identical" is to compare their md5sums. This is how I deal with my problem.
find ./ -type f | while read file ; do md5sum "$file" >> md5list ; done # this gives me a file called md5list with all the filenames and their md5sums
awk '{print $1}' md5list | sort | uniq -c | awk '$1 > 1 {print $2}' > duplist # this picks out the md5sums that appear more than once
for i in `cat duplist` ; do grep "$i" md5list | sed "1d" | sed "s/^$i  //" >> rmlist ; done # this lists the files for each duplicate sum minus the first/top one, so we are still left with one copy
mkdir -p bin
cat rmlist | while read line ; do echo removing "$line" ; mv "$line" bin/ ; done # this moves them all to a dir called bin/ which you can remove later
echo check bin/ for any files you accidentally deleted # letting you know the above!
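If you prefer to do it in one pass, here is a more compact sketch of the same idea (just a sketch, using md5sum and awk, and assuming your filenames contain no newlines): sort by checksum and keep every entry after the first occurrence of each sum.

find ./ -type f -exec md5sum {} + | sort | awk 'seen[$1]++ { sub(/^[^ ]+  /, ""); print }' > rmlist # every file whose checksum was already seen, i.e. all but the first copy

The awk condition is false the first time a checksum appears and true afterwards, so exactly one copy of each file stays out of rmlist.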
You probably want to remove the files md5list, duplist and rmlist after you are done 🙂
You might want to look at the hardlink tool by Jakub Jelinek (https://fedorahosted.org/hardlink/browser/hardlink.c). It replaces duplicate files with hard links, so you don't need to delete anything.
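A rough sketch of how that might look (the directory name here is only an example, and the exact flags can differ between hardlink versions, so check hardlink --help first):

hardlink -n -v ~/music # dry run: only report which files would be linked
hardlink -v ~/music # actually replace the duplicates with hard links

Afterwards the duplicate copies share a single inode, so they no longer take up extra space.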
If, however, you do want to delete the duplicate entries, you could do something like:
find . -type f -links +1 -printf "%i %p\n" | sort -n
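That lists every file with more than one hard link, grouped by inode number. Building on it, a small sketch (assuming GNU find and awk, and filenames without newlines) keeps the first name seen for each inode and prints the rest, which are the extra directory entries you could then remove:

find . -type f -links +1 -printf "%i %p\n" | sort -n | awk 'seen[$1]++ { sub(/^[0-9]+ /, ""); print }' # every name after the first one per inode

Be careful, though: -links +1 also matches hard links you created on purpose, not only the ones made by hardlink.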
cheers 🙂