Finding and deleting duplicate files
Okay, so you have a huge pile of mp3s, somehow managed to copy them around repeatedly, and now only want one copy of each? (Hey, I do this all the time copying them from machine to machine!)
The best way to check that two files really are "identical" is to compare their md5sums. This is how I deal with my problem.
find ./ -type f | while read file ; do md5sum "$file" >> md5list ; done # this gives me a file called md5list with all the filenames and their md5sums
awk '{print $1}' md5list | sort | uniq -c | awk '$1 > 1 {print $2}' > duplist # this picks out the md5sums that appear more than once
for i in `cat duplist` ; do grep "$i" md5list | sed "1d" | sed "s/^$i  //" >> rmlist ; done # this lists the files for each duplicate sum minus the first/top one, so we are still left with one copy
mkdir -p bin
cat rmlist | while read line ; do echo removing "$line" ; mv "$line" bin/ ; done # this moves them all to a dir called bin/ which you can remove later
echo check bin/ for any files you accidentally deleted # letting you know the above!
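If you prefer to do it in one pass, here is a more compact sketch of the same idea (just a sketch, using md5sum and awk, and assuming your filenames contain no newlines): sort by checksum and keep every entry after the first occurrence of each sum.

find ./ -type f -exec md5sum {} + | sort | awk 'seen[$1]++ { sub(/^[^ ]+  /, ""); print }' > rmlist # every file whose checksum was already seen, i.e. all but the first copy

The awk condition is false the first time a checksum appears and true afterwards, so exactly one copy of each file stays out of rmlist.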
You probably want to remove the files md5list, duplist and rmlist after you are done 🙂
You might want to look at the hardlink tool by Jakub Jelinek (https://fedorahosted.org/hardlink/browser/hardlink.c). It replaces duplicate files with hard links, so you don't need to delete anything.
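A rough sketch of how that might look (the directory name here is only an example, and the exact flags can differ between hardlink versions, so check hardlink --help first):

hardlink -n -v ~/music # dry run: only report which files would be linked
hardlink -v ~/music # actually replace the duplicates with hard links

Afterwards the duplicate copies share a single inode, so they no longer take up extra space.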
If, however, you do want to delete the duplicate entries, you could do something like:
find . -type f -links +1 -printf "%i %p\n" | sort -n
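That lists every file with more than one hard link, grouped by inode number. Building on it, a small sketch (assuming GNU find and awk, and filenames without newlines) keeps the first name seen for each inode and prints the rest, which are the extra directory entries you could then remove:

find . -type f -links +1 -printf "%i %p\n" | sort -n | awk 'seen[$1]++ { sub(/^[0-9]+ /, ""); print }' # every name after the first one per inode

Be careful, though: -links +1 also matches hard links you created on purpose, not only the ones made by hardlink.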
cheers 🙂