Example: Find Duplicate Files by Name & MD5

+--dup_script
|  +--prabhu
|  |  +--text-file-1
|  |  |  Content: "I am not unique"
|  |  +--text-file-2
|  |  |  Content: "Some random content 1"
|  |  +--unique-file-1
|  |  |  Content: "Some unique content 1\nI am a very long line!"
|  +--prabhu2
|  |  +--text-file-1
|  |  |  Content: "I am not unique"
|  |  +--text-file-2
|  |  |  Content: "Some random content 2"
|  |  +--unique-file-2
|  |  |  Content: "Some unique content 2! \n I am a short line."
|  +--prabhu3
|  |  +--text-file-1
|  |  |  Content: "I am not unique"
|  |  +--text-file-2
|  |  |  Content: "Some random content 3"
|  |  +--unique-file-3
|  |  |  Content: "Some unique content 3\nI am an extreme long line............"


Explanation:

The dup_script directory will be our test directory. Inside, we have three folders: prabhu, prabhu2, and prabhu3.

Each of them contains a text-file-1 file with the same content in every folder and a text-file-2 file with different content in each folder. Also, each folder contains a unique-file-x file that has both a unique name and unique content.
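
To reproduce this layout for testing, the directory tree can be created with a few shell commands like these (a sketch; the file contents are taken from the listing above):

mkdir -p dup_script/prabhu dup_script/prabhu2 dup_script/prabhu3
cd dup_script

# Same name and same content in every folder
echo "I am not unique" > prabhu/text-file-1
echo "I am not unique" > prabhu2/text-file-1
echo "I am not unique" > prabhu3/text-file-1

# Same name but different content in every folder
echo "Some random content 1" > prabhu/text-file-2
echo "Some random content 2" > prabhu2/text-file-2
echo "Some random content 3" > prabhu3/text-file-2

# Unique name and unique content in every folder
printf 'Some unique content 1\nI am a very long line!\n' > prabhu/unique-file-1
printf 'Some unique content 2! \n I am a short line.\n' > prabhu2/unique-file-2
printf 'Some unique content 3\nI am an extreme long line............\n' > prabhu3/unique-file-3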


-------------------------------------------------------------------------------------------------------------------------------------------------------------
Find Duplicate Files by Name
-------------------------------------------------------------------------------------------------------------------------------------------------------------
The most common way of finding duplicate files is to search by file name. We can do this using a script:

awk -F'/' '{
  f = $NF
  a[f] = f in a ? a[f] RS $0 : $0
  b[f]++
}
END{
  for(x in b)
    if(b[x]>1)
      printf "Duplicate Filename: %s\n%s\n",x,a[x]
}' <(find . -type f)


Output:

Duplicate Filename: text-file-2
./prabhu3/text-file-2
./prabhu2/text-file-2
./prabhu/text-file-2
Duplicate Filename: text-file-1
./prabhu3/text-file-1
./prabhu2/text-file-1
./prabhu/text-file-1


Script Explanation

<(find . -type f) – Firstly, we use process substitution so that the awk command can read the output of the find command

find . -type f – The find command searches for all files under the current directory (.)

awk -F'/' – We use '/' as the field separator (FS) of the awk command. It makes extracting the filename easier, since the last field will be the filename (see the short demo after this explanation)

f = $NF – We save the filename in a variable f (NF is a predefined variable whose value is the number of fields in the current record)

a[f] = f in a ? a[f] RS $0 : $0 – If the filename doesn't exist in the associative array a[], we create an entry to map the filename to the full path. Otherwise, we append RS (the Record Separator, a newline by default) and the full path to a[f]

b[f]++ – We create another array b[] to record how many times a filename f has been found

END{for(x in b) – Finally, in the END block, we go through all entries in the array b[]

if(b[x]>1) – If the filename x has been seen more than once, that is, more than one file has this name

printf "Duplicate Filename: %s\n%s\n",x,a[x] – Then we print the duplicated filename x, followed by all the full paths with this filename: a[x]
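
As a quick demonstration of the field splitting, we can run awk with '/' as the separator on a sample path (a standalone example, not part of the script):

echo "./prabhu/text-file-1" | awk -F'/' '{ print NF, $NF }'

Output:

3 text-file-1

Here, the path splits into three fields (., prabhu, and text-file-1), so $NF is the filename. Also note that process substitution, <( ... ), makes the output of find readable as if it were a file; piping find's output into awk would behave the same way here.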


-------------------------------------------------------------------------------------------------------------------------------------------------------------
Find Duplicate Files by MD5
-------------------------------------------------------------------------------------------------------------------------------------------------------------

[adimulamvenkat19851609@cxln4 dup_script]$ cat>dup_md5.sh
awk '{
  md5=$1
  a[md5]=md5 in a ? a[md5] RS $2 : $2
  b[md5]++
}
END{
  for(x in b)
    if(b[x]>1)
      printf "Duplicate Files (MD5:%s):\n%s\n",x,a[x]
}' <(find . -type f -exec md5sum {} +)
  
  

Output:
[adimulamvenkat19851609@cxln4 dup_script]$ ./dup_md5.sh

Duplicate Files (MD5:eecef242ee5f52c3266ccaa38362c6c8):
./prabhu3/text-file-1
./prabhu2/text-file-1
./prabhu/text-file-1
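
In this script, md5sum prints each file's checksum followed by its path, so $1 holds the hash and $2 holds the path; grouping by $1 therefore groups files with identical content regardless of their names (reading the path as $2 assumes the paths contain no whitespace). Since process substitution only feeds find's output to awk, the same logic can also be written as a plain pipeline (a sketch equivalent to the script above):

find . -type f -exec md5sum {} + | awk '{
  md5=$1
  a[md5]=md5 in a ? a[md5] RS $2 : $2
  b[md5]++
}
END{
  for(x in b)
    if(b[x]>1)
      printf "Duplicate Files (MD5:%s):\n%s\n",x,a[x]
}'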


Note: MD5 (Message Digest 5) sums can be used as checksums to verify files or strings on a Linux file system.
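
For example, a checksum can be recorded and verified later with md5sum's -c option (a short sketch using one of our test files):

md5sum ./prabhu/text-file-1 > text-file-1.md5
md5sum -c text-file-1.md5

Output:

./prabhu/text-file-1: OK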