Example: Find Duplicate Files by Name & MD5

dup_script
|  +--prabhu
|  |  +--text-file-1
|  |  |  Content: "I am not unique"
|  |  +--text-file-2
|  |  |  Content: "Some random content 1"
|  |  +--unique-file-1
|  |  |  Content: "Some unique content 1\nI am a very long line!"
|  +--prabhu2
|  |  +--text-file-1
|  |  |  Content: "I am not unique"
|  |  +--text-file-2
|  |  |  Content: "Some random content 2"
|  |  +--unique-file-2
|  |  |  Content: "Some unique content 2! \n I am a short line."
|  +--prabhu3
|  |  +--text-file-1
|  |  |  Content: "I am not unique"
|  |  +--text-file-2
|  |  |  Content: "Some random content 3"
|  |  +--unique-file-3
|  |  |  Content: "Some unique content 3\nI am an extreme long line............"


The dup_script directory is our test directory. Inside it are three folders: prabhu, prabhu2, and prabhu3.

Each of them contains a text-file-1 file with the same content and a text-file-2 file whose content differs from folder to folder. Each folder also contains a unique-file-x file whose name and content are both unique.
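
To follow along, the test layout can be recreated with a short script. This is only a sketch: it assumes the directory is created under the current working directory and that every file ends with a trailing newline, neither of which the listing above states.

```shell
#!/bin/bash
# Recreate the dup_script test layout (assumption: built in the current directory).
mkdir -p dup_script/prabhu dup_script/prabhu2 dup_script/prabhu3

i=1
for d in prabhu prabhu2 prabhu3; do
  # Same name and same content in every folder.
  printf 'I am not unique\n' > "dup_script/$d/text-file-1"
  # Same name, but the content differs per folder.
  printf 'Some random content %s\n' "$i" > "dup_script/$d/text-file-2"
  i=$((i + 1))
done

# Files whose name and content are both unique.
printf 'Some unique content 1\nI am a very long line!\n' > dup_script/prabhu/unique-file-1
printf 'Some unique content 2! \n I am a short line.\n' > dup_script/prabhu2/unique-file-2
printf 'Some unique content 3\nI am an extreme long line............\n' > dup_script/prabhu3/unique-file-3
```

Running find dup_script -type f afterwards should list nine files.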

Find Duplicate Files by Name

The most common way of finding duplicate files is to search by file name. We can do this using a script:

awk -F'/' '{
  f = $NF
  a[f] = f in a ? a[f] RS $0 : $0
  b[f]++ }
  END{for(x in b)
        if(b[x]>1)
          printf "Duplicate Filename: %s\n%s\n",x,a[x] }' <(find . -type f)


Duplicate Filename: text-file-2
Duplicate Filename: text-file-1

Script Explanation

<(find . -type f) – First, we use process substitution so that the awk command can read the output of the find command
find . -type f – The find command searches for all regular files under the current directory (.)

awk -F'/' – We use '/' as the field separator (FS) of the awk command. This makes extracting the filename easier, because the last field is always the filename

f = $NF – We save the filename in a variable f (NF is a predefined variable whose value is the number of fields in the current record, so $NF is the last field)

a[f] = f in a ? a[f] RS $0 : $0 – If the filename doesn’t exist in the associative array a[], we create an entry that maps the filename to the full path. Otherwise, we append a newline (RS, the record separator) followed by the full path to a[f]

b[f]++ – We create another array b[] to record how many times a filename f has been found

END{for(x in b) – Finally, in the END block, we go through all entries in the array b[]

if(b[x]>1) – If the filename x has been seen more than once, that is, there are more files with this filename

printf "Duplicate Filename: %s\n%s\n",x,a[x] – Then we print the duplicated filename x, followed by all full paths with this filename: a[x]
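
The steps explained above can be tried on fixed input, without touching the file system. The sketch below is illustrative only: the paths under ./x and ./y are hypothetical stand-ins for find output, a pipe replaces the process substitution, and the ternary is spelled out as an explicit if/else so that the result does not depend on the order in which a particular awk implementation evaluates the assignment.

```shell
# How -F'/' splits a path: NF counts the fields, $NF is the last one.
fields=$(echo ./prabhu/text-file-1 | awk -F'/' '{ print NF, $NF }')
printf '%s\n' "$fields"

# Hypothetical paths standing in for the output of find.
out=$(printf '%s\n' ./x/report.txt ./y/report.txt ./y/notes.txt |
  awk -F'/' '{
    f = $NF
    if (f in a) a[f] = a[f] RS $0   # name seen before: append the path
    else        a[f] = $0           # first sighting: start the list
    b[f]++ }
    END { for (x in b)
            if (b[x] > 1)
              printf "Duplicate Filename: %s\n%s\n", x, a[x] }')
printf '%s\n' "$out"
```

The first command prints "3 text-file-1", since the path splits into ".", "prabhu", and "text-file-1". The second prints "Duplicate Filename: report.txt" followed by ./x/report.txt and ./y/report.txt; notes.txt appears only once, so it is not reported.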

Find Duplicate Files by MD5

[adimulamvenkat19851609@cxln4 dup_script]$ cat>dup_md5.sh
awk '{
  md5=$1
  a[md5]=md5 in a ? a[md5] RS $2 : $2
  b[md5]++ }
  END{for(x in b)
        if(b[x]>1)
          printf "Duplicate Files (MD5:%s):\n%s\n",x,a[x] }' <(find . -type f -exec md5sum {} +)

[adimulamvenkat19851609@cxln4 dup_script]$ ./dup_md5.sh

Duplicate Files (MD5:eecef242ee5f52c3266ccaa38362c6c8):

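Note: MD5 (Message Digest 5) sums can be used as checksums to verify files or strings in a Linux file system. Two files with the same MD5 sum almost certainly have the same content, which is why the script groups files by their checksum.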
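
The MD5 approach can also be checked end to end in a throwaway directory. This is a minimal sketch under several assumptions: the scratch directory and the file names copy-a, copy-b, and other are hypothetical, md5sum from GNU coreutils is installed, and no file name contains spaces (the script takes the path from $2, so a name with spaces would be cut short).

```shell
#!/bin/bash
# Hypothetical scratch directory; it is removed at the end.
tmp=$(mktemp -d)
mkdir "$tmp/one" "$tmp/two"
printf 'I am not unique\n' > "$tmp/one/copy-a"   # same content as copy-b
printf 'I am not unique\n' > "$tmp/two/copy-b"   # different name, same content
printf 'Something else\n'  > "$tmp/two/other"    # unique content

# md5sum prints "<checksum>  <path>", so $1 is the checksum and $2 the path.
out=$(cd "$tmp" && find . -type f -exec md5sum {} + |
  awk '{
    md5 = $1
    if (md5 in a) a[md5] = a[md5] RS $2   # checksum seen before: append the path
    else          a[md5] = $2             # first file with this checksum
    b[md5]++ }
    END { for (x in b)
            if (b[x] > 1)
              printf "Duplicate Files (MD5:%s):\n%s\n", x, a[x] }')
printf '%s\n' "$out"
rm -rf "$tmp"
```

The output should contain one "Duplicate Files" group listing ./one/copy-a and ./two/copy-b, even though their names differ. That is the case the name-based search cannot detect.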