Example: Find Duplicate Files by Name & MD5

+--dup_script
|  +--prabhu
|  |  +--text-file-1
|  |  |  Content: "I am not unique"
|  |  +--text-file-2
|  |  |  Content: "Some random content 1"
|  |  +--unique-file-1
|  |  |  Content: "Some unique content 1\nI am a very long line!"
|  +--prabhu2
|  |  +--text-file-1
|  |  |  Content: "I am not unique"
|  |  +--text-file-2
|  |  |  Content: "Some random content 2"
|  |  +--unique-file-2
|  |  |  Content: "Some unique content 2! \n I am a short line."
|  +--prabhu3
|  |  +--text-file-1
|  |  |  Content: "I am not unique"
|  |  +--text-file-2
|  |  |  Content: "Some random content 3"
|  |  +--unique-file-3
|  |  |  Content: "Some unique content 3\nI am an extreme long line............"


Explanation:

The dup_script directory will be our test directory. Inside, we have three folders: prabhu, prabhu2, and prabhu3.

Each of them contains a text-file-1 file with the same content in every folder and a text-file-2 file with different content in each folder. Also, each folder contains a unique-file-x file that has both a unique name and unique content.
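
To reproduce this layout for testing, the directory tree can be created with a few shell commands like these (a sketch; the file contents are taken from the listing above):

mkdir -p dup_script/prabhu dup_script/prabhu2 dup_script/prabhu3
cd dup_script

# Same name and same content in every folder
echo "I am not unique" > prabhu/text-file-1
echo "I am not unique" > prabhu2/text-file-1
echo "I am not unique" > prabhu3/text-file-1

# Same name but different content in every folder
echo "Some random content 1" > prabhu/text-file-2
echo "Some random content 2" > prabhu2/text-file-2
echo "Some random content 3" > prabhu3/text-file-2

# Unique name and unique content in every folder
printf 'Some unique content 1\nI am a very long line!\n' > prabhu/unique-file-1
printf 'Some unique content 2! \n I am a short line.\n' > prabhu2/unique-file-2
printf 'Some unique content 3\nI am an extreme long line............\n' > prabhu3/unique-file-3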


-------------------------------------------------------------------------------------------------------------------------------------------------------------
Find Duplicate Files by Name
-------------------------------------------------------------------------------------------------------------------------------------------------------------
The most common way of finding duplicate files is to search by file name. We can do this using a script:

awk -F'/' '{
  f = $NF
  a[f] = f in a ? a[f] RS $0 : $0
  b[f]++
}
END{
  for(x in b)
    if(b[x]>1)
      printf "Duplicate Filename: %s\n%s\n",x,a[x]
}' <(find . -type f)


Output:

Duplicate Filename: text-file-2
./prabhu3/text-file-2
./prabhu2/text-file-2
./prabhu/text-file-2
Duplicate Filename: text-file-1
./prabhu3/text-file-1
./prabhu2/text-file-1
./prabhu/text-file-1


Script Explanation

<(find . -type f) – Firstly, we use process substitution so that the awk command can read the output of the find command

find . -type f – The find command searches for all files under the current directory (.)

awk -F'/' – We use '/' as the field separator (FS) of the awk command. It makes extracting the filename easier, since the last field will be the filename (see the short demo after this explanation)

f = $NF – We save the filename in a variable f (NF is a predefined variable whose value is the number of fields in the current record)

a[f] = f in a ? a[f] RS $0 : $0 – If the filename doesn't exist in the associative array a[], we create an entry to map the filename to the full path. Otherwise, we append RS (the Record Separator, a newline by default) and the full path to a[f]

b[f]++ – We create another array b[] to record how many times a filename f has been found

END{for(x in b) – Finally, in the END block, we go through all entries in the array b[]

if(b[x]>1) – If the filename x has been seen more than once, that is, more than one file has this name

printf "Duplicate Filename: %s\n%s\n",x,a[x] – Then we print the duplicated filename x, followed by all the full paths with this filename: a[x]
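
As a quick demonstration of the field splitting, we can run awk with '/' as the separator on a sample path (a standalone example, not part of the script):

echo "./prabhu/text-file-1" | awk -F'/' '{ print NF, $NF }'

Output:

3 text-file-1

Here, the path splits into three fields (., prabhu, and text-file-1), so $NF is the filename. Also note that process substitution, <( ... ), makes the output of find readable as if it were a file; piping find's output into awk would behave the same way here.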


-------------------------------------------------------------------------------------------------------------------------------------------------------------
Find Duplicate Files by MD5
-------------------------------------------------------------------------------------------------------------------------------------------------------------

[adimulamvenkat19851609@cxln4 dup_script]$ cat>dup_md5.sh
awk '{
  md5=$1
  a[md5]=md5 in a ? a[md5] RS $2 : $2
  b[md5]++
}
END{
  for(x in b)
    if(b[x]>1)
      printf "Duplicate Files (MD5:%s):\n%s\n",x,a[x]
}' <(find . -type f -exec md5sum {} +)
  
  

Output:
[adimulamvenkat19851609@cxln4 dup_script]$ ./dup_md5.sh

Duplicate Files (MD5:eecef242ee5f52c3266ccaa38362c6c8):
./prabhu3/text-file-1
./prabhu2/text-file-1
./prabhu/text-file-1
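
In this script, md5sum prints each file's checksum followed by its path, so $1 holds the hash and $2 holds the path; grouping by $1 therefore groups files with identical content regardless of their names (reading the path as $2 assumes the paths contain no whitespace). Since process substitution only feeds find's output to awk, the same logic can also be written as a plain pipeline (a sketch equivalent to the script above):

find . -type f -exec md5sum {} + | awk '{
  md5=$1
  a[md5]=md5 in a ? a[md5] RS $2 : $2
  b[md5]++
}
END{
  for(x in b)
    if(b[x]>1)
      printf "Duplicate Files (MD5:%s):\n%s\n",x,a[x]
}'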


Note: MD5 (Message Digest 5) sums can be used as checksums to verify files or strings on a Linux file system.
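
For example, a checksum can be recorded and verified later with md5sum's -c option (a short sketch using one of our test files):

md5sum ./prabhu/text-file-1 > text-file-1.md5
md5sum -c text-file-1.md5

Output:

./prabhu/text-file-1: OK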