How to remove duplicated files in a directory?
I downloaded a lot of images in a directory.
The downloader renamed files that already existed.
I also renamed some of the files manually.
a.jpg
b.jpg
b(2).jpg
hello.jpg <-- manually renamed `b(3).jpg`
c.jpg
c(2).jpg
world.jpg <-- manually renamed `d.jpg`
d(2).jpg
d(3).jpg
How do I remove the duplicated ones? The result should be:
a.jpg
b.jpg
c.jpg
world.jpg
Note: the names don't matter. I just want unique files.
11 Answers
bash 4.x
#!/bin/bash
declare -A arr
shopt -s globstar
for file in **; do
    [[ -f "$file" ]] || continue
    read cksm _ < <(md5sum "$file")
    if ((arr[$cksm]++)); then
        echo "rm $file"
    fi
done
This is both recursive and handles any file name. The downside is that it requires bash 4.x for associative arrays and recursive globbing (globstar). Remove the echo if you like the results.
gawk version
gawk '
{
    cmd="md5sum " q FILENAME q
    cmd | getline cksm
    close(cmd)
    sub(/ .*$/,"",cksm)
    if(a[cksm]++){
        cmd="echo rm " q FILENAME q
        system(cmd)
        close(cmd)
    }
    nextfile
}' q='"' *
Note that this will still break on files that have double quotes in their name. There is no real way to get around that with awk. Remove the echo if you like the results.
fdupes is the tool of your choice. To find all duplicate files (by content, not by name) in the current directory:
fdupes -r .
To manually confirm deletion of duplicated files:
fdupes -r -d .
To automatically delete all copies but the first of each duplicated file (be warned, this actually deletes files, as requested):
fdupes -r -f . | grep -v '^$' | xargs rm -v
I'd recommend manually checking the files before deletion:
fdupes -rf . | grep -v '^$' > files
... # check files
xargs -a files rm -v
You can try FSLint. It has both a command-line and a GUI interface.
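A hedged example of the command-line side (an assumption on my part, not from the answer: on Debian/Ubuntu the FSLint CLI scripts typically live outside the PATH, under /usr/share/fslint/fslint/, and findup is the duplicate-file finder; the exact path and invocation may differ on your system):
/usr/share/fslint/fslint/findup .   # lists groups of identical files under the current directory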
How can we test whether files have unique content?
if diff "$file1" "$file2" > /dev/null; then ...
How can we get the list of files in a directory?
files="$( find ${files_dir} -type f )"
We can take any two files from that list and check whether their names are different and their content is the same.
#!/bin/bash
# removeDuplicates.sh
files_dir=$1
if [[ -z "$files_dir" ]]; then
    echo "Error: files dir is undefined"
    exit 1
fi
files="$( find ${files_dir} -type f )"
for file1 in $files; do
    for file2 in $files; do
        # echo "checking $file1 and $file2"
        if [[ "$file1" != "$file2" && -e "$file1" && -e "$file2" ]]; then
            if diff "$file1" "$file2" > /dev/null; then
                echo "$file1 and $file2 are duplicates"
                rm -v "$file2"
            fi
        fi
    done
done
For example, we have some directory:
$> ls .tmp -1
all(2).txt
all.txt
file
text
text(2)
So there are only 3 unique files.
Let's run that script:
$> ./removeDuplicates.sh .tmp/
.tmp/text(2) and .tmp/text are duplicates
removed `.tmp/text'
.tmp/all.txt and .tmp/all(2).txt are duplicates
removed `.tmp/all(2).txt'
And we are left with only 3 files.
$> ls .tmp/ -1
all.txt
file
text(2)
I wrote this tiny script to delete duplicated files.
Basically it uses a temporary file (/tmp/list.txt) to build a map of files and their hashes. Later I use that file and the magic of Unix pipes to do the rest.
The script won't delete anything but will print the commands to delete files.
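The script itself isn't reproduced here, so below is a minimal sketch of what such a helper could look like, following the description above (a hypothetical reconstruction, not the original mfilter.sh; assumes md5sum and filenames without newlines or double quotes):
#!/bin/bash
# Hypothetical mfilter.sh: print rm commands for all but the first copy of each duplicate.
dir="${1:-.}"
# Build the hash -> file map in the temporary file, sorted so duplicates are adjacent.
find "$dir" -type f -exec md5sum {} + | sort > /tmp/list.txt
# md5sum lines look like "<32-char hash>  <path>"; print every path whose hash was already seen.
awk 'seen[substr($0, 1, 32)]++ { printf "rm -v \"%s\"\n", substr($0, 35) }' /tmp/list.txt
Piped into bash, as shown below, it actually performs the deletions.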
mfilter.sh ./dir | bash
Hope it helps
A more concise version of removing duplicated files (just one line):
young@ubuntu-16:~/test$ md5sum `find ./ -type f` | sort -k1 | uniq -w32 -d | xargs rm -fv
find_same_size.sh
#!/usr/bin/env bash
#set -x
# This small script finds files of the same size.
find_same_size(){
    if [[ -z $1 || ! -d $1 ]]
    then
        echo "Usage: $0 directory_name"; exit 1
    else
        dir_name=$1
        echo "current directory is $1"
        for i in $(find $dir_name -type f); do
            ls -fl $i
        done | awk '{
            f=""
            if(NF>9) for(i=9;i<=NF;i++) f=f?f" "$i:$i; else f=$9
            if(a[$5]){ a[$5]=a[$5]"\n"f; b[$5]++ } else a[$5]=f
        }
        END{ for(x in b) print a[x] }' | xargs stat -c "%s %n"   # just lists the files
    fi
}
find_same_size $1
young@ubuntu-16:~/test$ bash find_same_size.sh tttt/ | awk '{ if($1 !~ /^([[:alpha:]])+/) print $2}' | xargs md5sum | uniq -w32 -d | xargs rm -vf
I found an easier way to perform the same task:
for i in `md5sum * | sort -k1 | uniq -w32 -d | awk '{print $2}'`; do
    rm -rf "$i"
done
This is not what you are asking, but I think someone might find it useful when the checksums are not the same but the names are similar (with a suffix in parentheses). This script removes files whose names carry a "(digit)" suffix when a corresponding file without the suffix exists.
#! /bin/bash
# Warning: globstar excludes hidden directories.
# Turn on recursive globbing (in this script) or exit if the option is not supported:
shopt -s globstar || exit
for f in **
do
    extension="${f##*.}"
    # get only files with a parentheses suffix
    FILEWITHPAR=$( echo "${f%.*}".$extension | grep -o -P "(.*\([0-9]\)\..*)")
    # print file to be possibly deleted
    if [ -z "$FILEWITHPAR" ]; then
        :
    else
        echo "$FILEWITHPAR ident"
        # identify whether a similar file without the suffix exists
        FILENOPAR=$(echo $FILEWITHPAR | sed -e 's/^\(.*\)([0-9])\(.*\).*/\1\2/')
        echo "$FILENOPAR exists?"
        if [ -f "$FILENOPAR" ]; then
            # delete the file with the suffix in parentheses
            echo "$FILEWITHPAR to be deleted"
            rm -Rf "$FILEWITHPAR"
        else
            echo "no"
        fi
    fi
done
Most, and possibly all, of the remaining answers are terribly inefficient: they compute the checksum of each and every file in the directory to process.
A potentially orders-of-magnitude faster approach is to first get the size of each file, which is almost immediate (ls or stat), and then compute and compare checksums only for files with a non-unique size, keeping a single instance of each set of files that share both size and checksum (see the sketch after the link below).
Note that even though theoretically hash collisions can occur, there are not enough jpeg files on the entire Internet for a hash collision to reasonably have a chance to happen. Two files sharing both their size and checksum are identical for all intents and purposes.
See: How reliable are SHA1 sum and MD5 sums on very large files?
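A minimal sketch of that size-first idea (my own illustration, not taken from the answer; requires bash 4 associative arrays, GNU stat and md5sum, and assumes filenames without newlines):
#!/bin/bash
# Group files by size first, checksum only the sizes that occur more than once,
# then print (not run) rm commands for every copy after the first of each checksum.
declare -A size_count seen_sum

# First pass: count how many files share each size.
while IFS= read -r f; do
    size=$(stat -c %s "$f")
    ((size_count[$size]++))
done < <(find . -type f)

# Second pass: checksum only files whose size is not unique.
while IFS= read -r f; do
    size=$(stat -c %s "$f")
    (( size_count[$size] > 1 )) || continue
    read -r sum _ < <(md5sum "$f")
    if [[ -n "${seen_sum[$sum]}" ]]; then
        echo "rm -v \"$f\""
    else
        seen_sum[$sum]=1
    fi
done < <(find . -type f)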
I recommend fclones.
Fclones is a modern duplicate file finder and remover written in Rust, available on most Linux distros and macOS.
Notable features:
- supports spaces, non-ASCII and control characters in file paths
- allows searching in multiple directory trees
- respects .gitignore files
- safe: allows inspecting the list of duplicates manually before performing any action on them
- offers plenty of options for filtering / selecting files to remove or preserve
- very fast
To search for duplicates in the current directory simply run:
fclones group . >dupes.txt
Then you can inspect the dupes.txt file to check if it found the right duplicates (you can also modify that list to your liking).
Finally remove/link/move the duplicate files with one of:
fclones remove <dupes.txt
fclones link <dupes.txt
fclones move target <dupes.txt
fclones dedupe <dupes.txt   # copy-on-write deduplication on some filesystems
Example:
pkolaczk@p5520:~/Temp$ mkdir files
pkolaczk@p5520:~/Temp$ echo foo >files/foo1.txt
pkolaczk@p5520:~/Temp$ echo foo >files/foo2.txt
pkolaczk@p5520:~/Temp$ echo foo >files/foo3.txt
pkolaczk@p5520:~/Temp$ fclones group files >dupes.txt
[2022-05-13 18:48:25.608] fclones: info: Started grouping
[2022-05-13 18:48:25.613] fclones: info: Scanned 4 file entries
[2022-05-13 18:48:25.613] fclones: info: Found 3 (12 B) files matching selection criteria
[2022-05-13 18:48:25.614] fclones: info: Found 2 (8 B) candidates after grouping by size
[2022-05-13 18:48:25.614] fclones: info: Found 2 (8 B) candidates after grouping by paths and file identifiers
[2022-05-13 18:48:25.619] fclones: info: Found 2 (8 B) candidates after grouping by prefix
[2022-05-13 18:48:25.620] fclones: info: Found 2 (8 B) candidates after grouping by suffix
[2022-05-13 18:48:25.620] fclones: info: Found 2 (8 B) redundant files
pkolaczk@p5520:~/Temp$ cat dupes.txt
# Report by fclones 0.24.0
# Timestamp: 2022-05-13 18:48:25.621 +0200
# Command: fclones group files
# Base dir: /home/pkolaczk/Temp
# Total: 12 B (12 B) in 3 files in 1 groups
# Redundant: 8 B (8 B) in 2 files
# Missing: 0 B (0 B) in 0 files
6109f093b3fd5eb1060989c990d1226f, 4 B (4 B) * 3:
    /home/pkolaczk/Temp/files/foo1.txt
    /home/pkolaczk/Temp/files/foo2.txt
    /home/pkolaczk/Temp/files/foo3.txt
pkolaczk@p5520:~/Temp$ fclones remove <dupes.txt
[2022-05-13 18:48:41.002] fclones: info: Started deduplicating
[2022-05-13 18:48:41.003] fclones: info: Processed 2 files and reclaimed 8 B space
pkolaczk@p5520:~/Temp$ ls files
foo1.txt
I found a small program that really simplifies this kind of task: fdupes.
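For instance, to keep the first file in each set of duplicates and delete the rest without prompting (assuming your fdupes build supports the -d and -N/--noprompt options; double-check with fdupes --help first):
fdupes -r -d -N .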