this post was submitted on 19 Nov 2023
5 points (100.0% liked)

Self-Hosted Main


Having been so meticulous about taking backups, I’ve perhaps not been as careful about where I stored them, so I now have loads of duplicate files in various places. I’ve tried various tools (fdupes, czkawka, etc.), but none seems to do what I want. I need a tool that I can tell which folder (and subfolders) is the source of truth, and that will look for anything else, anywhere else, that’s a duplicate, and give me an option to move or delete it. Seems simple enough, but I have found nothing that allows me to do that. Does anyone know of anything?

top 26 comments
[–] [email protected] 2 points 11 months ago (2 children)

Write a simple script which iterates over the files and generates a hash list, with the hash in the first column.

find . -type f -exec md5sum {} ; >> /tmp/foo

Repeat for the backup files.

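Then make a third file by concatenating the two, sort that file, and run "uniq -d" on it, comparing only the hash column (with GNU uniq that is "uniq -w 32 -d", since an MD5 hash is 32 hex characters; a plain "uniq -d" compares whole lines, paths included). The output will tell you which hashes occur more than once, i.e. the duplicated files.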

You can take the output of uniq and de-duplicate.
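For anyone who would rather not chain shell commands together, the same hash-and-compare idea can be sketched in Python with a "source of truth" folder built in. This is only a rough sketch, not a tested tool: the function names are made up for illustration, it uses SHA-256 rather than md5sum, it assumes a POSIX-style filesystem, and it only reports duplicates without moving or deleting anything.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def iter_files(root):
    """Yield every regular file below root, skipping symlinks."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield path


def find_duplicates_of_reference(reference_root, other_roots):
    """Map each file outside reference_root to a reference file with identical content."""
    reference_root = os.path.realpath(reference_root)

    # Hash everything in the source of truth first.
    by_digest = {}
    for path in iter_files(reference_root):
        by_digest.setdefault(file_digest(path), path)

    duplicates = {}
    for root in other_roots:
        for path in iter_files(os.path.realpath(root)):
            # Never report anything that lives inside the reference tree itself.
            if os.path.commonpath([reference_root, path]) == reference_root:
                continue
            original = by_digest.get(file_digest(path))
            if original is not None:
                duplicates[path] = original
    return duplicates
```

Calling something like find_duplicates_of_reference("multimedia/photos", ["/some/backup/folder"]) (the second path is just a placeholder) returns a dictionary of "duplicate elsewhere" to "copy in the source of truth", which you can print and review as a dry run before deciding what to move or delete.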

[–] [email protected] 1 points 11 months ago (1 children)

Thanks @speculatrix - I wish I had your confidence in scripting - hence I’m hoping to find something that does all that clever stuff for me. The key thing for me is to say something like: multimedia/photos/ is the source of truth, and anything found elsewhere is a duplicate.

[–] [email protected] 1 points 11 months ago (1 children)

I wish I had your confidence in scripting

You know how you get it? by fucking around and finding out! I'd say give it a go!

Do a dry run of the de-dup to make sure you don't delete anything you care about.

[–] [email protected] 1 points 11 months ago

Give me a few years and maybe :P - but for now I’d rather not risk important data with my own limited skills, especially if there is a product out there that’s tried and tested and hopefully recommended by someone in this sub. I didn’t expect my ask to be quite so unique.

[–] [email protected] 1 points 11 months ago

I think you need a \ in front of the ;

i.e.: find . -type f -exec md5sum {} \; >> /tmp/foo

[–] [email protected] 2 points 11 months ago (1 children)

I've used dupeGuru on windows for cleaning up my photos, worked great for that. Has a GUI and also works on linux!
https://dupeguru.voltaicideas.net/

[–] [email protected] 1 points 11 months ago (1 children)

Thanks - I think I tried that - but at the time it had no concept of a source (location) of truth to preserve / find duplicates against - has that changed ? They don’t seem to reference that specific capability on that link ?

[–] [email protected] 1 points 11 months ago

Directories can be marked as reference directories to which other files would be considered duplicates.

[–] [email protected] 1 points 11 months ago (1 children)

Only YOU can tell which is the source of truth, but czkawka can easily do what you need. What issues did you have with it?

[–] [email protected] 1 points 11 months ago

I’ll have to reinstall it to remind myself what it was. If I recall correctly, it was not easy to work out what I needed to do, as I simply wanted to say: scan everything for duplicates of files that are in the directory hierarchy (e.g. multimedia/photos/) I have deemed to be the source of truth.

[–] [email protected] 1 points 11 months ago (2 children)

How should a duplicate finder know which is the source of the duplicate?

[–] [email protected] 1 points 11 months ago (1 children)

I’d like to find something that has that capability, so I can say multimedia/photos/ is the source of truth and anything identical found elsewhere is a duplicate. I hoped this would be an easy thing to do, as the ask is simply to ignore any duplicates in a particular folder hierarchy.

[–] [email protected] 1 points 11 months ago

Well that's possible with a lot of deduplicators. But I'd take a look at duff:

https://manpages.ubuntu.com/manpages/xenial/man1/duff.1.html

https://github.com/elmindreda/duff

The duff utility reports clusters of duplicates in the specified files and/or directories. In the default mode, duff prints a customizable header, followed by the names of all the files in the cluster. In excess mode, duff does not print a header, but instead for each cluster prints the names of all but the first of the files it includes.

 If no files are specified as arguments, duff reads file names from stdin.
[–] [email protected] 1 points 11 months ago

Well, it won't. You either tell it to assume that, say, the oldest is always the source, or, if there are identical files, you get asked to choose.

[–] [email protected] 1 points 11 months ago (1 children)

I don't think there is a good way to tell which of two duplicate files was "first" other than checking the creation date, but if this is Linux that attribute may not be enabled in your fs type.

The closest thing I've seen is a Python dedup script, but after it identifies all the dups it deletes all but one of them and then puts hard links to that real file where all the deleted dups were.
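To make the hard-link idea concrete, here is a minimal sketch of the replacement step. It is not the script mentioned above: the function name and the temporary suffix are made up for illustration, it assumes both files are on the same filesystem (hard links cannot cross filesystems), and it assumes you have already confirmed the two files have identical content.

```python
import os


def replace_with_hard_link(original, duplicate):
    """Replace duplicate with a hard link to original.

    Returns False if the two paths already point at the same file.
    The caller is responsible for checking the contents are identical.
    """
    if os.path.samefile(original, duplicate):
        return False  # already the same inode, nothing to do

    temp_name = duplicate + ".hardlink-tmp"  # hypothetical temporary name
    os.link(original, temp_name)       # create the new link next to the duplicate
    os.replace(temp_name, duplicate)   # swap it in, replacing the duplicate
    return True
```

Linking to a temporary name first and then renaming it over the duplicate means the duplicate's path never disappears part-way through. Note that after this both paths are the same file, so editing one edits the other.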

[–] [email protected] 1 points 11 months ago

Hi @xewgranodius - I’m not actually worried about which came first, the key thing for me is which one is located in the directory (source) of truth. If it’s not in there then it’s fair game and can be moved/deleted..

[–] [email protected] 1 points 11 months ago

Alldup is my preferred de-duplicator, it has options to protect folders and seems like what you want but it is windows only unfortunately

[–] [email protected] 1 points 11 months ago

I think you’re asking for a duplicate finder that can tell where a file came from (the source of truth?).

Most duplicate finders work by hashing the files and looking for matches. If the file indicated where it came from it would have a different hash and not be found to be a duplicate.

So I don’t think what you’re asking for can be done. But I’m not sure I understand what you’re asking.

[–] [email protected] 1 points 11 months ago (1 children)

Only runs on windows but I've been using double killer for years. Simple and does the trick

[–] [email protected] 1 points 11 months ago (1 children)

Thanks @CrappyTan69 - I ideally need this to run on my NAS, and if possible be opensource/free - looks like for what I’d need Double Killer for, it’s £15/$20 - maybe an option as a last resort..

[–] [email protected] 2 points 11 months ago

Can't you edit the OP and add the requirements? You haven't even told us what NAS you have.

[–] [email protected] 1 points 11 months ago

If you're 100% sure that the dupes are only between your source of truth and "everything else", you can run fdupes then grep -v /path/to/source/of/truth/root the output - all the file paths that remain are duplicate files outside your source of truth, which can be deleted.
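If you are not 100% sure, a slightly more careful version of that filter can be sketched in Python. This assumes fdupes' default output format (one path per line, groups of duplicates separated by a blank line); the function name is made up for illustration. It only lists a file as removable when its group also contains at least one copy inside the source of truth, and it compares whole path components rather than substrings.

```python
import os


def removable_paths(fdupes_output, reference_root):
    """Return duplicate paths outside reference_root, but only from groups
    that also contain at least one file inside reference_root."""
    reference_root = os.path.abspath(reference_root)

    def inside_reference(path):
        # Compare path components, so /data/photos2 is not treated as /data/photos.
        return os.path.commonpath([reference_root, os.path.abspath(path)]) == reference_root

    removable = []
    for block in fdupes_output.strip().split("\n\n"):
        group = [line for line in block.splitlines() if line.strip()]
        # Skip groups with no copy in the source of truth: deleting those
        # could remove the only copies you have.
        if any(inside_reference(p) for p in group):
            removable.extend(p for p in group if not inside_reference(p))
    return removable
```

The result is just a list of paths, so you can print it and look it over before deleting anything.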

[–] [email protected] 1 points 11 months ago (1 children)

czkawka can easily do this OP!

In this screenshot, for example, I added 3 folders and marked the first folder as the reference folder (the checkmark behind it). It will now look for files from this folder in the other folders and delete all identical files found in the non-reference folders (it will of course first list all of them and ask you to confirm before deleting).

[–] [email protected] 1 points 11 months ago

thanks. this served my use case perfectly https://github.com/qarmin/czkawka

[–] [email protected] 1 points 11 months ago

I use fclones. Not sure if this works the same as fdupes (likely does). Not sure if it will help you. It's just a thing I use. Hope it helps.

[–] [email protected] 1 points 11 months ago

I use double killer on Windows and rmlint on Linux. With rmlint you can use tagged directories and tell it to keep the tagged and only match against the tagged. It has a lot of options, but no GUI.