# idedupebox

> Image deduplication CLI tool

idedupebox is a simple tool that recursively deduplicates the images in a directory using perceptual hashing (`phash`), via the [`sharp-phash`](https://www.npmjs.com/package/sharp-phash) library.
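
Perceptual hashes of visually similar images differ in only a few bits, so they are commonly compared by Hamming distance. The snippet below is a minimal sketch of that comparison, assuming each hash is a string of `0` and `1` characters. The function names and the threshold are hypothetical and for illustration only; they are not necessarily what idedupebox does internally.

```javascript
// Count the positions at which two equal-length hash strings differ.
// Assumption: each hash is a string of "0" / "1" characters.
function hammingDistance(a, b) {
	if (a.length !== b.length) throw new Error("Hashes must be the same length");
	let distance = 0;
	for (let i = 0; i < a.length; i++) {
		if (a[i] !== b[i]) distance++;
	}
	return distance;
}

// Hypothetical threshold, for illustration only.
const THRESHOLD = 2;

// Two images are treated as duplicates if their hashes are close enough.
function looksLikeDuplicate(hashA, hashB, threshold = THRESHOLD) {
	return hammingDistance(hashA, hashB) <= threshold;
}

console.log(hammingDistance("10110010", "10110110")); // 1
```

A lower threshold means fewer false positives, but also fewer near-duplicates caught.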

Full parallelisation is on the todo list, but the tool is reasonably efficient as it is: a test run over ~166K images took roughly half an hour on a Ryzen 7900. If this is a deal breaker for you, please get in touch and I'll add better parallelisation (or, even better, open a PR).

## System requirements
- [Node.js](http://nodejs.org/) + `npm` (bundled by default)
- Some command-line knowledge
- Some images in a directory to deduplicate

## Installation
Install this package via `npm`:

```bash
npm install -g idedupebox
```

...you may need to run the above command with `sudo`.

This will expose a command `idedupebox`.

### From source
If you'd rather install from source, do so by cloning this repository:

```bash
git clone https://codeberg.org/sbrl/idedupebox.git;
cd idedupebox;
```

Then, install dependencies:

```bash
npm install
```

Now follow the [getting started instructions below](#getting-started), replacing `idedupebox` with `src/index.mjs` - don't forget to be `cd`ed into the repository's directory.

## Getting started
This tool has 3 subcommands:

1. `dedupe`: Walks a directory recursively, hashing all images as it goes. Writes a list of clusters of duplicates to a `.jsonl` file.
2. `visualise`: Uses the `.jsonl` file from `dedupe` to create a subdirectory that hard-links all the clusters together into 1 folder for manual review.
3. `delete`: Deletes duplicates, leaving 1 image per cluster (be careful to have a backup!).

These subcommands should be used in this order.

To get detailed help, run this command:

```bash
idedupebox --help
```

### Examples
Some example command invocations are shown below.

Generate a duplicates file for a directory:

```bash
idedupebox dedupe --dirpath path/to/dir --output /tmp/x/20301120-duplicates.jsonl
```

Visualise an existing duplicates file:

```bash
idedupebox visualise --verbose --dirpath path/to/same_dir_as_above --filepath /tmp/x/20301120-duplicates.jsonl
```

Back up a directory:

```bash
tar -caf 20301120-backup.tar.gz path/to/same_dir_as_above
```

Dry-run a deletion of duplicates:

```bash
idedupebox delete --dirpath path/to/same_dir_as_above --filepath /tmp/x/20301120-duplicates.jsonl
```

(note: add `--force` to *actually* delete the duplicates)

> [!NOTE]
> A built-in check ensures that the last file in each cluster that still exists on disk is never deleted. When a deletion is required, which file in a given cluster of duplicates gets deleted is undefined: the candidates are shuffled with the Fisher-Yates[¹](https://bost.ocks.org/mike/shuffle/)[²](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle) algorithm.

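For reference, a minimal sketch of the Fisher-Yates shuffle mentioned in the note is given below. The `shuffle` function, the sample cluster, and the "keep the first, delete the rest" split are all hypothetical illustrations, not idedupebox's actual code.

```javascript
// Fisher-Yates shuffle: returns a shuffled copy, leaving the input untouched.
function shuffle(items) {
	const result = [...items];
	for (let i = result.length - 1; i > 0; i--) {
		// Pick a random index j such that 0 <= j <= i
		const j = Math.floor(Math.random() * (i + 1));
		[result[i], result[j]] = [result[j], result[i]];
	}
	return result;
}

// Hypothetical cluster, for illustration only.
const cluster = ["a.jpg", "b.jpg", "c.jpg"];

// One possible way to split a shuffled cluster (an assumption, not the tool's documented behaviour).
const [keep, ...remove] = shuffle(cluster);
console.log(`keep ${keep}, delete ${remove.join(", ")}`);
```

Because the order is random, which file survives differs from run to run, so review the clusters with `visualise` first if you care which copy is kept.
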
## Output format
Aside from `ascii`, there are a number of possible output formats. Their names (section headings) and example output structures are given below.

### JSONL (default)
```jsonl
{ "id": number, "filepaths": string[] }
...
```
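
Each line of the file is a standalone JSON object, so it can be read back with a few lines of code. The snippet below is a minimal sketch that assumes the structure shown above; the `parseClusters` name and the sample data are made up for illustration.

```javascript
// Parse JSONL text into an array of cluster objects, skipping blank lines.
function parseClusters(jsonlText) {
	return jsonlText
		.split("\n")
		.map(line => line.trim())
		.filter(line => line.length > 0)
		.map(line => JSON.parse(line));
}

// Made-up sample data, for illustration only.
const sampleJsonl = [
	'{ "id": 0, "filepaths": ["a/cat.jpg", "b/cat_copy.jpg"] }',
	'{ "id": 1, "filepaths": ["dog.png", "old/dog.png", "dog2.png"] }',
].join("\n");

for (const cluster of parseClusters(sampleJsonl)) {
	console.log(`cluster ${cluster.id}: ${cluster.filepaths.length} files`);
}
```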

### TSV
```tsv
filepath	cluster	phash
path/to/cat.jpg	0	base64_here
...
```
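
The TSV output can be regrouped into clusters by reading the `cluster` column. The snippet below is a minimal sketch that assumes one row per file and the three columns shown above; the `groupByCluster` name and the sample rows are made up for illustration.

```javascript
// Group TSV rows (filepath, cluster, phash) by their cluster column.
// Returns a Map from cluster id (as a string) to a list of filepaths.
function groupByCluster(tsvText) {
	const [header, ...rows] = tsvText.split("\n").filter(line => line.length > 0);
	const columns = header.split("\t");
	const groups = new Map();
	for (const row of rows) {
		const cells = row.split("\t");
		// Pair each cell with its column name from the header row
		const record = Object.fromEntries(columns.map((name, i) => [name, cells[i]]));
		if (!groups.has(record.cluster)) groups.set(record.cluster, []);
		groups.get(record.cluster).push(record.filepath);
	}
	return groups;
}

// Made-up sample data, for illustration only.
const sampleTsv = [
	"filepath\tcluster\tphash",
	"path/to/cat.jpg\t0\tAAAA",
	"path/to/cat_copy.jpg\t0\tAAAA",
	"path/to/dog.png\t1\tBBBB",
].join("\n");

for (const [id, filepaths] of groupByCluster(sampleTsv)) {
	console.log(`cluster ${id}: ${filepaths.join(", ")}`);
}
```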

## Contributing
Contributions are very welcome - both issues and pull requests! Please mention in your pull request that you release your work under the AGPL-3.0 (see below).

<!-- See [CONTRIBUTING.md](./CONTRIBUTING.md) for a guide on what to expect when submitting a pull request or issue to this project. -->

## Licence
idedupebox is released under the GNU Affero General Public License 3.0. The full licence text is included in the `LICENSE` file in this repository. TLDRLegal have a [great summary](https://www.tldrlegal.com/license/gnu-affero-general-public-license-v3-agpl-3-0) of the licence if you're interested.