submitted 4 hours ago* (last edited 4 hours ago) by [email protected] to c/[email protected]

Context: my father is a lawyer and therefore has a bajillion PDF files that were digitised and stored on a server. I've got an idea of how to run OCR on all of them.

But after that, how can I make them easily searchable? (Keep in mind that, unfortunately, the directory structure is important information for classifying the files, i.e. you may have a path like clientABC/caseAV1/d.pdf.)

[-] [email protected] 5 points 4 hours ago

I'm a fucking dolt that dabbles and picks up the gist of things pretty quick, but I'm no authority on anything, so "grain of salt":

You're already familiar with OCR, so my naive approach (assuming consistent fields on the documents where you can nab name, case no., form type, blah blah) would be to populate a simple SQLite db with that data plus the full paths to the files. But I can only write very basic SQL queries, so for your pops you might then need to cobble together some sort of search form. Something for people that never learned SELECT filepath FROM casedata WHERE name LIKE "%Luigi%"; because they had to manually repair their Jellyfin DB one time when a plugin made a bunch of erroneous entries >:|
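To make the idea concrete, here's a minimal sketch of that approach using Python's built-in sqlite3 module with an FTS5 virtual table, so searches match whole words instead of needing LIKE wildcards. The table and column names (casedata, filepath, client, case_no, body) and the sample row are made up for illustration; the real fields would come from whatever the OCR pass can extract.

```python
import sqlite3

# Hypothetical schema: one row per PDF, storing the full path plus
# whatever fields OCR can pull out. FTS5 gives fast full-text search.
conn = sqlite3.connect(":memory:")  # use a file path for a real db
conn.execute("""
    CREATE VIRTUAL TABLE casedata USING fts5(
        filepath UNINDEXED,  -- stored, but not full-text indexed
        client, case_no, body
    )
""")

# Pretend this row came from OCR; note the directory structure itself
# (clientABC/caseAV1/...) doubles as classification metadata.
conn.execute(
    "INSERT INTO casedata VALUES (?, ?, ?, ?)",
    ("clientABC/caseAV1/d.pdf", "clientABC", "AV1", "deposition of Luigi"),
)

# Full-text query: MATCH is case-insensitive with the default tokenizer.
for (path,) in conn.execute(
    "SELECT filepath FROM casedata WHERE casedata MATCH ?", ("luigi",)
):
    print(path)  # -> clientABC/caseAV1/d.pdf
```

A simple search form could then just take a word from the user and run that one MATCH query, no SQL knowledge required on dad's end.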

[-] [email protected] 5 points 4 hours ago

Might be a little heavy handed for your needs but I've found paperless-ngx to be amazing.

[-] [email protected] 4 points 4 hours ago

My problem with paperless is the fact that it doesn't preserve the directory structure, losing essential info.

[-] [email protected] 2 points 4 hours ago

If tag/classification-based automated sorting isn't something the end user can live with, then Paperless-ngx isn't the solution. But if you have Nextcloud and you add both the to-be-preserved directory structure and Paperless-ngx's consume directory as external storage, you can have both with a little manual labour.

[-] [email protected] 2 points 4 hours ago

What server are they on?

If they are just on a Windows server, then the indexing service is actually good for fast results on a network share. If it's a Windows 10/11 PC, I think you need to enable classic search for it to provide results to clients over the network.

Alternatively, I believe Everything (the program) supports indexing network locations.

[-] [email protected] 3 points 4 hours ago* (last edited 4 hours ago)

What's a bajillion? If the OCR output is less than a few GB, which is a heck of a lot of text (like a million pages), just grepping the files is not too bad. Maybe a second or two. Otherwise you need search software. solr.apache.org is what I'm used to but there are tons of options.

this post was submitted on 19 Sep 2025
5 points (100.0% liked)
