This can be solved fairly simply with traditional flex/bison or a PEG; using an LLM is way overkill.
Thanks, I'll take a look at them. While an LLM would bring autocompleted context (although possibly a hallucinated context, but it would work for me) about the hapax word/concept, I can also discover hidden hapax words in a dataset and then try to figure out their meaning.
Not the original commenter, but to add some more context. The words usually removed in traditional NLP applications are called "stop words" and are usually more "non-valuable" words like "the, and, but".
However, LLMs don't skip stop words; they actually need them to better understand the context of the sentence. That being said, LLMs are not great for statistical analysis, and a simple word count would be more consistent and faster.
They cut anything but the median part of the dataset: the most frequent words, as you said, as well as words that occurred just once across the entire dataset. At least it's what Wikipedia states on Hapax legomenon:
In the fields of computational linguistics and natural language processing (NLP), esp. corpus linguistics and machine-learned NLP, it is common to disregard hapax legomena (and sometimes other infrequent words), as they are likely to have little value for computational techniques. This disregard has the added benefit of significantly reducing the memory use of an application, since, by Zipf's law, many words are hapax legomena.[13]
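To make the "keep only the middle" idea concrete, here is a minimal Python sketch. It is not how any particular NLP library does it: the function name `prune_vocabulary` and the `top_n` cutoff are made up for illustration, and a real pipeline would more likely use a fixed stop word list than a "top N most frequent" rule.

```python
import re
from collections import Counter

def prune_vocabulary(text, top_n=2):
    """Split a text's vocabulary into (kept, frequent, hapaxes).

    `top_n` is a hypothetical stand-in for a stop word list: the N most
    frequent words are treated as "too common to be useful".
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Most frequent words (the "stop word" end of the distribution).
    frequent = {word for word, _ in counts.most_common(top_n)}
    # Words seen exactly once (hapax legomena).
    hapaxes = {word for word, n in counts.items() if n == 1}
    # Whatever is left is the middle part that typically gets kept.
    kept = {w for w in counts if w not in frequent and w not in hapaxes}
    return kept, frequent, hapaxes

sample = "the cat and the dog and the bird and a cat saw a dog chase the zyzzyva"
kept, frequent, hapaxes = prune_vocabulary(sample, top_n=2)
print(sorted(frequent))  # ['and', 'the']
print(sorted(hapaxes))   # ['bird', 'chase', 'saw', 'zyzzyva']
print(sorted(kept))      # ['a', 'cat', 'dog']
```

The `hapaxes` set is exactly the part that gets thrown away, which is the part this thread is interested in.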
I thought of LLMs because they're trained on really, really big and vast datasets, datasets that we normally can't access, let alone process on our personal computers (mine is a Linux laptop with 12GB of RAM and a good Core i5, but that's not enough for really big datasets). I mean, there are lots of downloadable datasets on platforms such as Kaggle and Huggingface, as well as internet archives of plain-text articles, books, BBS and so on, but I guess that's just a tiny fraction of the datasets used to train OpenAI's GPT, Meta's Llama and Google's Gemini. And I have a "gut feeling" that somewhere, somehow, those least-mentioned things (words, entire concepts, places, mythological figures and ancient deities, forgotten philosophical nomenclatures and so on) are lurking and waiting to be excavated from beneath these vast depths of datasets.
Maybe the ideal scenario would be having entire datasets and applying parsers and tokenizers to all of them (as the original comment suggested, parsers such as PEG or flex), then cutting out the slice of words/tokens that appeared just once across all of them. For it to work properly, you really need several datasets. For example: I tried it with two versions of the Bible (because it's an example of a long book readily available throughout the Web and ready to be parsed; I used both a JSON containing KJV verses and a JSON containing BBE verses) and I got around 3200 unique occurrences using the "poor man's technique" I described in the other comment (Node.js + Regex + JS dictionary object to count occurrences, not the best of approaches). If I were to add more English versions/translations, maybe this would converge to more specific unique words.
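A rough Python sketch of that "count across several datasets, keep what appears once" idea could look like the following. The two strings are made-up stand-ins, not real verses, and `tokenize` / `hapaxes_across` are hypothetical helper names; a plain regex is used here instead of a flex or PEG grammar.

```python
import re
from collections import Counter

# Runs of Unicode letters, optionally joined by apostrophes (e.g. "don't").
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())

def hapaxes_across(corpora):
    """Return words that occur exactly once across ALL corpora combined."""
    total = Counter()
    for text in corpora.values():
        total.update(tokenize(text))
    return sorted(word for word, n in total.items() if n == 1)

# Made-up stand-ins for two translations of the same passage.
corpora = {
    "version_a": "In the beginning the light was good",
    "version_b": "At the first the light was fair",
}
print(hapaxes_across(corpora))
# ['at', 'beginning', 'fair', 'first', 'good', 'in']
```

Note that "light" occurs only once in each version but twice overall, so it drops out. That is the convergence effect: every added dataset can only remove candidates that are not truly rare, or add new ones from its own vocabulary.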
I don't think LLMs are the right tool for this. They're built to find statistically likely correlations, patterns etc. That's why they tend to give the correct answer (at least to simple questions) and why they produce legible output in the first place. And you want kind of the opposite of that. Something that's unlikely. But that goes against how they work. Of course you can ask them to surprise you and do something unexpected. And they'll try to do something with that. I doubt it'll do anything and I think it's a fundamental limitation. A better tool would be traditional statistics, going through datasets and counting frequencies and you can find your hapax legomena precisely. And I mean some linguists probably already studied this and you could also read their publications...
I've also fooled around with LLMs and in my experience, they don't perform well on uncommon things. If it's barely in their dataset, they'll struggle with the concept and fabricate something. I've never got any correct output from them in those cases.
(And that's not the only fundamental limitation. They also can't count the number of 'r's in strawberry until now, when someone put the correct answer into their dataset. It takes them immense effort to learn maths and do calculations, because they've been built for words. And many other things. And your question is very similar to the 'strawberry' thing. LLMs are known to fail in these cases because of how they work. At least currently.)
I’ve also fooled around with LLMs and in my experience, they don’t perform well on uncommon things. If it’s barely in their dataset, they’ll struggle with the concept and fabricate something.
Me too. They hallucinate. And sometimes I learn things through these hallucinations, when I ask them about an uncommon thing. However, they won't give the uncommon thing, I'm the one who usually feeds the prompt with the uncommon thing for them to hallucinate. Indeed, what I'm seeking is likely the exact opposite of what's expected for LLMs: the extremely uncommon, close to complete hallucination and stochastic behavior.
A better tool would be traditional statistics, going through datasets and counting frequencies and you can find your hapax legomena precisely
I usually do it in a laughable "poor man's way" via Node.js + RegEx + JS key-value dictionary object (whose key is the token and whose value is a number that increases each time this token is found during iteration): downloading some JSON/TXT/CSV dataset, reading and parsing it, then iterating over its tokens. It consumes a lot of memory, time and CPU (though I try to use a sleep/delay between N iterations in order to free the CPU from high loads). I know there are better ways, and a temperature/param-adjustable LLM seemed to me like a better way, hoping that there's some exception across the many LLMs publicly available that wouldn't discard hapaxes.
And I mean some linguists probably already studied this and you could also read their publications
The things I want to discover and learn weren't/aren't so well studied. I mean, human knowledge is a really vast universe of concepts, names and ideas, and some of them got buried by time (sometimes centuries or millennia). Someone has to dig them up because they could hold value, knowledge value. One of my purposes with this inquiry into the unknown is to find these really forgotten ideas and concepts, things never studied before, and try to study them, learn about them. That's how things were rediscovered throughout human history: treasures are buried by the passage of time, a curious person digs them up, and humanity gets to know them once again. And a potential source of knowledge lurking in oblivion is big data, or big datasets.
Is going through text really that computationally expensive? I guess the English language only uses a few thousand words frequently, plus some names and rare words. I'd imagine you can comfortably keep them in RAM next to a counter variable for each bucket. That should allow going through practically any book on earth on a regular computer, if I'm not mistaken. I'm not sure if that's I/O bound or CPU bound, but it shouldn't be that hard. It's something that gets taught in the first 3 semesters of computer science at university.
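As a minimal sketch of what I mean (in Python rather than Node.js, and with made-up helper names `count_words` and `find_hapaxes`): read the text one line at a time and keep only one counter per distinct word. Memory then grows with the size of the vocabulary, not with the size of the file. The `io.StringIO` object below is just a stand-in for a real file.

```python
import io
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")

def count_words(lines):
    """Count words from any iterable of lines, e.g. an open text file."""
    counts = Counter()
    for line in lines:  # only one line is held in memory at a time
        counts.update(WORD_RE.findall(line.lower()))
    return counts

def find_hapaxes(counts):
    """Return the words that were seen exactly once."""
    return sorted(word for word, n in counts.items() if n == 1)

# Stand-in for a real file; for an actual dataset you would pass something
# like open(path, encoding="utf-8") instead.
fake_file = io.StringIO("the quick fox\nthe lazy dog\nthe quick dog\n")
counts = count_words(fake_file)
print(find_hapaxes(counts))  # ['fox', 'lazy']
```

No sleep/delay between iterations should be needed for this kind of loop; the work per line is small and the operating system will schedule other processes anyway.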
Regarding the hallucinations: There are two use-cases: If you want some creative output that doesn't need to be correct, you're fine. You'll be doing art like the people who manipulate electronic children's toys and music instruments to coerce some strange sounds out of them. I think that's called "circuit bending". You could also de-tune the parameters of an LLM, tinker around a bit and mess with the settings. Feed it random garbage prompts and see what it'll do. I guess that's an interesting arts project.
But if you want something that has to do with factuality or needs to be correct, the hallucinations will get in your way. A "hapax legomenon" or unique word is a well-defined (objective) thing. It doesn't really help if the LLM returns some pretend answer. It might look interesting at first, but it won't be a unique word by real-world definition. And that's why I don't think an LLM can help in this case.
I've tried asking it the title and author of some children's story which I heard at a first communion ceremony at church. I tried googling that but all the church pages attribute the story to some random authors. So I tried asking Llama and ChatGPT but they wholeheartedly make something up. I've tried like 20 times but all they return is made up and false. So it doesn't help. And I guess those more contemporary religious books just aren't in any dataset. And the LLM will just do something random in this case. As it'll do with everything that's rare or missing in the dataset (and it can't infer it).
Another thing I did (concerning language) is ask AI about idioms and figures of speech. Initially I did this because I'm not a native English speaker and figures of speech are very nice concepts. They can make your text more flowery or funny, and they always come with some interesting story of origin. But you have to learn and memorize them for later use, because they vary widely from country to country. And LLMs are really good at translating. And they indeed do well with that. And occasionally they'll hallucinate some idiom. Which can be hilarious. It won't be something that fits the definition of the term. But it definitely sparks my creativity at times. At least it makes me laugh.
And writing prose and longer stories with AI also shows their preference for likely things. They always try to push my stories towards some lame and common story arcs. Do super obvious plot twists. And lots of models (not all of them) always push towards resolving story arcs and a happy ending. And it's difficult to impossible to overcome. It tends to get better with their size and "intelligence", but I don't think any of the current LLMs is close to being useful with that.
So summed up: You said in another comment that computational linguistics discards unique words because they have little value and additionally they get in the way. There is probably a reason for that: computer scientists generally aren't stupid and I suppose they tried, and put some thought into it. An LLM just can't make sense of the concept. It needs more training data to learn something. A unique word will just mess with the weights and shift them in some random direction. Likely degrading the LLM in some minuscule way. That's why they discard them. And even if they didn't do it, the LLM couldn't memorize a word if it's only there once. And if you put it into the dataset multiple times, an LLM could learn it... But it won't be unique anymore. So I don't see how it'd work. And also my experience tells me they generally don't do well with rare things.