this post was submitted on 11 Oct 2024

Artificial Intelligence


First of all, let me explain what "hapax legomena" are: words (and, by extension, concepts) that occur only once in an entire corpus of text. An example is the word "hebenon", which occurs just once in Shakespeare's Hamlet. Therefore, "hebenon" is a hapax legomenon. The "hapax legomenon" concept itself is a kind of hapax legomenon, IMO.
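To make the definition concrete, here is a minimal Python sketch that lists the hapax legomena of a text. The function name `hapax_legomena`, the simple regex tokenizer and the toy sample string are my own assumptions for illustration; a real corpus would need proper tokenization and probably lemmatization.

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return the sorted list of words that occur exactly once in `text`."""
    # Naive tokenizer (an assumption): lowercase runs of letters,
    # optionally with one internal apostrophe, e.g. "don't".
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(words)
    # A hapax legomenon is any word whose count is exactly 1.
    return sorted(word for word, count in counts.items() if count == 1)

# Toy sample string, not an exact quotation.
sample = "the juice of cursed hebenon in a vial and the leperous distilment"
print(hapax_legomena(sample))
```

In this toy sample "the" appears twice, so it is left out, while "hebenon" is reported as a hapax.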

According to Wikipedia, hapax legomena are generally discarded from NLP as they hold "little value for computational techniques". By extension, the same applies to LLMs, I guess.

While "hapax legomena" originally refers to words/tokens, I'm extending it to the entire concepts that these extremely obscure words describe.

I am a curious mind, actively seeking knowledge, and I'm constantly trying to learn about a myriad of "random" topics across the many fields of human knowledge, especially rare/unknown concepts (that's how I learnt about "hapax legomena", for example). I use three LLMs on a daily basis (GPT-3, Llama and Gemini), hoping to discover words, historical/mythological figures and concepts unknown to me, lost in the vastness of human knowledge. But I now know, according to Wikipedia, that general LLMs won't point me to anything "obscure" enough.

This leads me to wonder: are there LLMs and/or NLP models/datasets that do not discard hapax legomena? Are there LLMs that favor less frequent data over more frequent data?
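To illustrate what "favoring less frequent data" could mean in the simplest case, here is a rough Python sketch that samples words with probability inversely proportional to their frequency. This is only a toy of my own (the names `rarity_weights` and `sample_rare` are hypothetical), not a description of how any actual LLM or dataset is built.

```python
import random
from collections import Counter

def rarity_weights(words):
    """Map each distinct word to 1 / (its count), so rarer words weigh more."""
    counts = Counter(words)
    return {word: 1.0 / count for word, count in counts.items()}

def sample_rare(words, k, seed=0):
    """Draw k words (with replacement), biased toward the rare ones."""
    weights = rarity_weights(words)
    vocab = sorted(weights)  # fixed order so a given seed is reproducible
    rng = random.Random(seed)
    return rng.choices(vocab, weights=[weights[w] for w in vocab], k=k)

corpus = ["the", "the", "the", "of", "of", "hebenon"]
print(rarity_weights(corpus))
print(sample_rare(corpus, k=5))
```

Here a hapax such as "hebenon" gets weight 1.0, while "the" (three occurrences) gets one third of that, so the hapax is three times as likely to be drawn.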

[–] [email protected] 2 points 2 months ago

just use a word frequency analyzer.

https://wordfrequency.org/