To indulge the thought: I think you could get a fair way using LLMs and vision models (for body language) by looking at the vector-space differences between human languages, plus idiomatic gestures (e.g. two fingers, which mean different things in different cultures) and intrinsic ones (e.g. smiles), lining those up against a large set of annotated animal vocalizations and gestures, then plugging it all into a really hot cup of tea and burning a small forest.
More seriously, I think cetaceans would be a more tractable problem, and perhaps a stepping stone to other animals from there. It wouldn't surprise me if it were already underway; while LLMs aren't good at a lot of the things techbros want them to be good at, they are genuinely good at modelling language.
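For what it's worth, the "vector-space differences" idea has a real precedent: cross-lingual embedding alignment (Mikolov et al.'s translation matrices, later MUSE), where you learn a rotation between two embedding spaces from a handful of anchor pairs and hope the rest of the space lines up. Here's a toy sketch of the core step, orthogonal Procrustes, on entirely synthetic data; every name, shape, and number below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 32

# Toy stand-ins for two embedding spaces over shared "concepts":
# one from human language, one from annotated animal vocalizations.
# Here the target really is a noisy rotation of the source, which is
# the (big) assumption the whole approach leans on.
src = rng.normal(size=(n, d))
hidden_rotation = np.linalg.qr(rng.normal(size=(d, d)))[0]
tgt = src @ hidden_rotation + rng.normal(scale=0.05, size=(n, d))

# A few anchor pairs we trust, e.g. calls reliably annotated with
# contexts ("feeding", "alarm") that also have human-language vectors.
k = 30
X, Y = src[:k], tgt[:k]

# Orthogonal Procrustes: the rotation W minimising ||XW - Y||_F
# has the closed form W = U V^T, where U S V^T = svd(X^T Y).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Sanity check on held-out concepts: map them across and see whether
# nearest-neighbour retrieval recovers the matching target vector.
mapped = src[k:] @ W
d2 = ((mapped[:, None, :] - tgt[None, k:, :]) ** 2).sum(-1)
acc = (d2.argmin(axis=1) == np.arange(n - k)).mean()
print(f"held-out retrieval accuracy: {acc:.2%}")
```

On real data the hard part is everything this sketch assumes away: getting embeddings for animal vocalizations at all, and getting enough trustworthy anchor annotations to fit the map.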
A priest, a rabbi, and a minister walk into a bar.
The bartender says, "That's it, you guys are cut off."