Well yeah, because that’s not how LLMs work. They generate text that conforms to the statistical relationships between words (tokens, really) that were learned during training, i.e. patterns extracted from all the data the model was trained on. They don’t have any kind of built-in logic and they don’t know things. They literally just navigate a complex web of relationships between words, using the prompt as a guide, and produce whatever continuation is statistically likely given the text they were trained on.
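To make the "word-relationship statistics" idea concrete, here is a deliberately tiny sketch: a bigram model that counts which word follows which and then samples from those counts. This is an illustration only, not how a real LLM is built. Real models use neural networks over tokens and much longer context. The corpus and the names `train_bigrams` and `generate` are made up for this example.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_words=10, seed=0):
    """Extend the prompt by repeatedly sampling a statistically likely next word."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # nothing ever followed this word in training
        candidates, weights = zip(*followers.items())
        # Sample in proportion to how often each follower was seen.
        words.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(words)


# Hypothetical toy "training data".
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
counts = train_bigrams(corpus)
print(generate(counts, "the cat"))
```

The output can look like a plausible sentence, but every word was picked only because it followed the previous word somewhere in the corpus. Nothing in the code checks whether the result is true or makes sense.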
TL;DR: It’s an illusion. You don’t need to run experiments to realize this; you just need to understand how AI/ML works.