The BBC asked ChatGPT, Copilot, Gemini and Perplexity to summarise 100 news stories and rated each answer. […] It found 51% of all AI answers to questions about the news were judged to have significant issues of some form. […] 19% of AI answers which cited BBC content introduced factual errors, such as incorrect factual statements, numbers and dates.
This reminds me that I basically stopped using LLMs for any summarization after this exact thing happened to me. I realized that without reading the text myself, I had no way of knowing whether the output covered all the relevant information or whether it contained made-up details.
So why use them for something you can't objectively evaluate? Judging by Apple's garbage generator, LLMs aren't even good at summarising.
For reference:
AI chatbots unable to accurately summarise news, BBC finds