I wanted to extract some crime statistics broken down by type of crime and by population group, all normalized by population size, of course. I got a nice set of tables summarizing the data for each year I requested.

When I shared these summaries, I was told they are entirely unreliable due to hallucinations. So my question to you is: how common is this problem?

I compared the output from Chat GPT-4, Copilot, and Grok, and the results are the same (Gemini says the data is unavailable, btw :)

So are LLMs reliable for research like that?

  • DavidGarcia@feddit.nl
    2 months ago

    If the generation temperature is non-zero (which it often is), there is inherent randomness in the output. So even if the first number in a statistic should be 1, sometimes it will just randomly pick another plausible number. Even if the network always gives the correct token the highest probability, it's basically rolling weighted dice for every token to make answers more creative, so a lower-probability token still gets picked some of the time.
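
    To make the randomness concrete, here is a minimal sketch of temperature sampling over a single token. The three candidate tokens and their scores are made up for illustration, and real models do this over a vocabulary of many thousands of tokens, so treat it as a toy model rather than how any particular LLM is implemented.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Pick the next token: greedy at temperature 0, weighted random draw otherwise."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical scores for the first digit of a statistic; "1" is the correct token.
tokens = ["1", "2", "7"]
logits = [3.0, 1.5, 1.0]

rng = random.Random(0)
greedy = [sample_token(tokens, logits, 0, rng) for _ in range(1000)]
sampled = [sample_token(tokens, logits, 1.0, rng) for _ in range(1000)]

print("temperature 0:", greedy.count("1"), "of 1000 correct")
print("temperature 1:", sampled.count("1"), "of 1000 correct")
```

    With these made-up scores the correct digit has roughly a 74% chance per draw at temperature 1, so about a quarter of the draws come out wrong even though the model "knows" the right answer. In a table with dozens of numbers, each made of several tokens, those odds compound.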

    That’s on top of hoping the LLM has even seen that data during training AND managed to memorize it AND that the network just happens to be able to reproduce the correct data given your prompt (it might not be able to for a different prompt).

    If you want any reliability at all, you need to use RAG (retrieval-augmented generation) AND you have to double-check every reference it quotes yourself (if it even has that capability).
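
    Part of that double-checking can be automated. The sketch below flags every number in an answer that does not appear anywhere in the source text it supposedly came from. The function name `unsupported_numbers` and both example strings are hypothetical, and the check is deliberately crude: it assumes plain digits without thousands separators.

```python
import re

# Matches integers and simple decimals such as "2019" or "412.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def unsupported_numbers(answer, source_text):
    """Return numbers that appear in the answer but nowhere in the source text."""
    source_numbers = set(NUMBER.findall(source_text))
    return [n for n in NUMBER.findall(answer) if n not in source_numbers]

# Made-up example strings, not real statistics.
source_text = "Region A recorded 412.5 incidents per 100000 residents in 2019."
answer = "In 2019, Region A had 421.5 incidents per 100000 residents."

print(unsupported_numbers(answer, source_text))
```

    Here the transposed digits in 421.5 get flagged. A number that passes this check can still be wrong, for example a real figure attached to the wrong year or category, so it narrows down what to read by hand rather than replacing the manual check.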

    Even if it has all the necessary information to answer correctly in its context window, it can still answer incorrectly.

    None of the current models are anywhere close to producing trustworthy output 100% of the time.