I wanted to extract some crime statistics broken down by type of crime and by population group, all normalized by population size, of course. I got a nice set of tables summarizing the data for each year I requested.
When I shared these summaries, I was told they are entirely unreliable due to hallucinations. So my question to you is: how common a problem is this?
I compared results from ChatGPT-4, Copilot, and Grok, and they all agree (Gemini says the data is unavailable, btw :))
So, are LLMs reliable for research like that?
Definitely a common problem. The other thing to consider is what you're using it for. Is it professional work? Then it's not reliable enough. Is it to understand things a bit better? Hard to say whether it's reliable enough even for that, but it's heavily biased, just as any source might be, so you have to take that into account.
I don’t have the experience to tell you how to suss out its biases. Sometimes you can push it in one direction or another with your wording, or with follow-up questions. Hallucinations are a concern, but not the only one: cherry-picking, lack of expertise, the bias of the company behind the LLM, what data the LLM was trained on, and so on.
I have a hard time pinning down what a good way to double-check an LLM is. I think this is a skill we're currently learning, the way we learned to suss out the bias in a headline or an article from its author, publication, or platform. With LLMs it feels fuzzier right now, and on certain topics they may be less reliable than on others. Anyway, that's my ramble on the issue. Wish I had a better answer, if only I could ask someone smarter than me.
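One practical way to sidestep hallucinated statistics entirely is to ask the LLM only to point you to the primary source, then pull the raw counts yourself and do the per-capita normalization in a few lines of code. Here's a minimal sketch; the dataset, city, and all figures below are made up purely for illustration (real counts would come from an official source such as a national crime-data portal):

```python
# Hypothetical example: normalize raw crime counts by population yourself
# instead of trusting an LLM-generated summary table.
# All numbers below are invented for illustration only.

def rate_per_100k(count: int, population: int) -> float:
    """Incidents per 100,000 residents."""
    return count / population * 100_000

# Made-up raw data: {crime_type: incident_count} for one year
raw_counts = {"burglary": 1_250, "robbery": 430}
population = 850_000  # hypothetical city population

rates = {crime: round(rate_per_100k(n, population), 1)
         for crime, n in raw_counts.items()}
print(rates)  # {'burglary': 147.1, 'robbery': 50.6}
```

The arithmetic is trivial, which is exactly the point: the fragile step is the data retrieval, not the normalization, so keeping the numbers out of the LLM's hands removes the main opportunity for hallucination.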
Oh, here’s GPT-4o’s take.
When considering the accuracy and biases of large language models (LLMs) like GPT, there are several key factors to keep in mind:
1. Training Data and Biases
2. Accuracy and Hallucinations
3. Context and Ambiguity
4. Updates and Recency
5. Mitigating Biases and Ensuring Accuracy
6. Ethical Considerations
In summary, while LLMs can provide valuable assistance in generating text and answering queries, their accuracy is not guaranteed, and their outputs may reflect biases present in their training data. Users should use them as tools to aid in tasks, but not as infallible sources of truth. It is essential to apply critical thinking and, when necessary, consult additional reliable sources to verify information.