LLM Evaluation — Everything You Need To Know!
In the last few years, LLMs have taken the field of AI by storm with their remarkable ability to generate human-like text. But the question remains: how good, or even how valid, is the generated text?
In this article, we primarily focus on the assessment of LLMs.
What is LLM evaluation?
LLMs are trained to generate text for a specific task, so we evaluate the generated text on that task using different metrics. These tasks can range from answering fact-based questions to summarizing a given document.
Example
Suppose I have an LLM fine-tuned for QA tasks on the topic of finance, and it is only supposed to answer finance questions.
For instance, if I ask the question, “What is a stock market?”, the LLM’s answer must talk about trading shares on a platform, or something close to it. But if the LLM answers about a “market where fruits are sold,” its answer relevance is poor. Similarly, there are many ways an LLM can go wrong, from getting fact-based questions wrong to generating harmful or policy-violating content.
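To make this concrete, here is a minimal sketch of how such a relevance check might be automated by comparing embeddings of the question and the generated answer. It assumes the sentence-transformers library, the all-MiniLM-L6-v2 model, and an illustrative 0.5 threshold; none of these choices are prescribed by any particular evaluation framework.

```python
# A minimal sketch of a topical relevance check via embedding similarity.
# Assumption: sentence-transformers is installed; model name and the
# 0.5 threshold are illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

question = "What is a stock market?"
good_answer = "A stock market is a platform where shares of companies are traded."
bad_answer = "A market is a place where fruits and vegetables are sold."

q_emb = model.encode(question, convert_to_tensor=True)

for answer in (good_answer, bad_answer):
    a_emb = model.encode(answer, convert_to_tensor=True)
    score = util.cos_sim(q_emb, a_emb).item()  # cosine similarity in [-1, 1]
    verdict = "relevant" if score >= 0.5 else "off-topic"
    print(f"{score:.2f} -> {verdict}: {answer}")
```

Cosine similarity is only a rough proxy; dedicated evaluation frameworks typically use an LLM judge or a trained scoring model for this kind of check.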
In conclusion, the evaluation metrics change based on the task!
LLM Evaluation Metrics
Unlike traditional machine learning, evaluating an LLM’s output is not straightforward. For instance, asking an LLM to write a story “about a dog and a cat” can yield a wide range of responses, and evaluating such a broad range of responses requires specialized custom metrics.
However, before we jump into custom metrics for LLM evaluation, let us talk about the common metrics used for evaluating LLMs (a small sketch of one such check follows the list):
- Answer relevancy
- Hallucination
- Context relevancy
- Ethical metrics
- Task-specific metrics
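As a taste of how one of these metrics could be approximated, the sketch below scores hallucination with a crude faithfulness proxy: the fraction of the answer’s content words that also appear in the source context. The helper functions, example texts, and the idea of using token overlap are assumptions for illustration; production metrics usually rely on NLI models or LLM-as-judge prompts instead.

```python
# A naive hallucination proxy: what fraction of the answer's content words
# are supported by the source context? Purely illustrative.
import re

def content_tokens(text: str) -> set[str]:
    """Lowercase word tokens, ignoring very short (stop-like) words."""
    return {t for t in re.findall(r"[a-z']+", text.lower()) if len(t) > 3}

def faithfulness_score(answer: str, context: str) -> float:
    """Share of the answer's content words that also occur in the context (0.0 to 1.0)."""
    answer_tokens = content_tokens(answer)
    if not answer_tokens:
        return 1.0  # nothing substantive claimed, so nothing to hallucinate
    context_tokens = content_tokens(context)
    return len(answer_tokens & context_tokens) / len(answer_tokens)

context = ("A stock market is a venue where investors buy and sell "
           "shares of publicly listed companies.")
answer = "The stock market lets investors trade shares of listed companies."

score = faithfulness_score(answer, context)
print(f"faithfulness ~ {score:.2f}")  # a low score hints at unsupported (hallucinated) content
```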
Moreover, the metrics mentioned above are useful only if they satisfy the following…