Post Determinism: Correctness and Confidence

In traditional software development, correctness is binary: a function either works correctly or it doesn’t. A sort algorithm either orders the list properly or fails. A database query either returns the right records or doesn’t. In the age of AI and large language models, correctness more closely mirrors life and exists on a spectrum. When an AI generates a response, we can't simply assert a correct result. Instead, we must ask "How confident are we in this response?" and "What does correctness even mean in this context?"

Before we explore an implementation of confidence scoring in an application that uses large language models to summarize vast amounts of data, let’s look at a few of the most common scoring metrics.

Token probabilities and top-k most likely tokens are common metrics under the umbrella of model-reported confidence. Given the following prompt:

# Illustrative helpers: compute_token_scores returns the model's
# next-token probabilities, and top_k_tokens keeps the k most
# likely candidates.
prompt = 'Roses are red, violets are'
token_scores = compute_token_scores(model, prompt)
score_map = top_k_tokens(model, token_scores, k=3)
print(score_map)

The output might be:

{'▁Blue': 0.0003350259, '▁purple': 0.0009047602, '▁blue': 0.9984743}

In this example, the model assigns an extremely high probability to ▁blue as the next token (after applying softmax normalization, which converts raw model outputs into probabilities that sum to 1). We could say that the model’s confidence is high; however, there are caveats and limitations. We are only estimating the probabilities of a handful of candidate tokens, and in some cases none of the top-k tokens may be likely or relevant. While model-reported confidence plays a role in the overall evaluation of output, external validation is needed to ensure accuracy.
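To make the softmax step concrete, here is a minimal, self-contained sketch that converts raw logits into normalized probabilities and selects the top-k candidates. The logit values below are invented for illustration; a real model produces one logit per vocabulary entry.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(x - m) for tok, x in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_k(probs, k=3):
    """Return the k most likely tokens, in ascending probability order."""
    return dict(sorted(probs.items(), key=lambda kv: kv[1])[-k:])

# Hypothetical raw logits for the next token after "violets are"
logits = {"▁blue": 9.2, "▁purple": 2.2, "▁Blue": 1.2, "▁red": 0.5}
probs = softmax(logits)
print(top_k(probs, k=3))
```

Because softmax exponentiates the gap between logits, even a modest lead in raw score translates into a probability near 1.0 for the winning token, which is exactly the pattern in the output above.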

One powerful approach to external validation is semantic similarity: assessing how closely the generated answer resembles a ground-truth answer. In short, semantic similarity involves comparing a known accurate answer to the LLM’s output for a given question. The more closely the two align, the more accurate the answer. Using a tool like Ragas, we can easily calculate this score.

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SemanticSimilarity
from ragas.embeddings import LangchainEmbeddingsWrapper

sample = SingleTurnSample(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
)

# evaluator_embedding is any LangChain-compatible embeddings model,
# e.g. OpenAIEmbeddings() from langchain_openai.
scorer = SemanticSimilarity(embeddings=LangchainEmbeddingsWrapper(evaluator_embedding))
# Run inside an async function (or a notebook cell, which allows top-level await).
await scorer.single_turn_ascore(sample)

Output

0.8151371879226978

A score of 0.81 indicates strong semantic similarity: the embedded ground-truth answer closely resembles the response, which captures most of the key information but omits the height detail. While this provides a more reliable scoring method than inspecting top-k token probabilities, it requires a dataset of questions and reference answers, and that dataset must closely resemble your production data. For example, calculating semantic similarity on a set of questions about English grammar won’t give insight into model performance if the real-world use is a customer service agent for a bank.
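Under the hood, a score like this is typically the cosine similarity between the embedding vectors of the response and the reference. A minimal sketch, using toy four-dimensional vectors rather than real model embeddings (which have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" of a response and a reference answer (invented values).
response_vec  = [0.9, 0.1, 0.4, 0.2]
reference_vec = [0.8, 0.2, 0.5, 0.1]
print(cosine_similarity(response_vec, reference_vec))
```

Because the two toy vectors point in nearly the same direction, the score lands close to 1.0; answers about unrelated topics would embed in very different directions and score far lower.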

In practice, developers use a combination of these techniques to evaluate model performance; this multi-faceted approach helps mitigate the limitations of any single confidence metric. Limiting the scope of input also allows for simpler and more accurate scoring: results for an application that summarizes highly structured data from a single knowledge domain will be more reliable than those for an application that accepts any possible unstructured user input. For example, a medical diagnosis system focused solely on radiology reports will likely achieve higher confidence scores than a general-purpose medical chatbot. This is one reason that implementing a series of “agents,” each addressing a specific, well-defined problem domain and testable separately, is becoming a popular approach. Ensuring input quality while maintaining domain specificity and low task complexity is the most direct way to obtain high-quality output with minimal effort.
