Post Determinism: Correctness and Confidence
In traditional software, correctness is binary—code either works or it doesn't. But in the age of AI, we've entered a new paradigm where correctness exists on a spectrum. Discover how modern developers measure and validate AI system outputs, from token probabilities to semantic similarity scores, and learn why narrow, focused AI agents often outperform their general-purpose counterparts.
In traditional software development, correctness is binary: a function either works correctly or it doesn't. A sort algorithm either orders the list properly or fails. A database query either returns the right records or doesn't. In the age of AI and large language models, correctness more closely mirrors life: it exists on a spectrum. When an AI generates a response, we can't simply assert a correct result. Instead, we must ask "How confident are we in this response?" and "What does correctness even mean in this context?"
Before we explore an implementation of confidence scoring in an application that uses Large Language Models to summarize vast amounts of data, let’s look at a few of the most common scoring metrics.
Token probabilities and top-k most likely tokens are common metrics under the umbrella of model-reported confidence. Given the following prompt:
# Illustrative helpers (not a specific library's API): score each candidate
# next token for the prompt, then keep the k most likely candidates.
prompt = 'Roses are red, violets are'
token_scores = compute_token_scores(model, prompt)
score_map = top_k_tokens(model, token_scores, k=3)
print(score_map)
The output might be:
{'▁Blue': 0.0003350259, '▁purple': 0.0009047602, '▁blue': 0.9984743}
In this example, the model assigns an extremely high probability to ▁blue as the next token (after applying softmax normalization, which converts raw model outputs into probabilities that sum to 1). We could say that the model's confidence is high; however, there are some caveats and limitations. We are only estimating the probabilities of a handful of candidate tokens, and in some cases none of the top-k tokens may be likely or relevant. While model-reported confidence plays a role in the overall evaluation of output, it is clear that external validation is needed to ensure accuracy.
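To make this reproducible, here is a minimal sketch of how those helpers might be implemented with the Hugging Face transformers library (an assumption on my part; the snippet above uses illustrative helper functions, and the exact tokens and probabilities will vary by model and tokenizer):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal language model from the Hugging Face hub works for this sketch.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Roses are red, violets are"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Softmax over the final position converts raw logits into next-token probabilities.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Top-k most likely next tokens and their probabilities.
top = torch.topk(next_token_probs, k=3)
score_map = {tokenizer.decode(idx.item()): p.item() for idx, p in zip(top.indices, top.values)}
print(score_map)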
One powerful approach to external validation is semantic similarity: an assessment of how closely the generated answer resembles the ground truth. In short, semantic similarity involves comparing a known accurate answer to the LLM's output for a given question. The more closely the two align, the more accurate the answer. Using a tool like Ragas, we can easily calculate this score.
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SemanticSimilarity
from ragas.embeddings import LangchainEmbeddingsWrapper

# evaluator_embedding is any LangChain-compatible embeddings instance
# (for example, OpenAIEmbeddings()) and must be defined before building the scorer.
sample = SingleTurnSample(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
)

scorer = SemanticSimilarity(embeddings=LangchainEmbeddingsWrapper(evaluator_embedding))
await scorer.single_turn_ascore(sample)  # await requires an async context (or use asyncio.run)
Output
0.8151371879226978
A score of 0.81 indicates strong semantic similarity: the model's response captures most of the key information from the reference answer, though it omits the height detail. In this example, the embedded ground-truth answer closely resembled the response. While this provides a more meaningful scoring method than inspecting top-k token probabilities, it requires a dataset of questions and answers, and that dataset must closely resemble your production data. For example, calculating semantic similarity on a database of questions about English grammar won't give insight into model performance if its real-world use will be as a customer service agent for a bank.
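Under the hood, this kind of metric embeds both strings and compares the resulting vectors, typically with cosine similarity. The sketch below is my own illustration of that idea using sentence-transformers, not Ragas internals; the exact score depends on the embedding model you choose:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Any sentence-embedding model can stand in here; this one is a common lightweight default.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

response = "The Eiffel Tower is located in Paris."
reference = "The Eiffel Tower is located in Paris. It has a height of 1000ft."

# Embed both texts and compare the vectors with cosine similarity.
vectors = embedder.encode([response, reference])
score = cos_sim(vectors[0], vectors[1]).item()
print(round(score, 4))  # values close to 1.0 indicate strong semantic overlap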
In practice, developers use a combination of these techniques to evaluate model performance. This multi-faceted approach helps mitigate the limitations of any single confidence metric. Limiting the scope of input also allows for simpler and more accurate scoring: results for an application designed to summarize highly structured data from a single knowledge domain will be more reliable than those from an application designed to take any possible unstructured user input. For example, a medical diagnosis system focused solely on radiology reports will likely achieve higher confidence scores than a general-purpose medical chatbot. This is one of the reasons that implementing a series of "agents," each addressing a specific, well-defined problem domain and each testable in isolation, is becoming a popular approach. Ensuring input quality while maintaining domain specificity and low task complexity is the most direct way to obtain high-quality output with minimal effort.
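As an illustration of combining these signals, the sketch below gates a response on both model-reported confidence and semantic similarity before it reaches the user. The helper name and thresholds are hypothetical and would need to be tuned against your own evaluation data:

def accept_response(token_confidence: float,
                    similarity_score: float,
                    token_threshold: float = 0.80,
                    similarity_threshold: float = 0.75) -> bool:
    """Accept a generated answer only if both signals clear their (illustrative) thresholds."""
    return (token_confidence >= token_threshold
            and similarity_score >= similarity_threshold)

# A response with high token probability but a weak semantic match is rejected,
# routing the request to a fallback such as a retry or human review.
if not accept_response(token_confidence=0.95, similarity_score=0.60):
    print("Low confidence: escalate to fallback handling")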
Prompt Engineering: Art and Science
Effective prompt engineering is an art as much as it is a science. Programmers can ensure quality LLM output in their apps by following established prompting frameworks.
As artificial intelligence becomes more deeply integrated into business operations, the art and science of prompt engineering is emerging as an essential skill among knowledge workers. Understanding how to get the most out of large language models will quickly become a competitive differentiator that gives these employees a significant edge in the workplace. As AI adoption accelerates, businesses will increasingly invest in and value prompt engineering expertise.
Large language models are highly probabilistic. Given the same prompt, the model might not always produce the same response, especially when randomness is introduced through parameters like temperature and top-k sampling. While this probabilistic nature helps generate diverse and creative outputs, many business use cases require consistency, reliability, and precision. Sound prompt engineering does not eliminate AI’s probabilistic nature but strategically narrows the range of outputs, making responses more predictable and valuable for structured applications.
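To make this concrete, most chat-completion APIs expose these sampling parameters directly. The sketch below assumes the OpenAI Python SDK and an illustrative model name; lowering temperature narrows the distribution of outputs, though it does not make them fully deterministic:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "You answer in one short sentence."},
        {"role": "user", "content": "Summarize the purpose of a unit test."},
    ],
    temperature=0,  # near-greedy decoding: far less variation between runs
    top_p=1,        # keep the full nucleus; temperature does the narrowing here
)
print(completion.choices[0].message.content)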
COSTAR
The COSTAR prompt framework provides a structured approach to prompting that ensures the key data points that influence an LLM’s response are provided to the model:
- Context: Provide background information that helps the LLM understand the specific scenario.
- Objective: Clearly define the task to focus the LLM's output.
- Style: Specify the writing style the response should have.
- Tone: Specify the tone the response should take (motivational, friendly, etc.).
- Audience: Identify who will be consuming the LLM's output.
- Response: Provide a specific response format (text, JSON, etc.).
Below is an example of a system prompt for a summarization analysis assistant:
# CONTEXT
You are a precision-focused text analysis system designed to evaluate summary accuracy. You analyze both the original text and its summary to determine how well the summary captures the essential information and meaning of the source material.
# OBJECTIVE
Compare an original text with its summary to:
1. Calculate a similarity score between 0.00 and 1.00 (where 1.00 represents perfect accuracy)
2. Provide clear reasoning for the score
3. Identify specific elements that influenced the scoring
# STYLE
Clear, precise, and analytical, focusing on concrete examples from both texts to support the evaluation.
# TONE
Objective and factual, like a scientific measurement tool.
# AUDIENCE
Users who need quantitative and qualitative assessment of summary accuracy, requiring specific numerical feedback.
# RESPONSE FORMAT
Output should be structured as follows:
1. Accuracy Score: [0.00-1.00]
2. Score Explanation:
- Key factors that raised the score
- Key factors that lowered the score
- Specific examples from both texts to support the assessment
3. Brief conclusion summarizing the main reasons for the final score
**NOTE:** Always maintain score precision to two decimal places (e.g., 0.87, 0.45, 0.92)
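If you assemble prompts like this programmatically, a small helper can keep the six COSTAR sections explicit. The function below is a simple sketch of that idea (the section headers mirror the example above) rather than a prescribed format:

def build_costar_prompt(context: str, objective: str, style: str,
                        tone: str, audience: str, response: str) -> str:
    """Assemble a system prompt from the six COSTAR sections."""
    sections = {
        "CONTEXT": context,
        "OBJECTIVE": objective,
        "STYLE": style,
        "TONE": tone,
        "AUDIENCE": audience,
        "RESPONSE FORMAT": response,
    }
    return "\n\n".join(f"# {name}\n{text}" for name, text in sections.items())

system_prompt = build_costar_prompt(
    context="You are a precision-focused text analysis system designed to evaluate summary accuracy.",
    objective="Compare an original text with its summary and score the summary's accuracy.",
    style="Clear, precise, and analytical.",
    tone="Objective and factual, like a scientific measurement tool.",
    audience="Users who need quantitative and qualitative assessment of summary accuracy.",
    response="1. Accuracy Score  2. Score Explanation  3. Brief conclusion",
)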
Structured Outputs
Our example above leaves the exact response format up to the model. This strategy works well for a text-based chatbot, but what if we want to use the API to retrieve data that our application will consume? Any break in the expected format will result in a parsing error and cause our program to throw an exception. Defining an output structure for the model provides two main advantages:
- Type-safety: Validation of the response format and data types is not required.
- Simplified Prompting: No need to precisely explain data formats and/or provide examples to ensure proper response format.
I created an object named accuracy_score with three properties, each representing one of our requested outputs.
{
"name": "accuracy_score",
"schema": {
"type": "object",
"properties": {
"score": {
"type": "number",
"description": "The accuracy score as a float ranging from 0.00 to 1.00."
},
"score_explanation": {
"type": "string",
"description": "A description or explanation of the accuracy score."
},
"conclusion": {
"type": "string",
"description": "A concluding statement based on the accuracy score."
}
},
"required": [
"score",
"score_explanation",
"conclusion"
],
"additionalProperties": false
},
"strict": true
}
I can easily reference my schema within my application by defining a response format sent with each request. Any request referencing my response format is now guaranteed to be correct in type and format. My app can always rely on accurate data when retrieving the values of the score, score_explanation, and conclusion properties.
response_format: { "type": "json_schema", "json_schema": … }
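As a concrete sketch of that request (assuming the OpenAI Python SDK, an illustrative model name, a hypothetical schema file, and the COSTAR system prompt from earlier), the call might look like this:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The accuracy_score definition shown above, stored alongside the application code.
with open("accuracy_score_schema.json") as f:  # hypothetical file name
    accuracy_score_schema = json.load(f)

original_text = "..."  # the source document being summarized
summary_text = "..."   # the model-generated summary to evaluate

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": system_prompt},  # the COSTAR prompt from earlier
        {"role": "user", "content": f"Original:\n{original_text}\n\nSummary:\n{summary_text}"},
    ],
    response_format={"type": "json_schema", "json_schema": accuracy_score_schema},
)

# The response body is constrained to the schema, so parsing needs no defensive format checks.
result = json.loads(completion.choices[0].message.content)
print(result["score"], result["conclusion"])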