
The Path to Profitability

The next trillion-dollar tech company won't be built by creating the best AI model - it will be built by controlling how, where, and why people use AI.

The next trillion-dollar tech company won't be built by creating the best AI model - it will be built by controlling how, where, and why people use AI. Just as Microsoft didn't win the PC era by making the best chips, today's AI leaders are racing not just to build better models, but to become the indispensable platform through which AI is accessed and deployed. Last quarter, over 50% of all venture capital funding went to AI-focused companies, totaling more than $60 billion of investment. The rate at which VC firms are pouring money into artificial intelligence raises an important question: what is the path to profitability for companies like OpenAI?

AI startups have a relatively straightforward revenue model – subscription-based pricing for access to cutting-edge models and usage-based pricing for access to the API. Extremely high R&D and capital costs make attracting more paid users a requirement, but doing so also increases marginal inference costs. Therefore, paid users must offset their own marginal costs while subsidizing the costs of free users before beginning to chip away at the massive fixed costs that investors are currently covering. How can revenue grow fast enough to cover those costs without the continued assistance of external capital?

OpenAI and the other labs creating foundation models have no competitive moat and few products that fit wide swaths of the market. They have APIs that allow developers to create products, but that further commoditizes their main strength.

The Moat of the Personal Computer Revolution

Let’s look at the first transformational change in modern computing – the personal computer revolution – to see how our current moment might play out. Much like our foundation models today, the integrated circuit and the transistor were commodities. The moat that locked in decades of Windows/Intel hegemony was Microsoft’s platform deals.

It wasn’t the hardware itself that dictated dominance, but the strategic positioning of software as a control point—Windows became the layer through which users interacted with computing, and Intel became the default engine beneath. The real power lay not in the invention, but in the ecosystem built around it: developer tools, third-party software, enterprise integrations, and, most critically, distribution deals that ensured Windows shipped on nearly every PC.

Foundation models may be the new transistors—powerful, essential, but increasingly commoditized. The question becomes: who builds the new “Windows” for AI? Will it be a developer platform, a ubiquitous interface layer, or a vertically integrated product experience? The companies that succeed in building sticky platforms around foundation models—whether through proprietary data, user workflows, or ecosystem lock-in—may become the long-term leaders in artificial intelligence.

It’s worth noting that OpenAI has been adding features typically reserved for subscribers to its API at an accelerating pace. They’re also adding features to the API that make it more difficult for developers to switch model providers. Look no further than how the new Responses API compares to the original Chat Completions API. The new API is stateful, which makes it both easier to build solutions with and far more difficult to swap out for another provider’s API.
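
As a rough sketch of what that stickiness looks like in practice (the model name and prompts are illustrative, and the snippet assumes the official openai Python SDK):

from openai import OpenAI

client = OpenAI()

# Chat Completions is stateless: the caller re-sends the full message history
# with every request, which also makes it easy to point at another provider.
history = [{"role": "user", "content": "Summarize the PC platform wars."}]
first = client.chat.completions.create(model="gpt-4o", messages=history)
history += [
    {"role": "assistant", "content": first.choices[0].message.content},
    {"role": "user", "content": "How does that map to AI today?"},
]
followup = client.chat.completions.create(model="gpt-4o", messages=history)

# The Responses API is stateful: the server keeps the conversation, and the
# follow-up only references the previous response ID. Convenient, but that
# state now lives with a single provider.
resp = client.responses.create(model="gpt-4o", input="Summarize the PC platform wars.")
resp2 = client.responses.create(
    model="gpt-4o",
    previous_response_id=resp.id,
    input="How does that map to AI today?",
)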

But there’s another layer to this. We’re seeing the early signs of this platform consolidation. Just as Microsoft leveraged pre-installation deals and developer incentives to make Windows indispensable, today’s AI leaders are racing to become the default platform on which others build. Microsoft’s investment in OpenAI, Anthropic’s partnership with Amazon, and Google’s embedding of Gemini into its product suite all mirror those earlier moves—where distribution, not just innovation, becomes the true competitive edge. The players who control not just the models, but the channels of user interaction—browsers, operating systems, productivity tools, search, or even chip infrastructure—are best positioned to own the AI era.

We’re entering a phase where control over context becomes as important as control over compute. Just as Windows became the context for productivity and the web browser the context for search, whoever owns the AI context—how, where, and why users invoke intelligence—will shape the future. That’s the real platform play.


Post-Determinism: Correctness and Confidence

In traditional software, correctness is binary—code either works or it doesn't. But in the age of AI, we've entered a new paradigm where correctness exists on a spectrum. Discover how modern developers measure and validate AI system outputs, from token probabilities to semantic similarity scores, and learn why narrow, focused AI agents often outperform their general-purpose counterparts.

In traditional software development, correctness is binary: a function either works correctly or it doesn’t. A sort algorithm either orders the list properly or fails. A database query either returns the right records or doesn’t. In the age of AI and large language models, correctness more closely mirrors life and exists on a spectrum. When an AI generates a response, we can't simply assert a correct result. Instead, we must ask "How confident are we in this response?" and "What does correctness even mean in this context?"

Before we explore an implementation of confidence scoring in an application that uses Large Language Models to summarize vast amounts of data, let’s look at a few of the most common scoring metrics.

Token probabilities and top-k most likely tokens are common metrics under the umbrella of model-reported confidence. Given the following prompt:

# Illustrative pseudocode: compute_token_scores and top_k_tokens are stand-in
# helpers for reading the model's next-token probabilities.
prompt = 'Roses are red, violets are'
token_scores = compute_token_scores(model, prompt)  # probability scores for the next token
score_map = top_k_tokens(model, token_scores, k=3)  # keep the k most likely candidates
print(score_map)

The output might be:

{'▁Blue': 0.0003350259, '▁purple': 0.0009047602, '▁blue': 0.9984743}

In this example, the model assigns an extremely high probability to ▁blue as the next token (after applying softmax normalization, which converts raw model outputs into probabilities that sum to 1). We could say that the model’s confidence is high; however, there are caveats and limitations. We are only looking at the probabilities of the top-k candidate tokens, and in some cases none of them may be likely or relevant. While model-reported confidence plays a role in the overall evaluation of output, it is clear that external validation is needed to ensure accuracy.
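
For readers who want to reproduce this kind of measurement locally, here is a minimal sketch using Hugging Face transformers; the gpt2 checkpoint is simply an illustrative stand-in for the helper functions above.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a small causal language model purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Roses are red, violets are", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # raw scores for the next token
probs = torch.softmax(logits, dim=-1)       # normalize into probabilities

top = torch.topk(probs, k=3)
score_map = {tokenizer.decode(t): p.item() for t, p in zip(top.indices, top.values)}
print(score_map)  # the three most likely next tokens and their probabilities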

One powerful approach to external validation is semantic similarity: assessing how closely the generated answer resembles a ground-truth reference. In short, semantic similarity involves comparing a known accurate answer to the LLM’s output for a given question. The more closely the two align, the more accurate the answer. Using a tool like Ragas, we can easily calculate this score.

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SemanticSimilarity
from ragas.embeddings import LangchainEmbeddingsWrapper

sample = SingleTurnSample(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft."
)

# evaluator_embedding is any LangChain embeddings instance (e.g. OpenAIEmbeddings());
# the await call below must run inside an async function or a notebook cell.
scorer = SemanticSimilarity(embeddings=LangchainEmbeddingsWrapper(evaluator_embedding))
await scorer.single_turn_ascore(sample)

Output

0.8151371879226978

A score of 0.81 indicates strong semantic similarity: the response captures most of the key information from the reference answer, though it omits the height detail. While this provides a more meaningful measure than raw token probabilities, it requires a dataset of reference questions and answers, and that dataset must closely resemble your production data. For example, calculating semantic similarity on a set of questions about English grammar won’t give insight into model performance if the real-world use is a customer service agent for a bank.

In practice, developers use a combination of these techniques to evaluate model performance; this multi-faceted approach helps mitigate the limitations of any single confidence metric. Limiting the scope of input also allows for simpler and more accurate scoring – results for an application that summarizes highly structured data from a single knowledge domain will be more reliable than those from an application that accepts any possible unstructured user input. For example, a medical diagnosis system focused solely on radiology reports will likely achieve higher confidence scores than a general-purpose medical chatbot. This is one of the reasons that implementing a series of “agents”, each addressing a specific, well-defined problem domain and each testable in isolation, is becoming a popular approach. Ensuring input quality while maintaining domain specificity and low task complexity is the most direct way to obtain high-quality output with minimal effort.
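
As a deliberately simple illustration of combining these signals, a hypothetical acceptance gate might look like the following; the threshold values are placeholders that would be tuned against domain-specific evaluation data.

def accept_output(token_confidence: float, semantic_score: float) -> bool:
    """Accept a generated answer only when both confidence signals clear
    their (illustrative) thresholds; otherwise route to a fallback or a human."""
    return token_confidence >= 0.80 and semantic_score >= 0.75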


Post-Determinism: The End of Predictable Computing

The infusion of AI and LLM capabilities into software development is driving a paradigm shift on multiple levels. Design patterns are evolving – from harnessing in-context learning and retrieval strategies to new agent-based models – making software more adaptive and intelligent by design.

Since the invention of the computer, we’ve operated under a simple premise: computers are predictable machines that precisely follow a defined set of instructions. This fundamental assumption has shaped how we design software and interact with our devices. Artificial intelligence has upended the established 50-year paradigm of deterministic computing. Post-determinism marks a shift from computers as rigid executors of instructions to adaptable, probabilistic systems that generate responses based on learned patterns rather than explicit code. We’re entering an era where computers can interpret, create, and surprise us. This shift from predictable to probabilistic computing isn’t just a technical evolution – it represents a complete transformation in how we must think about and interact with technology. Unlike deterministic systems, where failure modes are predictable, AI-driven software introduces new risks, including bias, non-deterministic outputs, and emergent behaviors that challenge traditional software engineering principles. Developers must rethink their approach to software design, including reimagining potential use cases in light of these new capabilities.

AI is fundamentally reshaping software development. From design patterns to architecture choices, AI capabilities are introducing new paradigms that augment or even replace traditional approaches. This transformation is evident in how we design systems, plan solutions, and build features. In this upcoming series of articles, I’ll explore the emergence of new software design patterns, broader changes in solution design, and solutions that were once out of reach with traditional software development. Can software still be “debugged” when outputs are not deterministic? What does software reliability look like when outputs are probabilistic? How do we ensure accountability when AI-driven software makes decisions?


Prompt Engineering: Art and Science

Effective prompt engineering is an art as much as it is a science. Programmers can ensure quality LLM output in their apps by following established prompting frameworks.

As artificial intelligence becomes more deeply integrated into business operations, the art and science of prompt engineering is emerging as an essential skill among knowledge workers. Understanding how to get the most out of large language models will quickly become a competitive differentiator that gives these employees a significant edge in the workplace. As AI adoption accelerates, businesses will increasingly invest in and value prompt engineering expertise.

Large language models are highly probabilistic. Given the same prompt, the model might not always produce the same response, especially when randomness is introduced through parameters like temperature and top-k sampling. While this probabilistic nature helps generate diverse and creative outputs, many business use cases require consistency, reliability, and precision. Sound prompt engineering does not eliminate AI’s probabilistic nature but strategically narrows the range of outputs, making responses more predictable and valuable for structured applications.
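
As a small illustration (using the openai Python SDK; the model name and prompts are placeholders), lowering the sampling temperature is one of the levers that narrows the output distribution:

from openai import OpenAI

client = OpenAI()

# A low temperature plus a tightly scoped prompt makes extraction-style tasks
# far more repeatable than free-form, high-temperature generation.
completion = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # near-deterministic sampling
    messages=[
        {"role": "system", "content": "Answer with a single word."},
        {"role": "user", "content": "What color are violets?"},
    ],
)
print(completion.choices[0].message.content)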

COSTAR

The COSTAR prompt framework provides a structured approach to prompting, ensuring that the key pieces of information that influence an LLM’s response are supplied to the model:

  • Context: Provide background information that helps the LLM understand the specific scenario.
  • Objective: Clearly defining the tasks focuses the LLM’s output.
  • Style: What writing style should the response have?
  • Tone: What tone should the response have (motivational, friendly, etc.)?
  • Audience: Who is using the LLM?
  • Response: Provide a specific response format (text, JSON, etc.).

Below is an example of a system prompt for a summarization analysis assistant:

# CONTEXT
You are a precision-focused text analysis system designed to evaluate summary accuracy. You analyze both the original text and its summary to determine how well the summary captures the essential information and meaning of the source material.

# OBJECTIVE
Compare an original text with its summary to:
1. Calculate a similarity score between 0.00 and 1.00 (where 1.00 represents perfect accuracy)
2. Provide clear reasoning for the score
3. Identify specific elements that influenced the scoring

# STYLE
Clear, precise, and analytical, focusing on concrete examples from both texts to support the evaluation.

# TONE
Objective and factual, like a scientific measurement tool.

# AUDIENCE
Users who need quantitative and qualitative assessment of summary accuracy, requiring specific numerical feedback.

# RESPONSE FORMAT
Output should be structured as follows:
1. Accuracy Score: [0.00-1.00]
2. Score Explanation:
    - Key factors that raised the score
    - Key factors that lowered the score
    - Specific examples from both texts to support the assessment
3. Brief conclusion summarizing the main reasons for the final score

**NOTE:** Always maintain score precision to two decimal places (e.g., 0.87, 0.45, 0.92)

Structured Outputs

Our example above leaves the exact response format up to the model. This strategy works well for a text-based chatbot, but what if we want to use the API to retrieve data that our application will consume? Any break in the expected format will result in a parsing error and cause our program to throw an exception. Defining an output structure for the model provides two main advantages:

  1. Type-safety: Validation of response format and data types is not required.
  2. Simplified Prompting: No need to precisely explain data formats and/or provide examples to ensure proper response format.

I created an object named accuracy_score with three properties, each representing one of our requested outputs.

{
  "name": "accuracy_score",
  "schema": {
    "type": "object",
    "properties": {
      "score": {
        "type": "number",
        "description": "The accuracy score as a float ranging from 0.00 to 1.00."
      },
      "score_explanation": {
        "type": "string",
        "description": "A description or explanation of the accuracy score."
      },
      "conclusion": {
        "type": "string",
        "description": "A concluding statement based on the accuracy score."
      }
    },
    "required": [
      "score",
      "score_explanation",
      "conclusion"
    ],
    "additionalProperties": false
  },
  "strict": true
}

I can easily reference my schema within my application by defining a response format sent with each request. Any request referencing my response format is now guaranteed to conform to the schema in both type and structure. My app can always rely on well-formed data when retrieving the values of the score, score_explanation, and conclusion properties.

response_format: { "type": "json_schema", "json_schema": … }
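
Putting it together, a request might look something like the sketch below; the openai Python SDK usage is illustrative, and system_prompt, original, summary, and accuracy_score_schema are placeholder variables (the COSTAR prompt and JSON schema are the ones defined above).

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},  # the COSTAR prompt above
        {"role": "user", "content": f"Original:\n{original}\n\nSummary:\n{summary}"},
    ],
    response_format={"type": "json_schema", "json_schema": accuracy_score_schema},
)
result = completion.choices[0].message.content  # JSON string matching the schema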

Apple is Missing the AI Race

Apple is failing to implement artificial intelligence in a way that plays to their greatest strengths.

Apple has two major advantages in the AI race. The first is hardware: the unified memory architecture of their ARM-based SoCs gives the GPU and Neural Engine access to far more RAM than competing designs, which allows smaller models to perform well on device even as the context window grows. Each generated token adds entries to the attention key/value cache, so the memory footprint grows quickly as individual conversations get longer. The second advantage – the platform, with access to all of my personal data – is the one Apple is not taking advantage of.

Mark Gurman at Bloomberg reports:

“The goal is to ultimately offer a more versatile Siri that can seamlessly tap into customers’ information and communication. For instance, users will be able to ask for a file or song that they discussed with a friend over text. Siri would then automatically retrieve that item. Apple also has demonstrated the ability for Siri to quickly locate someone’s driver’s license number by reviewing their photos.”

This is Apple’s competitive differentiator and where Apple should have focused its resources from the start. Why can't I ask questions about my archived email, or find correlations between exercise volume and sleep quality in the Health app?

Apple’s real AI advantage isn’t just hardware — it’s the platform. A company that prides itself on tight integration across devices should be leading in AI that understands me. The ability to surface insights from my personal data, securely and privately, is where Apple could create the most compelling user experience.
