<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[AI Agents]]></title><description><![CDATA[AI Agents]]></description><link>https://connectai.blog</link><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 10:57:53 GMT</lastBuildDate><atom:link href="https://connectai.blog/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Agents: Architecture and Design]]></title><description><![CDATA[The term “agent” can have multiple interpretations. Some define agents as fully autonomous systems that operate independently over extended periods, leveraging various tools to complete complex tasks. Others use the term to refer to more structured i...]]></description><link>https://connectai.blog/agents-architecture-and-design</link><guid isPermaLink="true">https://connectai.blog/agents-architecture-and-design</guid><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[agents]]></category><category><![CDATA[#anthropic]]></category><dc:creator><![CDATA[Harshit Sharma]]></dc:creator><pubDate>Sat, 15 Mar 2025 07:56:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/gPASJNVOuM0/upload/579463eb79393c45fe6c0884bece67fa.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The term <strong>“agent”</strong> can have multiple interpretations. Some define agents as fully autonomous systems that operate independently over extended periods, leveraging various tools to complete complex tasks. Others use the term to refer to more structured implementations that adhere to predefined workflows.</p>
<p>We categorize all these variations as <strong>agentic systems</strong>, but we draw an important architectural distinction between <strong>workflows</strong> and <strong>agents</strong>:</p>
<p>• <strong>Workflows</strong> are structured systems where LLMs and tools operate through predefined code paths.</p>
<p>• <strong>Agents</strong>, in contrast, dynamically determine their own processes and tool usage, maintaining control over how they accomplish tasks.</p>
<h3 id="heading-workflows">Workflows</h3>
<p>As complexity increases, <strong>workflows</strong> offer predictability and consistency for well-defined tasks, whereas <strong>agents</strong> excel when flexibility and model-driven decision-making are needed at scale.</p>
<p><strong>Types of Workflows</strong></p>
<p><strong>Prompt Chaining</strong></p>
<p>A structured sequence of LLM interactions, where each step builds on the previous one.</p>
<p><a target="_blank" href="https://www.anthropic.com/engineering/building-effective-agents"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742024937135/49858a9a-d5c5-4fdd-8623-21a8e1ef741e.webp" alt class="image--center mx-auto" /></a></p>
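<p>As a minimal illustration, the chain above can be sketched in plain Python. The <code>call_llm</code> helper below is a hypothetical stand-in for a real model API call; the point is only the shape of the pattern — each step's output feeds the next step's prompt, with an optional validation check between steps:</p>

```python
# Prompt-chaining sketch. `call_llm` is a hypothetical placeholder for a
# real model call; each step builds on the previous step's output.
def call_llm(prompt: str) -> str:
    # A real implementation would call a model API here.
    return f"[response to: {prompt}]"

def prompt_chain(topic: str) -> str:
    outline = call_llm(f"Write an outline about {topic}")
    # Optional "gate": validate the intermediate output before continuing.
    if not outline:
        raise ValueError("outline step failed")
    draft = call_llm(f"Expand this outline into a draft:\n{outline}")
    return call_llm(f"Polish this draft:\n{draft}")
```

<p>Each call is cheap to reason about in isolation, which is what makes this workflow predictable compared to a free-running agent.</p>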
<p><strong>Routing</strong></p>
<p><a target="_blank" href="https://www.anthropic.com/engineering/building-effective-agents"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742024967579/a61185ce-774b-4a5d-9717-8e82738adcae.webp" alt class="image--center mx-auto" /></a></p>
<p><strong>When to use it:</strong> Best suited for complex tasks that require distinct handling for different categories. This workflow relies on classification—either by an LLM or a traditional model/algorithm—to direct tasks down the appropriate paths.</p>
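<p>A toy sketch of routing, with a trivial keyword classifier standing in for the LLM or traditional classifier mentioned above (all names here are illustrative):</p>

```python
# Routing sketch: classify the input, then dispatch it to a specialized
# handler. The keyword classifier is a stub for an LLM or classical model.
def classify(query: str) -> str:
    if "refund" in query.lower():
        return "billing"
    if "error" in query.lower():
        return "technical"
    return "general"

HANDLERS = {
    "billing": lambda q: f"billing team handles: {q}",
    "technical": lambda q: f"tech support handles: {q}",
    "general": lambda q: f"general assistant handles: {q}",
}

def route(query: str) -> str:
    return HANDLERS[classify(query)](query)
```

<p>Keeping each handler's prompt specialized is the main benefit: no single prompt has to cover every category.</p>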
<p><strong>Parallelization</strong></p>
<p><a target="_blank" href="https://www.anthropic.com/engineering/building-effective-agents"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742024989364/bb53ff6d-9b8e-4a78-9694-ed04305fad6c.webp" alt class="image--center mx-auto" /></a></p>
<p><strong>When to use it:</strong> Ideal when subtasks can be executed simultaneously for increased speed or when multiple perspectives or attempts improve accuracy. LLMs tend to perform better when each consideration in a complex task is handled through a separate LLM call.</p>
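<p>A minimal sketch of the "voting" flavor of parallelization: fire the same prompt several times concurrently and aggregate by majority vote. <code>call_llm</code> is again a hypothetical stub:</p>

```python
# Parallelization sketch: run independent LLM calls concurrently and
# aggregate the results by majority vote.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # Placeholder verdict; a real call would return a model judgment.
    return "safe"

def vote(prompt: str, n: int = 3) -> str:
    with ThreadPoolExecutor() as pool:
        verdicts = list(pool.map(call_llm, [prompt] * n))
    return Counter(verdicts).most_common(1)[0][0]
```

<p>The "sectioning" flavor is the same structure with different prompts per call instead of repeated identical ones.</p>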
<p><strong>Orchestrator-Workers</strong></p>
<p><a target="_blank" href="https://www.anthropic.com/engineering/building-effective-agents"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742025005956/43ca6c93-d7f9-4ee0-8b4d-c2fcd8704623.webp" alt class="image--center mx-auto" /></a></p>
<p><strong>When to use it:</strong> This workflow is effective for complex tasks where the necessary subtasks cannot be predicted in advance. Unlike parallelization, where tasks are predefined, an <strong>orchestrator</strong> determines which subtasks are needed based on input.</p>
<p>Example: Coding tools that modify multiple files dynamically, depending on the nature of the requested changes.</p>
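<p>The distinguishing feature — subtasks decided at runtime — can be sketched as below. The planner stub stands in for an orchestrator LLM; every name here is illustrative:</p>

```python
# Orchestrator-workers sketch: the orchestrator decides which subtasks are
# needed for this particular input, then workers execute them.
def plan_subtasks(task: str) -> list[str]:
    # A real orchestrator would ask an LLM to decompose the task;
    # the subtask list is not known until runtime.
    return [f"edit file {i} for: {task}" for i in range(2)]

def worker(subtask: str) -> str:
    return f"done: {subtask}"

def orchestrate(task: str) -> list[str]:
    return [worker(s) for s in plan_subtasks(task)]
```

<p>Contrast with parallelization: here the fan-out itself is model-determined, not hardcoded.</p>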
<p><strong>Evaluator-Optimizer</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742025046703/8785de29-1b9f-465c-b27e-df09dd65a7c4.webp" alt class="image--center mx-auto" /></p>
<p>In this iterative workflow, one LLM generates a response while another evaluates and refines it in a feedback loop.</p>
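<p>The feedback loop can be sketched as a generate–evaluate cycle with a retry cap. Both functions below are stubs for separate LLM calls; the acceptance rule is deliberately trivial:</p>

```python
# Evaluator-optimizer sketch: one "LLM" generates, another evaluates,
# and the loop repeats until acceptance or a retry limit.
def generate(task: str, feedback: str = "") -> str:
    return f"draft for {task}" + (" (revised)" if feedback else "")

def evaluate(draft: str) -> tuple[bool, str]:
    # A real evaluator would be a second LLM call with clear criteria;
    # here we simply accept any revised draft.
    return ("revised" in draft, "needs revision")

def refine(task: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        ok, feedback = evaluate(draft)
        if ok:
            return draft
    return draft
```

<p>This pattern works best when evaluation criteria are explicit enough that the evaluator's feedback genuinely improves the next draft.</p>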
<h3 id="heading-agents">Agents</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742025065836/87c91a26-055b-4f1c-906c-0996c1f51572.webp" alt class="image--center mx-auto" /></p>
<p><strong>When to use agents:</strong> Agents are ideal for open-ended problems where the number of required steps is unpredictable, and a fixed process cannot be hardcoded. They enable LLMs to operate autonomously over multiple turns, making them well-suited for scaling tasks in trusted environments where decision-making flexibility is essential.</p>
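<p>The core agent loop — act, observe, decide whether to continue — can be sketched as follows. <code>choose_action</code> stands in for the LLM's decision step, and the tool set is illustrative:</p>

```python
# Agent-loop sketch: the model repeatedly picks a tool based on accumulated
# environment feedback until it decides it is finished (or hits a step cap).
def choose_action(history: list[str]) -> str:
    # A real agent would ask the LLM to choose given the history.
    return "search" if not history else "finish"

TOOLS = {"search": lambda: "search results"}

def run_agent(task: str, max_steps: int = 5) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        action = choose_action(history)
        if action == "finish":
            break
        # Tool output is fed back so the next decision has ground truth.
        history.append(TOOLS[action]())
    return history
```

<p>The step cap is the safety valve: because the number of iterations is model-determined, trusted environments and stopping conditions matter.</p>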
<h3 id="heading-summary">Summary</h3>
<p>We often mistake workflows for agents, assuming they operate the same way. However, when I started using <a target="_blank" href="https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview">Claude Code</a>, I truly understood what an agent actually is. Many times, we’re using workflows but believe there’s an agent working behind the scenes.</p>
<p>The definitions and diagrams above are from Anthropic’s guide to building effective agents: <a target="_blank" href="https://www.anthropic.com/engineering/building-effective-agents">https://www.anthropic.com/engineering/building-effective-agents</a></p>
]]></content:encoded></item><item><title><![CDATA[Agents: Memory]]></title><description><![CDATA[Implementing Long-Term Memory in AI Agents (Semantic, Episodic, Procedural) with LangMem
AI agents powered by large language models (LLMs) can appear more intelligent and personalized when they remember information over time. By equipping agents with...]]></description><link>https://connectai.blog/agents-memory</link><guid isPermaLink="true">https://connectai.blog/agents-memory</guid><category><![CDATA[llm]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[agents]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[langgraph]]></category><dc:creator><![CDATA[Harshit Sharma]]></dc:creator><pubDate>Mon, 10 Mar 2025 06:01:22 GMT</pubDate><content:encoded><![CDATA[<p><strong>Implementing Long-Term Memory in AI Agents (Semantic, Episodic, Procedural) with</strong> <a target="_blank" href="https://langchain-ai.github.io/langmem/"><strong>LangMem</strong></a></p>
<p>AI agents powered by large language models (LLMs) can appear more intelligent and personalized when they <strong>remember</strong> information over time. By equipping agents with long-term memory, developers enable them to recall facts, past interactions, and learned skills beyond a single chat session. In this article, we’ll conduct a deep dive into the role of memory in AI agents, focusing on the three key types of memory – <strong>semantic, episodic, and procedural</strong> – and how to implement each.</p>
<p>We’ll explore conceptual differences between these memory types, practical strategies for integrating memory into AI systems, and specific techniques using the <a target="_blank" href="https://blog.langchain.dev/langmem-sdk-launch/"><strong>LangMem</strong></a> framework (a toolkit for long-term memory in LangChain). We’ll also discuss optimization techniques like delayed memory processing, dynamic namespaces, efficient retrieval, and the performance trade-offs involved in giving your AI a long-term memory.</p>
<h3 id="heading-conceptual-understanding-memory-types-in-ai-agents"><strong>Conceptual Understanding: Memory Types in AI Agents</strong></h3>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Memory Type</th><th>Purpose</th><th>Agent Example</th><th>Human Example</th><th>Typical Storage Pattern</th></tr>
</thead>
<tbody>
<tr>
<td>Semantic</td><td>Facts &amp; Knowledge</td><td>User preferences; knowledge triplets</td><td>Knowing Python is a programming language</td><td>Profile or Collection</td></tr>
<tr>
<td>Episodic</td><td>Past Experiences</td><td>Few-shot examples; Summaries of past conversations</td><td>Remembering your first day at work</td><td>Collection</td></tr>
<tr>
<td>Procedural</td><td>System Behavior</td><td>Core personality and response patterns</td><td>Knowing how to ride a bicycle</td><td>Prompt rules or Collection</td></tr>
</tbody>
</table>
</div><p><strong>Source :</strong> <a target="_blank" href="https://blog.langchain.dev/langmem-sdk-launch/">https://blog.langchain.dev/langmem-sdk-launch/</a></p>
<p>Just as humans have multiple forms of memory, AI agents can benefit from different memory types for different purposes. In cognitive terms, we can draw analogies to human memory when designing AI agent memory:</p>
<p>• <strong>Semantic Memory (Facts &amp; Knowledge):</strong> Semantic memory stores factual information or general knowledge an agent has learned. In humans, this is like remembering that <em>Paris is the capital of France</em> or <em>Python is a programming language</em>. For AI, semantic memory might include <strong>facts about the user or world</strong> that were learned during interactions or provided as data (e.g. a user’s name, preferences, key domain facts). This memory enables the agent to ground its responses with correct details and personalization. <em>Example:</em> A virtual assistant’s semantic memory could store that the user’s favorite cuisine is Italian and their birthday is July 20th, so it can later recommend Italian restaurants or send birthday wishes.</p>
<p>• <strong>Episodic Memory (Events &amp; Experiences):</strong> Episodic memory records specific experiences or past events. For humans, episodic memories might be <em>recollections of your first day at work or a memorable trip</em>. In AI agents, episodic memory means remembering <strong>past dialogues or problem-solving episodes</strong> – essentially the agent’s own experiences in dealing with certain situations. This could include summaries of previous conversations, successful task outcomes, or mistakes made, along with the context in which they occurred.</p>
<p>Episodic memory helps the agent recall <em>“how did I handle this before?”</em> and apply that experience to guide future behavior. <em>Example:</em> A customer support chatbot’s episodic memory might include a summary of the last support session with a user, so if the user returns, the bot remembers what was tried before and what the result was.</p>
<p>• <strong>Procedural Memory (Skills &amp; Behaviors):</strong> Procedural memory captures the <strong>know-how for performing tasks</strong>, encompassing rules, skills, or policies the agent follows. In humans, this is like the ingrained skill of <em>riding a bicycle</em> or <em>playing the piano</em> – you might not recall a specific event, but you have internalized how to do it. For AI agents, procedural memory manifests in the agent’s core behavior: it can be encoded in the model’s weights, in the agent’s code, or importantly in the <strong>system prompts and instructions</strong> that guide its responses. By updating its procedural memory, an agent can <strong>learn new behaviors or refine its style over time</strong> without changing its underlying model weights. <em>Example:</em> An AI coding assistant might learn over time to adopt a more detailed code commenting style after it consistently gets user feedback asking for more explanation. This learned behavior is stored as an adjustment to its system prompt (procedural memory), so future code outputs include better comments by default.</p>
<p><strong>Semantic memory</strong> gives the agent a knowledge base of facts (the <strong><em>“what”</em></strong>)<br /><strong>Episodic memory</strong> provides it with personal experiences (the <strong><em>“when and how”</em></strong> of past events)<br /><strong>Procedural memory</strong> governs its inherent skills or behaviors (the <strong><em>“how to do”</em></strong> rules).</p>
<p>In practice, an AI agent will typically use a combination of all three to achieve more intelligent and personalized interactions. For example, a sophisticated personal assistant might use semantic memory to recall a user’s preferences, episodic memory to remember the context of previous conversations with that user, and procedural memory to adapt its tone or strategy based on what has been effective in the past.</p>
<h3 id="heading-implementation-strategies-for-ai-agent-memory">Implementation Strategies for AI Agent Memory</h3>
<p>How can developers equip AI agents with these forms of memory? Implementing memory in AI systems involves deciding <strong>what information to store</strong>, <strong>how to store it</strong>, and <strong>when to retrieve or update it</strong> during conversations. Below, we outline strategies for integrating each type of memory into an AI agent, along with real-world use cases and best practices. There are implementations using LangMem; check out some examples here: <a target="_blank" href="https://github.com/linux-devil/llm_learning/tree/main/agent_memory">https://github.com/linux-devil/llm_learning/tree/main/agent_memory</a></p>
<h3 id="heading-semantic-memory-implementation-facts-amp-knowledge">Semantic Memory Implementation (Facts &amp; Knowledge)</h3>
<p>To give an AI agent semantic memory, you need a mechanism to <strong>capture facts or details</strong> and store them in a retrievable format. A common strategy is to use a <strong>knowledge base or database</strong> (often vector databases for semantic search) to store facts extracted from interactions:</p>
<p>• <strong>Extracting Facts:</strong> After or during a conversation, you can run a process to <strong>extract key facts or data points</strong> that emerged. This could be done by calling an LLM to summarize the conversation or pull out structured facts (e.g. in JSON). For example, if a user mentions their birthday or a new preference, the agent should record that fact. Many developers implement this by writing a prompt like <em>“List any new facts the user stated about themselves”</em> and having the LLM output those facts for storage.</p>
<p>To use it, you create a memory manager specifying the LLM and optionally a schema for the information you want to extract. For example, if you want to store user preferences as facts, you might define a Pydantic model for a UserPreference or UserProfile and pass that schema to the manager. Below is an example of setting up a memory manager to extract a user’s profile information (name, preferred name, style, skills, etc.) from a conversation:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseModel, Field
<span class="hljs-keyword">from</span> langmem <span class="hljs-keyword">import</span> create_memory_manager

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">UserProfile</span>(<span class="hljs-params">BaseModel</span>):</span>
    <span class="hljs-string">"""Save the user's preferences and traits."""</span>
    name: str
    preferred_name: str
    response_style_preference: str
    special_skills: list[str]
    other_preferences: list[str]

manager = create_memory_manager(
    <span class="hljs-string">"anthropic:claude-3-5-sonnet-latest"</span>,  <span class="hljs-comment"># LLM to use for extraction</span>
    schemas=[UserProfile],
    instructions=<span class="hljs-string">"Extract user preferences and settings"</span>,
    enable_inserts=<span class="hljs-literal">False</span>  <span class="hljs-comment"># We'll update one profile document</span>
)
<span class="hljs-comment"># Assume we have a conversation list of messages (user and assistant turns)</span>
profile_memories = manager.invoke({<span class="hljs-string">"messages"</span>: conversation})
profile = profile_memories[<span class="hljs-number">0</span>].content  <span class="hljs-comment"># This is a UserProfile object</span>
print(profile)
</code></pre>
<p>In this snippet, the memory manager is instructed to extract the user’s preferences and settings from a conversation. When invoke is called with the conversation, the LLM processes it according to the schema and instructions, returning a UserProfile object filled with information gleaned from the dialogue. For example, if the user said “Hi! I’m Alex but please call me Lex. I’m a wizard at Python and love making AI systems that don’t sound like boring corporate robots…” and so on, the memory manager might produce a profile like:</p>
<pre><code class="lang-python">UserProfile(
    name=<span class="hljs-string">"Alex"</span>,
    preferred_name=<span class="hljs-string">"Lex"</span>,
    response_style_preference=<span class="hljs-string">"casual and witty with appropriate emojis"</span>,
    special_skills=[<span class="hljs-string">"Python programming"</span>, <span class="hljs-string">"AI development"</span>, <span class="hljs-string">"competitive speedcubing"</span>],
    other_preferences=[<span class="hljs-string">"prefers informal communication"</span>, <span class="hljs-string">"dislikes corporate-style interactions"</span>]
)
</code></pre>
<p>• <strong>Retrieving and Using Facts:</strong> When the agent is generating a response, relevant facts from semantic memory should be retrieved and injected into the prompt (often as part of the system message or additional context). This retrieval is typically done via <strong>semantic search</strong>: comparing the current query or conversation context with stored memory embeddings to find related info. For instance, if the user asks <em>“Can you recommend a restaurant for tonight?”</em>, the agent might query the memory store for the user’s known food preferences or past restaurant conversations, then use those facts (e.g. <em>“user likes Italian”</em>) to tailor its answer. Efficient retrieval may involve filtering by user or category and then ranking by similarity to ensure only the most relevant facts are included.</p>
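<p>As a self-contained sketch of this retrieval step, the toy <code>embed</code> function below (bag-of-words counts) stands in for a real embedding model; the ranking-by-cosine-similarity structure is the part that carries over:</p>

```python
# Semantic-retrieval sketch: embed stored facts and the query, rank by
# cosine similarity, and surface the best match for prompt injection.
# `embed` is a toy bag-of-words stand-in for a real embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_fact(query: str, facts: list[str]) -> str:
    return max(facts, key=lambda f: cosine(embed(query), embed(f)))

facts = ["user likes Italian food", "user's birthday is July 20"]
```

<p>In production, the facts would live in a vector store scoped per user, and the top hits would be appended to the system prompt before generation.</p>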
<h3 id="heading-episodic-memory-implementation-past-experiences">Episodic Memory Implementation (Past Experiences)</h3>
<p>Episodic memory in AI agents refers to preserving the context and outcomes of past interactions – essentially, <strong>remembering stories or episodes</strong> from the agent’s experience. Implementing episodic memory often involves capturing <strong>conversation transcripts or distilled “experience logs”</strong> that the agent can refer to in the future:</p>
<p>• <strong>Capturing Episodes:</strong> Not every interaction needs to become an episodic memory. A typical approach is to store <strong>notable interactions</strong>, such as successful problem-solving sessions, important user interactions, or failures that the agent should learn from. One way to do this is by summarizing entire conversations or critical segments into a concise narrative. For instance, after a support chat that ends with a satisfied user, the system can summarize: <em>“User had issue X, agent walked them through steps Y, issue resolved and user was happy”</em> – this summary becomes an episodic memory. Some frameworks treat this like an <strong>“experience replay”</strong> concept, similar to reinforcement learning, where the agent writes down the key situation, the actions it took, and the result.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseModel, Field
<span class="hljs-keyword">from</span> langmem <span class="hljs-keyword">import</span> create_memory_manager, create_memory_store_manager

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Episode</span>(<span class="hljs-params">BaseModel</span>):</span>
    <span class="hljs-string">"""Write the episode from the perspective of the agent within it. Use the benefit of hindsight to record the memory, saving the agent's key internal thought process so it can learn over time."""</span>

    observation: str = Field(..., description=<span class="hljs-string">"The context and setup - what happened"</span>)
    thoughts: str = Field(
        ...,
        description=<span class="hljs-string">"Internal reasoning process and observations of the agent in the episode that let it arrive"</span>
        <span class="hljs-string">' at the correct action and result. "I ..."'</span>,
    )
    action: str = Field(
        ...,
        description=<span class="hljs-string">"What was done, how, and in what format. (Include whatever is salient to the success of the action). I .."</span>,
    )
    result: str = Field(
        ...,
        description=<span class="hljs-string">"Outcome and retrospective. What did you do well? What could you do better next time? I ..."</span>,
    )

<span class="hljs-comment">#  The Episode schema becomes part of the memory manager's prompt,</span>
<span class="hljs-comment"># helping it extract complete reasoning chains that guide future responses</span>
manager = create_memory_manager(
    <span class="hljs-string">"openai:gpt-4o-mini"</span>,
    schemas=[Episode],
    instructions=<span class="hljs-string">"Extract examples of successful explanations, capturing the full chain of reasoning. Be concise in your explanations and precise in the logic of your reasoning."</span>,
    enable_inserts=<span class="hljs-literal">True</span>,
)

<span class="hljs-comment"># Alternatively, configure a store-backed manager that persists episodes</span>
store_manager = create_memory_store_manager(
    <span class="hljs-string">"openai:gpt-4o-mini"</span>,
    namespace=(<span class="hljs-string">"memories"</span>, <span class="hljs-string">"episodes"</span>),
    schemas=[Episode],
    instructions=<span class="hljs-string">"Extract exceptional examples of noteworthy problem-solving scenarios, including what made them effective."</span>,
    enable_inserts=<span class="hljs-literal">True</span>,
)
</code></pre>
<p>• <strong>Structured Experience Logs:</strong> It can help to structure episodic memories for consistency. For example, you might define an <em>Episode</em> record with fields such as situation/context, action_taken (or chain of thought), and outcome. This is akin to a journal entry that explains <em>how</em> the agent approached a problem and what happened. By storing the agent’s internal reasoning along with the outcome, the agent can later analyze what strategies work well. This technique was demonstrated by the LangMem framework, which allows defining a Pydantic schema for an Episode and using an LLM to fill it out after an interaction. The episodic memory entry essentially captures <em>“how did I get to a good answer and what was the result?”</em> in a given scenario.</p>
<p>• <strong>Using Episodes for Learning:</strong> When faced with a new task, an agent can retrieve relevant episodes to guide its behavior. This often takes the form of <strong>few-shot prompting</strong>: supplying the model with examples from past episodes that are similar to the current context. Typically, episodic memories are retrieved by similarity (e.g., via an embedding search on the episode descriptions) or by tagging (metadata indicating the type of scenario). Including a highly relevant past example in the prompt can significantly improve performance on tasks that are similar to what the agent has seen before.</p>
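<p>A minimal sketch of this few-shot step, with keyword overlap standing in for embedding search and a dict standing in for the Episode record (all names here are illustrative):</p>

```python
# Few-shot-from-episodes sketch: retrieve the most similar past episode
# and prepend it to the prompt as a worked example. Keyword overlap is a
# toy stand-in for embedding-based similarity search.
def similarity(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

def build_prompt(task: str, episodes: list[dict]) -> str:
    best = max(episodes, key=lambda e: similarity(task, e["observation"]))
    return (
        f"Past example:\n{best['observation']}\n{best['action']}\n{best['result']}\n\n"
        f"New task: {task}"
    )

episodes = [
    {"observation": "student stuck on fractions", "action": "used pizza analogy", "result": "solved"},
    {"observation": "API timeout bug", "action": "added retries", "result": "fixed"},
]
```

<p>The retrieved episode acts as a worked example, steering the model toward strategies that previously succeeded.</p>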
<p><strong>Real-world use case:</strong> Consider an AI tutoring system that helps students solve math problems. It can maintain episodic memories of each tutoring session: what problems were attempted, which hints helped, and where the student struggled. Later, if the same student (or even another student) faces a similar problem, the agent can recall the episode to avoid ineffective hints and apply strategies that proved successful. Another example: in AI game agents, episodic memory could be used to remember sequences of moves that led to victory or defeat, informing future decision-making.</p>
<h3 id="heading-procedural-memory-implementation-skills-amp-behaviors"><strong>Procedural Memory Implementation (Skills &amp; Behaviors)</strong></h3>
<p>Procedural memory is about <strong>how an agent does things</strong> – its ingrained skills or policies. Implementing procedural memory in AI agents often means enabling the agent to <strong>learn from feedback and change its core behavior</strong> (usually its system prompt or internal rules) over time . Unlike semantic or episodic memory which store content to <em>inject</em>, procedural memory updates how the agent formulates responses. Here are strategies to implement it:</p>
<p>• <strong>Prompt Refinement (“Self-Improvement”):</strong> A straightforward way for an agent to adjust its behavior is to refine its system prompt or instructions based on experience. For example, suppose an agent consistently gets feedback that its answers are too verbose. The developer could manually edit the system prompt to say “be concise”, but with procedural memory, the agent can learn this itself. One technique is <strong>reflective prompt optimization</strong> – after some interactions, run a process where the agent (or a separate LLM) reviews conversation transcripts and feedback to propose an updated prompt or new rules for itself. This is sometimes called <em>meta-prompting</em> or the <em>“reflect, then improve”</em> approach. The agent essentially asks <em>“How can I do better next time?”</em> and writes an improved instruction set.</p>
<p>• <strong>Feedback Loops:</strong> To know how to change its behavior, the agent needs feedback or a measure of success. This can come from explicit user feedback (thumbs-up/down, corrections) or an automated evaluation of the agent’s responses (a reward score). Developers can implement a loop where after each interaction (or batch of interactions), if the outcome was poor, the agent’s procedural memory is updated. For instance, an agent that failed a task might add a rule <em>“If the user asks X, remember to do Y first”</em> to avoid repeating the mistake. If it succeeded, it might strengthen the behavior that led to success (like <em>“Always double-check the user’s question for ambiguities”</em>). This is analogous to <strong>Reinforcement Learning from Human Feedback (RLHF)</strong>, but can be done on the fly with prompt engineering rather than weight tuning – by adding or adjusting instructions.</p>
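<p>The shape of such a feedback loop can be sketched as below. The rule-based <code>reflect</code> stub stands in for the reflection LLM call; only the structure — collect feedback, derive rules, append them to the system prompt — is the point:</p>

```python
# Feedback-loop sketch: after a batch of interactions, a reflection step
# (an LLM in practice; a rule stub here) derives new instructions and
# appends them to the system prompt, updating procedural memory.
def reflect(feedback: list[str]) -> list[str]:
    rules = []
    if any("too verbose" in f for f in feedback):
        rules.append("Be concise.")
    if any("missed a step" in f for f in feedback):
        rules.append("Double-check the question for ambiguities.")
    return rules

def update_system_prompt(prompt: str, feedback: list[str]) -> str:
    new_rules = reflect(feedback)
    return prompt + ("\n" + "\n".join(new_rules) if new_rules else "")
```

<p>Because the updated prompt persists across sessions, the behavior change sticks without any weight tuning.</p>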
<p><strong>Real-world use case:</strong> An AI content generator might notice that its user prefers outputs in a certain style (e.g. with more humor). Procedural memory allows it to internalize this style guideline. Over time and with subtle feedback (the user tends to regenerate content until a humorous one appears, etc.), the agent can adapt its core writing style to be more humorous by default, without being explicitly told on each request. Another scenario: a multi-step task-solving agent (like those using ReAct or tool use) can refine its approach to using tools. If it observes that a particular sequence of tool uses leads to better outcomes (say, always do a web search before answering a question about current events), it can incorporate that as a new step in its policy through procedural memory.</p>
<h3 id="heading-using-the-langmem-framework-for-ai-agent-memory"><strong>Using the LangMem Framework for AI Agent Memory</strong></h3>
<p>Manually implementing the above memory mechanisms can be complex, but there are frameworks designed to simplify this. <a target="_blank" href="https://langchain-ai.github.io/langmem/"><strong>LangMem</strong></a> is a recently introduced SDK specifically for managing long-term memory in AI agents. It is part of the LangChain ecosystem and provides out-of-the-box tools to handle semantic, episodic, and procedural memory for agents. Let’s explore how LangMem supports these memory types and how developers can use it in practice.</p>
<p><strong>Memory Tools and Integration</strong></p>
<p>Beyond the core APIs, LangMem also provides <strong>memory tools</strong> that integrate into an agent’s reasoning loop. If you’re using a LangChain agent (like a ReAct agent), you can give the agent tools such as <strong>ManageMemory</strong> and <strong>SearchMemory</strong> so that the agent can decide when to store or retrieve memories during conversation. For example:</p>
<p>• <code>create_manage_memory_tool(namespace=...)</code> creates a tool that the agent can invoke to save new information. The agent would invoke this when it deems something worth remembering (perhaps guided by its prompt or chain logic).</p>
<p>• <code>create_search_memory_tool(namespace=...)</code> allows the agent to query its long-term memory. The agent might use this tool when it faces a question and wants to see if it “knows” something from before.</p>
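<p>To make the division of labor concrete, here is a dependency-free sketch that mirrors the <em>shape</em> of these two tools — a save operation and a keyword search over saved items. This is not LangMem’s implementation, just an illustration of the store/search contract the agent’s tools expose:</p>

```python
# Tool-integration sketch: a minimal in-memory store with "manage" (save)
# and "search" operations, mirroring the shape of memory tools an agent
# can call mid-conversation. Purely illustrative, not LangMem's API.
class MemoryStore:
    def __init__(self) -> None:
        self.items: list[str] = []

    def manage_memory(self, content: str) -> str:
        # The agent calls this when it deems something worth remembering.
        self.items.append(content)
        return "saved"

    def search_memory(self, query: str) -> list[str]:
        # Toy keyword match; a real store would use embedding search.
        words = set(query.lower().split())
        return [m for m in self.items if words & set(m.lower().split())]
```

<p>Wiring such tools into a ReAct-style loop lets the model itself decide when to write versus read memory, rather than hardcoding those moments.</p>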
<h3 id="heading-summary">Summary</h3>
<p>In summary, adding long-term memory to AI agents is extremely promising for creating more capable and personalized systems, but it requires careful thought to get right. By understanding the types of memory (semantic for facts, episodic for experiences, procedural for skills), using frameworks like LangMem to implement them, and applying optimization techniques, developers can build agents that <strong>learn and improve over time</strong>. The trade-offs can be managed by smart design: isolate and target what to remember, update memories at the right moments, and retrieve information efficiently. With these practices, you can give your AI agent a robust memory that significantly enhances its performance and user experience, while keeping the system scalable and cost-effective.</p>
]]></content:encoded></item><item><title><![CDATA[DeepGEMM: Clean and Efficient FP8 GEMM Library]]></title><description><![CDATA[Introduction
DeepGEMM is a clean and efficient FP8 General Matrix Multiplication (GEMM) library with fine-grained scaling, released by DeepSeek as part of their "Open Source Week" in February 2025. It supports both normal dense GEMMs and Mixture-of-E...]]></description><link>https://connectai.blog/deepgemm-clean-and-efficient-fp8-gemm-library</link><guid isPermaLink="true">https://connectai.blog/deepgemm-clean-and-efficient-fp8-gemm-library</guid><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[Deepseek]]></category><dc:creator><![CDATA[Harshit Sharma]]></dc:creator><pubDate>Sat, 08 Mar 2025 17:54:19 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>DeepGEMM is a clean and efficient FP8 General Matrix Multiplication (GEMM) library with fine-grained scaling, released by DeepSeek as part of their "Open Source Week" in February 2025. It supports both normal dense GEMMs and Mixture-of-Experts (MoE) grouped GEMMs, providing high-performance matrix operations critical for modern AI models.</p>
<h2 id="heading-background-and-motivation">Background and Motivation</h2>
<p>Matrix multiplication is at the heart of deep learning computations, particularly in transformer-based models. As models continue to scale, there's an increasing need for more efficient matrix operations that can leverage hardware capabilities while maintaining numerical stability. FP8 (8-bit floating point) has emerged as a promising data format that offers a balance between precision and computational efficiency.</p>
<p>However, implementing efficient FP8 GEMM operations is challenging, especially when considering the complexities of modern GPU architectures and the need for fine-grained scaling to maintain numerical stability. Existing libraries often involve complex template metaprogramming that makes them difficult to understand and modify.</p>
<p>DeepGEMM was developed to address these challenges by providing a clean, efficient, and accessible implementation of FP8 GEMM operations that achieves high performance while maintaining readability and extensibility.</p>
<h2 id="heading-key-features-and-capabilities">Key Features and Capabilities</h2>
<p>DeepGEMM offers several key features that make it valuable for AI model development:</p>
<ol>
<li><p><strong>Lightweight Design</strong>: Core kernel function of only ~300 lines of code, making it accessible as a learning resource while still delivering high performance.</p>
</li>
<li><p><strong>Just-In-Time (JIT) Compilation</strong>: No compilation needed during installation, as kernels are compiled at runtime using a lightweight JIT module.</p>
</li>
<li><p><strong>FP8 Support with Fine-Grained Scaling</strong>: Implements efficient FP8 operations with fine-grained scaling to maintain numerical stability.</p>
</li>
<li><p><strong>Multiple GEMM Formats</strong>:</p>
<ul>
<li><p>Normal dense GEMM for standard matrix operations</p>
</li>
<li><p>Grouped contiguous GEMM for MoE models with contiguous layout</p>
</li>
<li><p>Grouped masked GEMM for MoE models with masked layout</p>
</li>
</ul>
</li>
<li><p><strong>High Performance</strong>: Achieves up to 1350+ FP8 TFLOPS on Hopper GPUs, matching or exceeding expert-tuned libraries across various matrix shapes.</p>
</li>
<li><p><strong>Auto-Tuning</strong>: Automatically selects optimal kernel configurations for different matrix shapes and hardware setups.</p>
</li>
</ol>
<h2 id="heading-technical-implementation">Technical Implementation</h2>
<p>DeepGEMM is implemented as a combination of Python and CUDA components with a focus on clean design and runtime optimization. The implementation consists of several key components:</p>
<ol>
<li><p><strong>JIT Compilation System</strong>:</p>
<ul>
<li><p>Compiles CUDA kernels at runtime using templates</p>
</li>
<li><p>Caches compiled kernels for reuse</p>
</li>
<li><p>Supports FFMA interleaving optimization for better performance</p>
</li>
</ul>
</li>
<li><p><strong>GEMM Kernels</strong>:</p>
<ul>
<li><p>Normal dense GEMM: <code>gemm_fp8_fp8_bf16_nt</code></p>
</li>
<li><p>Grouped contiguous GEMM: <code>m_grouped_gemm_fp8_fp8_bf16_nt_contiguous</code></p>
</li>
<li><p>Grouped masked GEMM: <code>m_grouped_gemm_fp8_fp8_bf16_nt_masked</code></p>
</li>
</ul>
</li>
<li><p><strong>Auto-Tuning System</strong>:</p>
<ul>
<li><p>Automatically selects optimal kernel configurations</p>
</li>
<li><p>Tunes parameters like block sizes, number of stages, and TMA multicast</p>
</li>
</ul>
</li>
<li><p><strong>Fine-Grained Scaling</strong>:</p>
<ul>
<li><p>Supports 1x128 LHS scaling and 128x128 RHS scaling</p>
</li>
<li><p>Implements efficient TMA-aligned tensor handling</p>
</li>
</ul>
</li>
</ol>
<p>The library addresses the imprecise FP8 tensor core accumulation issue by implementing CUDA-core two-level accumulation (promotion), ensuring numerical stability without sacrificing performance.</p>
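<p>To make the scaling scheme concrete, here is a NumPy sketch (illustrative only, not DeepGEMM's kernel code) of the 1x128 per-block quantization described above: each 128-wide block of a row gets one scale chosen so that the block’s maximum magnitude maps onto the FP8 e4m3 range, and the scales are kept so the values can be dequantized later. The function name and shapes are assumptions made for this example.</p>

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in e4m3

def quantize_1x128(x):
    """Per-(1x128)-block scaling sketch: one scale per 128-wide block
    of each row, chosen so the block's max maps to the FP8 range.
    Returns the scaled values and the per-block scales (the final cast
    to 8 bits is omitted, so the round trip here is exact up to
    float rounding)."""
    m, k = x.shape
    assert k % 128 == 0, "K must be a multiple of the 128-wide block"
    blocks = x.reshape(m, k // 128, 128)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    scales = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
    q = blocks / scales  # values now fit the FP8 dynamic range
    return q.reshape(m, k), scales.squeeze(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256)).astype(np.float32)
q, s = quantize_1x128(x)                       # s has shape [4, 2]
recon = (q.reshape(4, 2, 128) * s[:, :, None]).reshape(4, 256)
```

<p>Dequantizing with the stored scales recovers the input, which is exactly the property the fine-grained scheme relies on to keep FP8 numerically stable.</p>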
<h2 id="heading-performance-and-benchmarks">Performance and Benchmarks</h2>
<p>DeepGEMM demonstrates impressive performance across a wide range of matrix shapes and configurations:</p>
<h3 id="heading-normal-gemms-for-dense-models">Normal GEMMs for dense models</h3>
<ul>
<li><p>Achieves up to 1358 TFLOPS for large matrices (4096x7168x16384)</p>
</li>
<li><p>Provides 1.1x-2.7x speedup compared to optimized CUTLASS implementation</p>
</li>
<li><p>Excellent performance for both memory-bound and compute-bound configurations</p>
</li>
</ul>
<h3 id="heading-grouped-gemms-for-moe-models">Grouped GEMMs for MoE models</h3>
<ul>
<li><p>Supports both contiguous and masked layouts</p>
</li>
<li><p>Achieves up to 1297 TFLOPS for grouped contiguous layout</p>
</li>
<li><p>Provides 1.1x-1.2x speedup compared to optimized implementations</p>
</li>
</ul>
<p>These performance characteristics make DeepGEMM particularly well-suited for large-scale models like DeepSeek-V3/R1, where efficient matrix operations are crucial for both training and inference.</p>
<h2 id="heading-integration-and-usage">Integration and Usage</h2>
<p>DeepGEMM is designed to be easily integrated into existing deep learning frameworks. It provides a PyTorch-compatible interface that can be used to replace standard matrix multiplication operations with optimized FP8 implementations. The typical usage pattern involves:</p>
<ol>
<li><p>Preparing input tensors in the appropriate format (FP8 with scaling factors)</p>
</li>
<li><p>Calling the appropriate GEMM function based on the operation type (normal, grouped contiguous, or grouped masked)</p>
</li>
<li><p>Processing the output tensor (BF16) as needed</p>
</li>
</ol>
<p>The library also provides utility functions for tensor format conversion and configuration management.</p>
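<p>As a rough illustration of the semantics behind these calls, the following NumPy sketch dequantizes both operands using the fine-grained scales described earlier (1x128 blocks for the LHS, 128x128 blocks for the RHS) and performs the matmul in higher precision. It borrows the dense kernel’s name from the post, but it is a hypothetical reference implementation of the math, not the library’s CUDA path, and the FP32 accumulator stands in for the BF16 output.</p>

```python
import numpy as np

def ref_gemm_fp8_fp8_bf16_nt(lhs, lhs_scales, rhs, rhs_scales):
    """Reference semantics for an FP8 GEMM with fine-grained scaling.

    lhs:        [M, K] quantized values, scaled per 1x128 block
    lhs_scales: [M, K // 128]
    rhs:        [N, K] quantized values ("nt": RHS stored transposed),
                scaled per 128x128 block
    rhs_scales: [N // 128, K // 128]
    Returns the [M, N] product in FP32 (stand-in for BF16 output).
    """
    m, k = lhs.shape
    n, _ = rhs.shape
    # Broadcast each block's scale over its block, then dequantize.
    lhs_deq = (lhs.reshape(m, k // 128, 128)
               * lhs_scales[:, :, None]).reshape(m, k)
    rhs_deq = (rhs.reshape(n // 128, 128, k // 128, 128)
               * rhs_scales[:, None, :, None]).reshape(n, k)
    return lhs_deq @ rhs_deq.T

# With unit scales the reference reduces to a plain matmul.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 256))
b = rng.standard_normal((128, 256))
out = ref_gemm_fp8_fp8_bf16_nt(a, np.ones((4, 2)), b, np.ones((1, 2)))
```
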
<h2 id="heading-conclusion">Conclusion</h2>
<p>DeepGEMM represents a significant advancement in optimizing matrix operations for modern AI models. By providing a clean, efficient implementation of FP8 GEMM operations, it addresses one of the key computational bottlenecks in large-scale models. Its support for both normal dense GEMMs and MoE grouped GEMMs makes it versatile for a wide range of model architectures, while its optimization for modern hardware ensures it can deliver maximum performance on current and future systems.</p>
<p>What sets DeepGEMM apart is its balance of performance and accessibility. While achieving performance that matches or exceeds expert-tuned libraries, it maintains a clean, readable codebase that serves as both a practical tool and an educational resource for understanding GPU optimization techniques. As AI models continue to scale and efficiency becomes increasingly critical, tools like DeepGEMM will play an essential role in pushing the boundaries of what's possible.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding Load Balancing and Expert Parallelism in AI Models]]></title><description><![CDATA[Expert Parallelism Load Balancer (EPLB)
In the world of AI and deep learning, managing the computational load efficiently is a critical task. Models, especially large-scale ones like those used in Natural Language Processing (NLP) or computer vision,...]]></description><link>https://connectai.blog/understanding-load-balancing-and-expert-parallelism-in-ai-models</link><guid isPermaLink="true">https://connectai.blog/understanding-load-balancing-and-expert-parallelism-in-ai-models</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[llm]]></category><category><![CDATA[mlops]]></category><dc:creator><![CDATA[Harshit Sharma]]></dc:creator><pubDate>Wed, 05 Mar 2025 19:33:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/tikhtH3QRSQ/upload/d94d24c605cea9ae2d22cbe615b13cf0.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-expert-parallelism-load-balancer-eplb">Expert Parallelism Load Balancer (EPLB)</h1>
<p>In the world of AI and deep learning, managing the computational load efficiently is a critical task. Models, especially large-scale ones like those used in Natural Language Processing (NLP) or computer vision, require handling massive amounts of data and computation across multiple computing resources such as GPUs or server nodes. This task often involves <strong>parallelism</strong>, which is the strategy of dividing the workload into smaller chunks and processing them simultaneously. Recently, DeepSeek open-sourced their EPLB algorithm: https://github.com/deepseek-ai/EPLB</p>
<p>A central concept in parallelism, especially for models with large parameters (such as mixture-of-experts or MoE models), is <strong>expert parallelism</strong>, where we distribute expert models (or parts of a model) across different resources. This helps to balance the computational load across multiple GPUs, nodes, or physical experts while optimizing performance and minimizing bottlenecks.</p>
<p>Let's explore the code snippet that handles <strong>expert parallelism</strong> load balancing through <strong>replication and rebalancing</strong> in a hierarchical structure.</p>
<h4 id="heading-1-packing-objects-with-balanced-weights">1. <strong>Packing Objects with Balanced Weights</strong></h4>
<p>The first step in this parallelism process involves packing objects (data or model components) into groups (or "packs"). This ensures that each group has an approximately equal weight distribution, minimizing the chance of overloading any one pack.</p>
<h5 id="heading-function-balancedpacking">Function: <code>balanced_packing</code></h5>
<p>This function's goal is to divide <code>n</code> weighted objects into <code>m</code> groups or "packs" such that the distribution of weight in each pack is as balanced as possible. Here's a breakdown of its core logic:</p>
<ul>
<li><p><strong>Input</strong>: A tensor <code>weight</code> of size <code>[X, n]</code>, where <code>X</code> is the number of layers, and <code>n</code> is the number of objects to pack. Additionally, <code>num_packs</code> specifies how many packs or groups to create.</p>
</li>
<li><p><strong>Output</strong>:</p>
<ul>
<li><p><code>pack_index</code>: Tensor showing which pack each object belongs to.</p>
</li>
<li><p><code>rank_in_pack</code>: Tensor showing the rank (or position) of the item within its respective pack.</p>
</li>
</ul>
</li>
</ul>
<p>The function first checks if the number of groups divides evenly into the number of packs. If this is true, it proceeds to <strong>sort the objects by weight</strong> and assigns them to packs, ensuring that each pack's weight is as balanced as possible.</p>
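<p>A minimal single-layer sketch of this idea (hypothetical, not the EPLB source, which operates on a <code>[layers, n]</code> tensor): visit objects in decreasing weight order and greedily place each into the lightest pack that still has room.</p>

```python
def balanced_packing(weights, num_packs):
    """Greedy sketch: packs get equal item counts, and each object
    goes to the currently lightest pack with spare capacity."""
    n = len(weights)
    assert n % num_packs == 0, "objects must divide evenly into packs"
    cap = n // num_packs
    pack_weight = [0.0] * num_packs
    pack_items = [0] * num_packs
    pack_index = [-1] * n    # which pack each object landed in
    rank_in_pack = [-1] * n  # position of the object within its pack
    for obj in sorted(range(n), key=lambda i: -weights[i]):
        # lightest pack that still has capacity
        p = min((j for j in range(num_packs) if pack_items[j] < cap),
                key=lambda j: pack_weight[j])
        pack_index[obj] = p
        rank_in_pack[obj] = pack_items[p]
        pack_items[p] += 1
        pack_weight[p] += weights[obj]
    return pack_index, rank_in_pack

pack_index, rank_in_pack = balanced_packing([4, 3, 2, 1], num_packs=2)
```

<p>With these weights both packs end up with total weight 5, which is the balance property the function is after.</p>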
<h4 id="heading-2-replicating-experts-to-minimize-load">2. <strong>Replicating Experts to Minimize Load</strong></h4>
<p>Once the objects are packed, the next step is to <strong>replicate</strong> these "logical experts" (think of these as model components or neurons) into <strong>physical experts</strong>. The goal here is to distribute the computational load evenly across physical devices like GPUs.</p>
<h5 id="heading-function-replicateexperts">Function: <code>replicate_experts</code></h5>
<p>This function replicates logical experts (model components) into physical experts, minimizing the load imbalance across the physical replicas:</p>
<ul>
<li><p><strong>Input</strong>:</p>
<ul>
<li><p><code>weight</code>: A tensor representing the weight of each logical expert.</p>
</li>
<li><p><code>num_phy</code>: The total number of physical experts (often corresponds to available GPUs).</p>
</li>
</ul>
</li>
<li><p><strong>Output</strong>:</p>
<ul>
<li><p><code>phy2log</code>: Mapping from physical expert IDs to logical expert IDs.</p>
</li>
<li><p><code>rank</code>: The rank (or order) of each replica within its physical expert.</p>
</li>
<li><p><code>logcnt</code>: Number of replicas each logical expert has across physical experts.</p>
</li>
</ul>
</li>
</ul>
<p>The idea here is that when we replicate a logical expert across multiple physical devices, we aim to minimize the <strong>maximum load</strong> (or number of tasks) that each physical expert carries. By distributing the work evenly, we ensure better resource utilization and prevent some devices from being overloaded while others are underutilized.</p>
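<p>One simple way to realise this, sketched below under the assumption of a greedy strategy (the actual EPLB implementation may differ in details): start with one replica per logical expert, then repeatedly grant an extra replica to the expert whose load per replica is currently highest.</p>

```python
def replicate_experts(weight, num_phy):
    """Greedy replication sketch minimizing the max per-replica load."""
    num_log = len(weight)
    logcnt = [1] * num_log  # every logical expert gets one replica
    for _ in range(num_phy - num_log):
        # next replica goes to the expert with the highest load/replica
        e = max(range(num_log), key=lambda i: weight[i] / logcnt[i])
        logcnt[e] += 1
    # phy2log maps physical slots to logical experts; rank orders the
    # replicas of each logical expert.
    phy2log = [e for e in range(num_log) for _ in range(logcnt[e])]
    rank = [r for e in range(num_log) for r in range(logcnt[e])]
    return phy2log, rank, logcnt

# Expert 0 carries 3x the load of expert 1, so it gets 3 of 4 replicas.
phy2log, rank, logcnt = replicate_experts([9.0, 3.0], num_phy=4)
```
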
<h4 id="heading-3-hierarchical-rebalancing-for-optimal-distribution">3. <strong>Hierarchical Rebalancing for Optimal Distribution</strong></h4>
<p>In more complex scenarios where multiple <strong>nodes</strong> or <strong>GPUs</strong> are involved, we need to rethink the way experts are distributed across devices. A hierarchical structure allows for the efficient distribution of logical experts across various levels, starting from <strong>nodes</strong> down to <strong>GPUs</strong>, ensuring optimal use of available resources.</p>
<h5 id="heading-function-rebalanceexpertshierarchical">Function: <code>rebalance_experts_hierarchical</code></h5>
<p>This function deals with <strong>rebalancing experts in a hierarchical fashion</strong>. It distributes logical experts across multiple <strong>server nodes</strong>, and within each node, it ensures that they are evenly distributed across <strong>GPUs</strong>:</p>
<ul>
<li><p><strong>Input</strong>:</p>
<ul>
<li><p><code>weight</code>: The load statistics for all logical experts.</p>
</li>
<li><p>Other parameters define the hardware layout: <code>num_physical_experts</code>, <code>num_groups</code>, <code>num_nodes</code>, <code>num_gpus</code>.</p>
</li>
</ul>
</li>
<li><p><strong>Output</strong>:</p>
<ul>
<li><p><code>physical_to_logical_map</code>: A mapping from physical experts to logical experts.</p>
</li>
<li><p><code>logical_to_physical_map</code>: A mapping from logical experts to physical experts.</p>
</li>
<li><p><code>logical_count</code>: The count of replicas per logical expert.</p>
</li>
</ul>
</li>
</ul>
<h4 id="heading-4-main-function-rebalance-experts">4. <strong>Main Function: Rebalance Experts</strong></h4>
<p>The main function, <code>rebalance_experts</code>, serves as the entry point for balancing the load across multiple replicas, groups, nodes, and GPUs:</p>
<ul>
<li><p><strong>Input</strong>: Similar to the hierarchical function, this function takes a tensor of weights for logical experts and distributes them across multiple replicas, nodes, and GPUs.</p>
</li>
<li><p><strong>Output</strong>:</p>
<ul>
<li><p><code>physical_to_logical_map</code>: Mapping physical experts to logical ones.</p>
</li>
<li><p><code>logical_to_physical_map</code>: Detailed mapping for each logical expert to all its corresponding physical replicas.</p>
</li>
<li><p><code>logcnt</code>: Tracks the number of replicas for each logical expert.</p>
</li>
</ul>
</li>
</ul>
<h4 id="heading-why-is-this-approach-useful">Why Is This Approach Useful?</h4>
<p>This hierarchical load-balancing strategy is essential when dealing with large-scale models, particularly in <strong>mixture-of-experts (MoE)</strong> models, where only a subset of experts are active at any given time.</p>
<ol>
<li><p><strong>Efficient Resource Usage</strong>: By evenly distributing the computational load, this method ensures no single GPU or node becomes a bottleneck.</p>
</li>
<li><p><strong>Scalability</strong>: As the number of nodes, GPUs, or experts grows, the system can scale without introducing inefficiencies or resource underutilization.</p>
</li>
<li><p><strong>Minimal Load Imbalance</strong>: Using AI-based strategies like packing, ranking, and replicating ensures that the maximum load per physical expert is minimized, improving overall training and inference performance.</p>
</li>
</ol>
<h4 id="heading-conclusion">Conclusion</h4>
<p>Efficient expert parallelism is a cornerstone of scaling AI models across large clusters of GPUs and nodes. This load-balancing strategy, incorporating techniques like <strong>balanced packing</strong> and <strong>hierarchical expert distribution</strong>, ensures that computational resources are used optimally, preventing resource contention and ensuring fast and efficient model training. The implementation described is just one approach to managing expert parallelism, but it highlights the importance of balancing the workload in distributed AI systems, especially as model sizes continue to grow.</p>
<p>By applying these methods, AI engineers can take full advantage of multi-GPU and multi-node environments, ensuring that large models can be trained efficiently and at scale.</p>
]]></content:encoded></item><item><title><![CDATA[Course Review : Fundamentals of Reinforcement Learning [Coursera]]]></title><description><![CDATA[Reinforcement Learning is one of the machine learning technique which is getting lot of attention recently. I really get excited when I get to learn something new and exciting and especially if something is related to machine learning.
I have been tr...]]></description><link>https://connectai.blog/course-review-fundamentals-of-reinforcement-learning-coursera</link><guid isPermaLink="true">https://connectai.blog/course-review-fundamentals-of-reinforcement-learning-coursera</guid><category><![CDATA[Reinforcement Learning in Machine Learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Harshit Sharma]]></dc:creator><pubDate>Wed, 05 Mar 2025 19:02:38 GMT</pubDate><content:encoded><![CDATA[<p>Reinforcement Learning is one of the machine learning technique which is getting lot of attention recently. I really get excited when I get to learn something new and exciting and especially if something is related to machine learning.</p>
<p>I have been trying to start learning the fundamentals of reinforcement learning for quite some time and like every other such plan this was just logged in my Bookmark list waiting for an action. Recently <a target="_blank" href="https://medium.com/u/a53939629311?source=post_page---user_mention--594aac05d1d4---------------------------------------">University of Alberta</a> started Reinforcement Learning specialization course on <a target="_blank" href="https://medium.com/u/99c0fb464c1f?source=post_page---user_mention--594aac05d1d4---------------------------------------">Coursera</a>which allows one to deep dive and understand the fundamentals of reinforcement learning. Instructors Martha White and Adam White did a great job in explaining the fundamentals of reinforcement learning. For someone who is interested in this area of machine learning and waiting to get a kick start, must give this course a try.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:328/1*nz9e5yIVGewrQVRzgNJkQw.png" alt /></p>
<p>So what is reinforcement learning? The more you think about it, the more it feels like the conscience that guides the actions we take in our day-to-day lives. Don’t get confused by the word conscience; what I mean is that every action we take may carry some reward. For example, when we play games on our consoles there is one clear end goal, completing the mission, and our actions and movements determine how soon we finish it. At the same time, every action or series of actions earns some reward, which reinforces the importance of those actions at a particular stage.</p>
<p>Likewise, reinforcement learning involves an Environment, Actions, Rewards, and States, which map closely onto a game (Environment), the actions or series of actions we take while playing (such as movements), points/gems (Rewards), and stages (States).</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*2rmKGjZOv5pGkLLVt-EuMA.png" alt /></p>
<p>Reinforcement learning lets us improve a learning algorithm by trying different approaches that carefully assess the final reward resulting from the series of actions we take in an environment across different states. How can we do it better? How do we frame problems along the same lines so that we can improve them through reinforcement learning?</p>
<p>As you start this course, you will find that it asks you to work through reading material first. This builds confidence, and personally this course is one of my favorites: unlike most other courses, it focuses on understanding a concept on your own through the reading, and then a follow-up video reinforces it. I am not sure how many courses out there follow a similar structure, but I personally liked this approach.</p>
<p><strong>Week 1</strong> focuses on the K-Armed Bandit problem and the exploration/exploitation tradeoff. Hands down, this is the entry point into the world of reinforcement learning. It is very important to set the intuition right and understand why we need reinforcement learning in the first place.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*I2Ox0vphfe6miMwlGUzg0Q.png" alt /></p>
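<p>To get a feel for the exploration/exploitation tradeoff covered in Week 1, here is a tiny epsilon-greedy agent for a k-armed bandit (my own illustrative sketch, not course code): with probability eps it explores a random arm; otherwise it exploits the best-looking arm, updating a running mean of each arm’s reward.</p>

```python
import random

def epsilon_greedy_bandit(true_means, steps=10000, eps=0.1, seed=0):
    """Epsilon-greedy on a k-armed Gaussian bandit: explore a random
    arm with probability eps, otherwise pull the arm with the highest
    estimated value, updating that estimate as an incremental mean."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k  # running value estimates, one per arm
    n = [0] * k    # pull counts
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                   # explore
        else:
            a = max(range(k), key=lambda i: q[i])  # exploit
        reward = rng.gauss(true_means[a], 1.0)     # noisy reward
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]             # incremental mean
    return q, n

# Arm 1 has the highest true mean, so it should end up pulled the most.
q, n = epsilon_greedy_bandit([0.2, 0.8, 0.5])
```
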
<p><strong>Week 2</strong> focuses on Markov Decision Processes (MDPs) and different types of tasks, such as episodic and continuing tasks. If you are familiar with reinforcement learning, or as you proceed further in this course, you will realise that the MDP is one of the foundations and will never leave your side.</p>
<p><strong>Week 3</strong> introduces policies and value functions, the Bellman equation, and optimality. After defining a task, it is also important to compute the value at each state and to check whether that value is optimal. If not, we might need to change the policy or try a different one.</p>
<p><strong>Week 4</strong> focuses on dynamic programming, using it for policy evaluation and policy iteration, followed by an assignment to find an optimal policy with dynamic programming. I really enjoyed this assignment: Coursera provides a model of parking demand with a reward function that reflects its preferences, and the task is to determine the optimal policy.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*vOWHbhJ2hKSV-lpnibyPlA.png" alt /></p>
<p>Once you are done with the course, you just cannot wait to start the next part in the series. I am super excited to give it a try and explore more of reinforcement learning.</p>
]]></content:encoded></item><item><title><![CDATA[56: Safety Operating System]]></title><description><![CDATA[Re-imagining Security and Safety

Introduction
At 56, we believe that everyone should have access to world-class physical security. 56 has built the first safety operating system for homes and businesses which leverages an interconnected network of g...]]></description><link>https://connectai.blog/56-safety-operating-system</link><guid isPermaLink="true">https://connectai.blog/56-safety-operating-system</guid><dc:creator><![CDATA[Harshit Sharma]]></dc:creator><pubDate>Wed, 05 Mar 2025 19:00:55 GMT</pubDate><content:encoded><![CDATA[<p>Re-imagining Security and Safety</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*z-LNzD8nJIrQ2adirspRIg.png" alt /></p>
<h1 id="heading-introduction"><strong>Introduction</strong></h1>
<p>At <a target="_blank" href="https://56secure.com/">56</a>, we believe that everyone should have access to world-class physical security. <a target="_blank" href="https://56secure.com/">56</a> has built the first safety operating system for homes and businesses, which leverages an interconnected network of guards and AI smart cameras to deliver an efficient safety experience to our customers. We’re doing this primarily by developing location services, machine learning, and hardware running deep learning models. In the last year, we have resolved more than 2,000 SOS calls, installed 1,500 AI cameras, and secured thousands of Bengalureans with the help of our platform and network of guards. Our Guardian360 platform takes signals from the connected network of guards and smart AI cameras to deliver an end-to-end safety experience to our users.</p>
<p>Our eyes and ears aren’t designed to scan an area looking for suspicious activity. Our perceptual abilities are too easily manipulated, resulting in a false sense of security. Maintaining private spaces is a demanding task: keeping an eye on residents, visitors, customers, and employees requires adequate surveillance and security. Manually monitoring a CCTV live stream in real time is error-prone and adds to cognitive fatigue. In a traditional physical security setup there is no monitoring layer over the security guards, leading to zero visibility into the security infrastructure, which makes the traditional security ecosystem reactive in nature.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*FeRV61pjD5nCPq1f7gIHhg.gif" alt /></p>
<p>We want to transform the way security and process monitoring is achieved, via deep integration between computer vision, advanced location services, and real-time physical intervention. It doesn’t matter where you are, on the street, at home, or at work: our connected Guardian360 platform provides our users with peace of mind round the clock. Our users can also rest assured that our safety agents are always just a few minutes away.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*V2unPw17VbskW57mP8MuLA.png" alt /></p>
<blockquote>
<p>Safety agents on the ground, AI on the edge camera and Secure App are three main pillars that enable us to give peace of mind to our users.</p>
</blockquote>
<p>56 combines safety agents, machine learning, and video intelligence powered by AI cameras to give users complete protection in real time. Wireless smart AI cameras run deep learning models and support advanced recognition of events such as known/unknown visitors, pets, and vehicles. Users can set up camera-assisted alerts to the guards and request immediate “SOS” assistance using our application. Our network of trained, PSARA-certified safety agents is assisted by clustering and dynamic patrol-route allocation, which ensures the availability of safety agents in a locality. Geofencing and real-time video intelligence help us track processes and workflows. 56 seamlessly combines it all to give users peace of mind.</p>
<h1 id="heading-infrastructure-that-ingests-millions-of-signals-to-keep-users-safe"><strong>Infrastructure that ingests millions of signals to keep users safe</strong></h1>
<p>Our <a target="_blank" href="https://56secure.com/">Guardian360</a> platform uses signals from the connected network of smart AI cameras and safety agents to deliver an end-to-end safety experience to our users. We process more than a million events a day. On the engineering side, we’re striving to make 56 reliable 99.99% of the time for our users. With millions of events coming into our platform, it’s important to stay on top of what’s going on across the board.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:600/1*SKoKuA-9vlzK0XnbRFh_cQ.gif" alt /></p>
<h1 id="heading-context-engine-service-vision"><strong>Context Engine Service : Vision</strong></h1>
<p>AI-on-the-edge cameras can proactively detect incidents in real time. Smart AI cameras powered by deep learning detect events including people, pets, and vehicles with high accuracy. Real-time event analysis, anomaly detection using machine learning, and 24X7 camera-uptime monitoring add an extra layer of visibility, which is critical when it comes to security. Unlike traditional CCTV, our platform can help deter crimes before they escalate. The Vision service, assisted by the AI cameras, also enables security-related workflows for homes, businesses, and apartments.</p>
<h1 id="heading-intelligent-location-service-cerebro"><strong>Intelligent Location Service : Cerebro</strong></h1>
<p>Our vision is not only to keep our users safe on their premises but also to act as their first responders whenever they are in need, even outside their homes. This makes it critical to monitor and track safety agents in real time. The platform, assisted by clustering algorithms, leverages location signals to enable risk-based dynamic deployment in an area and alerts our in-house war-room team in case of any anomaly.</p>
<p>In case of an emergency, we use our interconnected network of safety agents and broadcast any critical information that can help avert a threat in real time. At the same time, we can track and monitor the workflows and policies assigned to the guards in apartments and businesses, and report on them to the users.</p>
<h1 id="heading-secure-app"><strong>Secure App</strong></h1>
<p>The Secure App lets users check and get updates on the live security status of their home and locality. It also lets users raise an “SOS” to the network of safety agents in a security emergency. Users can also request specific actions, view feeds, play archived footage, and approve/reject visitors, as well as check their security status at home &amp; outside, flag security threats, and report incidents using the app.</p>
<blockquote>
<p><strong><em>If a culture of problem solving resonates with you, and if you like having an impact on millions of people, you might be a great fit for</em></strong> <a target="_blank" href="https://56secure.com/"><strong><em>56</em></strong></a><strong><em>. We’re currently hiring for a variety of technical roles, so do check out our</em></strong> <a target="_blank" href="https://www.linkedin.com/company/56secure"><strong><em>Linkedin</em></strong></a> <strong><em>page for more information.</em></strong></p>
</blockquote>
<h1 id="heading-impact"><strong>Impact</strong></h1>
<p>Over the past year we managed to create an impact by averting numerous incidents across localities. Check out what our subscribers have to say.</p>
]]></content:encoded></item><item><title><![CDATA[56: How we utilise millions of location signals to provide proactive security]]></title><description><![CDATA[Re-imagining Security and Safety

Introduction
At 56, we leverage an interconnected real-time network of guards and AI smart cameras to deliver an efficient safety experience to our users. We’re doing this primarily by developing location services, a...]]></description><link>https://connectai.blog/56-how-we-utilise-millions-of-location-signals-to-provide-proactive-security</link><guid isPermaLink="true">https://connectai.blog/56-how-we-utilise-millions-of-location-signals-to-provide-proactive-security</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Harshit Sharma]]></dc:creator><pubDate>Wed, 05 Mar 2025 18:59:34 GMT</pubDate><content:encoded><![CDATA[<p>Re-imagining Security and Safety</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*tP-_k7XFMtXHPsFF1O61mA.png" alt /></p>
<h2 id="heading-introduction"><strong>Introduction</strong></h2>
<p>At <a target="_blank" href="http://www.56secure.com/">56</a>, we leverage an interconnected real-time network of guards and AI smart cameras to deliver an efficient safety experience to our users. We’re doing this primarily by developing location services, algorithms, and AI cameras running deep learning models on the edge. 56 combines proactive safety agents on the ground, running street-specific algorithmic routes, with video intelligence powered by AI cameras to give its users complete protection. As a result, 56 is a proactive and adaptable safety network operating 24X7.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*FeRV61pjD5nCPq1f7gIHhg.gif" alt /></p>
<h2 id="heading-sos"><strong>SOS</strong></h2>
<p>If you have pressed <strong>SOS</strong> on the <a target="_blank" href="https://secure56.page.link/download">56Secure</a> app, you might know how simple the process is. Just press our <strong>SOS</strong> button, and based on your location, our nearest safety agent will reach you within minutes for any safety or security assistance. The process is simple, but a lot is happening behind the scenes.</p>
<p>The secret behind <strong>SOS</strong> is our platform, which leverages our safety agents’ location data for insightful and intelligent decision-making. While safety agents are on the move, millions of events flow into our system, which ultimately helps our safety agents locate and reach our users within minutes.</p>
<p>When a user presses the SOS button on our <a target="_blank" href="https://secure56.page.link/download">56Secure</a> application, we immediately connect the nearest safety agent with the customer. Our war-room team closely monitors the situation. The assigned safety agent, who is trained to de-escalate, immediately goes to the location and finds the customer needing safety assistance. At the same time, other safety agents are aware and can be called upon as backup. Any anomaly or spot reported using our application helps us deploy guards based on historical incidents and send optimised patrol routes and workflows to the safety agents.</p>
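<p>As a toy illustration of the “nearest safety agent” step (not 56’s production logic, which per this post also weighs road distance and traffic), one can rank agents by great-circle distance to the user:</p>

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) points."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def nearest_agent(user_pos, agent_positions):
    """Pick the agent closest to the user by straight-line distance."""
    return min(agent_positions,
               key=lambda a: haversine_km(user_pos, agent_positions[a]))

# Hypothetical positions around Bengaluru.
agents = {"agent-a": (12.9716, 77.5946), "agent-b": (13.0500, 77.7000)}
chosen = nearest_agent((12.9720, 77.5950), agents)
```
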
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*gi1tDYfgeLRnwouQG5CUYg.jpeg" alt="56 War-room" /></p>
<p>56 War Room: 24X7</p>
<h1 id="heading-intelligent-location-service-cerebro"><strong>Intelligent Location Service: Cerebro</strong></h1>
<blockquote>
<p><em>We process over seven million location signals daily to keep our users safe.</em></p>
</blockquote>
<p><strong>Cerebro</strong> ingests every single GPS point from hundreds of safety agents on the ground and stores historical information, which ultimately helps our data science team deploy and position guards in an area. The Data Science team at <strong>56</strong> also performs in-depth geospatial analysis of road distance and traffic to enhance the customer service experience and assist safety agents on the ground in doing their duty effectively. Every day our platform manages millions of location events.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*ZvMvMIoCzXcmitEJ-6fFgQ.png" alt /></p>
<p>Our proprietary algorithms use location data and street distance to divide the entire city into a grid of cells and associate information with each cell depending on the density of our users in a locality. This allows a flexible trade-off between precision and efficiency, depending on the task, which can be SOS or the scheduling of jobs and services in an area. Metadata associated with each cluster updates at regular intervals, which ultimately helps us assist users and deliver a great customer experience.</p>
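<p>A minimal sketch of the grid idea, assuming a fixed cell size in degrees. The cell size, function name, and metadata scheme here are illustrative assumptions, not 56’s actual implementation:</p>

```python
import math

# Hypothetical sketch: map a GPS point to a fixed-size grid cell.
# Cell size and helper name are illustrative assumptions only.

def cell_for(lat: float, lng: float, cell_deg: float = 0.005) -> tuple:
    """Return the (row, col) grid cell containing a GPS point.

    A cell of ~0.005 degrees is roughly 500 m on a side at
    Bangalore's latitude, a plausible precision/efficiency trade-off.
    """
    return (math.floor(lat / cell_deg), math.floor(lng / cell_deg))

# Nearby points fall in the same cell, so per-cell metadata (user
# density, historical incidents) can be aggregated cheaply.
print(cell_for(12.9716, 77.5946) == cell_for(12.9717, 77.5947))  # True
```

<p>Shrinking the cell size increases precision at the cost of more cells to track, which is the trade-off described above.</p>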
<h2 id="heading-impact"><strong>Impact</strong></h2>
<p>Over the past year, we created an impact by averting numerous incidents in Bangalore in real-time. Check out what our subscribers have to say.</p>
]]></content:encoded></item><item><title><![CDATA[A Connected Approach: The Changing Landscape of Security Solutions]]></title><description><![CDATA[Interconnected Deep Tech Architecture
In the realm of security, one might instinctively think that a camera or a security guard offers proactive protection for users. Yet, in reality, neither option delivers a comprehensive security system on its own...]]></description><link>https://connectai.blog/a-connected-approach-the-changing-landscape-of-security-solutions</link><guid isPermaLink="true">https://connectai.blog/a-connected-approach-the-changing-landscape-of-security-solutions</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[AI]]></category><category><![CDATA[technology]]></category><category><![CDATA[product]]></category><category><![CDATA[Startups]]></category><dc:creator><![CDATA[Harshit Sharma]]></dc:creator><pubDate>Wed, 05 Mar 2025 18:58:04 GMT</pubDate><content:encoded><![CDATA[<p><img src="https://miro.medium.com/v2/resize:fit:700/1*5FRtgp96f3bIldTfmcXfmQ.png" alt /></p>
<p>Interconnected Deep Tech Architecture</p>
<p>In the realm of security, one might instinctively think that a camera or a security guard offers proactive protection for users. Yet, in reality, neither option delivers a comprehensive security system on its own. At 56 Secure, as we endeavor to create a cohesive security ecosystem, it’s vital to understand the limitations of traditional security setups and why they often react rather than preempt threats. Let’s delve into why standalone solutions like cameras or security guards fall short of providing proactive security for users.</p>
<h1 id="heading-limitations-of-cameras"><strong>Limitations of Cameras</strong></h1>
<p>Cameras are often seen as an easy and affordable way to provide security to a property. However, cameras have several limitations that make them ineffective as a standalone security measure. First, cameras are passive devices that simply record what’s happening. They don’t have the ability to intervene or prevent an incident from occurring. Second, cameras can be easily disabled or bypassed. For example, an intruder can wear a mask or hat to avoid detection or simply destroy the camera. Third, cameras can only capture what’s in their field of view, which means that blind spots can exist in the coverage area.</p>
<h1 id="heading-limitations-of-security-guards"><strong>Limitations of Security Guards</strong></h1>
<p>Security guards are often seen as a more active security solution compared to cameras. However, security guards also have limitations that make them ineffective as a standalone security measure. First, security guards can only cover a limited area at a time, which means that large properties may require multiple guards. Second, security guards are human and can be overpowered or distracted. Third, security guards are expensive and may not be affordable for smaller businesses or individuals.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*3VOk9_BM2zAfbVwRHocBjw.png" alt /></p>
<h1 id="heading-the-importance-of-integrated-security-systems"><strong>The Importance of Integrated Security Systems</strong></h1>
<p>To provide proactive security, a comprehensive security system is needed. This includes not only cameras and security guards but also intelligent systems, alarms, access control systems, and other security measures. By integrating these systems, it’s possible to create a layered approach to security that provides multiple levels of protection. For example, if a camera detects an intruder, an alarm can be triggered to alert security guards or the user, who can then respond and prevent the intrusion.</p>
<h1 id="heading-ai-cameras"><strong>AI Cameras</strong></h1>
<p>AI-powered CCTV cameras represent a significant leap in security technology. Unlike traditional CCTV systems that passively record footage for later review, AI-enabled cameras can actively analyze and interpret the visual data in real-time. This means they can identify unusual patterns, recognize faces, and even differentiate between types of objects or activities. The immediate advantage of this is the ability to send real-time alerts in the event of suspicious or anomalous activity, allowing for faster response times. Furthermore, with machine learning, these systems can continuously improve their detection algorithms based on new data, thereby reducing false alarms and enhancing accuracy. The integration of AI into CCTV systems, therefore, offers a more proactive and efficient approach to security and surveillance.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*QhaZiOMP-1rBckHkCawwIg.png" alt /></p>
<p>56 AI One</p>
<h1 id="heading-56-ai-one"><strong>56 AI One</strong></h1>
<p>As we gear towards launching our flagship camera in the market, we are proud to unveil a groundbreaking integration of artificial intelligence into our CCTV systems. This camera is not just a tool for surveillance but a smart assistant capable of distinguishing between normal and suspicious activities, offering real-time alerts, and making data-driven decisions to enhance security. With advanced image recognition and machine learning algorithms at its core, our AI-powered camera promises to redefine the standards of safety and surveillance. Businesses, institutions, and homeowners can now expect unparalleled security insights, making their environments safer and more responsive than ever before.</p>
<h1 id="heading-conclusion"><strong>Conclusion</strong></h1>
<p>In conclusion, a camera or security guard in isolation cannot provide proactive security to users. Cameras are passive devices that can be easily disabled or bypassed, while security guards can only cover a limited area at a time and are human and fallible. To provide effective security, a comprehensive security system that integrates multiple security measures is necessary. By doing so, users can ensure that they have the best possible protection against intrusions and other security threats.</p>
]]></content:encoded></item><item><title><![CDATA[Measuring Efficiency of SOC Neural Network Chips: A Comprehensive Approach]]></title><description><![CDATA[As artificial intelligence continues to revolutionize various industries, the demand for efficient System-on-Chip (SOC) neural network processors has skyrocketed. When designing the 56 AI One camera, selecting the System on Chip (SoC) based on both p...]]></description><link>https://connectai.blog/measuring-efficiency-of-soc-neural-network-chips-a-comprehensive-approach</link><guid isPermaLink="true">https://connectai.blog/measuring-efficiency-of-soc-neural-network-chips-a-comprehensive-approach</guid><category><![CDATA[AI]]></category><category><![CDATA[hardware]]></category><category><![CDATA[Startups]]></category><dc:creator><![CDATA[Harshit Sharma]]></dc:creator><pubDate>Wed, 05 Mar 2025 18:56:13 GMT</pubDate><content:encoded><![CDATA[<p>As artificial intelligence continues to revolutionize various industries, the demand for efficient System-on-Chip (SOC) neural network processors has skyrocketed. When designing the <a target="_blank" href="https://www.56secure.com/ai-cam-one">56 AI One camera</a>, selecting the System on Chip (SoC) based on both performance and cost was crucial. Over the last 4 years I got the opportunity to work and evaluate difference SoCs , ranging from Nvidia Jetson to Ambarella CV25. These SoC are designed to accelerate AI workloads, but how do we measure their efficiency? This blog post explores the key metrics and methodologies used to evaluate the performance of SOC neural network chips.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*iL7ZvNyqGcVJOJbQixFWbA.jpeg" alt /></p>
<p>Nvidia Jetson Nano and Jetson Xavier modules</p>
<h1 id="heading-benchmarking-methodologies"><strong>Benchmarking Methodologies</strong></h1>
<p>Standard industry benchmarks like <a target="_blank" href="https://www.nvidia.com/en-in/data-center/resources/mlperf-benchmarks/">MLPerf</a> provide a common ground for comparing different chips across a range of AI tasks, including image classification, object detection, and natural language processing. The MLPerf benchmarks measure consistent performance metrics such as training time to target quality, inference latency, and throughput. This ensures that the performance comparisons are based on well-defined and universally accepted criteria. These benchmarks typically measure both inference and training performance, offering insights into how chips perform in various scenarios. However, standard benchmarks alone may not tell the whole story.</p>
<h1 id="heading-lessons-learned-why-tops-number-isnt-enough"><strong>Lessons Learned: Why the TOPS Number Isn’t Enough</strong></h1>
<p><strong>TOPS (Tera Operations Per Second)</strong> has long been a go-to metric for comparing AI chip performance, but it’s increasingly clear that this measure alone falls short in capturing real-world AI chip efficiency. While TOPS provides a quick snapshot of raw computational power, it fails to account for critical factors like memory bandwidth, data movement costs, and the varying computational demands of different AI models. For instance, a chip boasting high TOPS might underperform in practice due to memory bottlenecks or inefficient handling of sparse operations common in many neural networks. Through my experience working with and evaluating various SoCs, from <a target="_blank" href="https://www.nvidia.com/en-in/autonomous-machines/embedded-systems/jetson-nano/product-development/">Nvidia Jetson Nano</a> to <a target="_blank" href="https://www.ambarella.com/wp-content/uploads/Ambarella_CV25S_Product_Brief_14SEP2021.pdf">Ambarella CV25</a>, I’ve gathered significant insights in this domain.</p>
<p>Custom benchmarks tailored to specific use cases are equally important, as they can reveal how a chip performs under real-world conditions relevant to its intended application. For instance, an edge AI chip for autonomous vehicles might be benchmarked using specific computer vision tasks under varying lighting and weather conditions. It’s crucial to employ a diverse set of workloads in testing, encompassing different neural network architectures (CNNs, RNNs, Transformers), model sizes, and precision levels (FP32, FP16, INT8). This diversity ensures a comprehensive evaluation of the chip’s versatility and efficiency across various AI applications. Additionally, benchmarking should consider not just raw performance, but also power consumption, thermal characteristics, and consistency of performance over extended periods. By combining standardized benchmarks with application-specific tests and a wide range of workloads, we can gain a holistic understanding of a SOC Neural Network Chip’s efficiency and suitability for different AI tasks.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*igtGJ3sCUKQk1OC74H5jGA.png" alt /></p>
<p><strong>Average inference time by device</strong></p>
<h1 id="heading-tops-tera-operations-per-second"><strong>TOPS: Tera Operations Per Second</strong></h1>
<p>The most common metric used in the industry is TOPS (Tera Operations Per Second), which measures the raw computational power of a chip. TOPS indicates how many trillion operations the chip can perform in one second, providing a straightforward comparison of processing capabilities.</p>
<p>However, TOPS alone doesn’t paint the full picture of a chip’s efficiency. That’s where TOPS/W (TOPS per Watt) comes in. This metric combines processing power with energy consumption, offering insight into the chip’s efficiency. A higher TOPS/W indicates better performance per unit of power, which is crucial for mobile and edge devices with limited energy resources. It is important not only to calculate the theoretical TOPS of the SoC, but also to take precise measurements once the AI pipelines are deployed.</p>
<p><strong>Calculating TOPS for a Deep Learning Computer Vision Model:</strong></p>
<ol>
<li><p><strong>Understand the model architecture:</strong> First, you need to know the structure of your model, including the number and types of layers, the size of input tensors, and the number of parameters in each layer.</p>
</li>
<li><p><strong>Count the operations:</strong> For each layer in your model, count the number of floating-point operations (FLOPs) required for a single forward pass.</p>
</li>
</ol>
<p>The main operations to consider are:</p>
<ul>
<li><p>Convolutions</p>
</li>
<li><p>Matrix multiplications</p>
</li>
<li><p>Activations</p>
</li>
<li><p>Pooling operations</p>
</li>
</ul>
<p>Calculate FLOPs for each layer. For a convolutional layer: FLOPs = 2 × H × W × Cin × Cout × K × K, where:</p>
<ul>
<li><p>H and W are the height and width of the output feature map</p>
</li>
<li><p>Cin is the number of input channels</p>
</li>
<li><p>Cout is the number of output channels</p>
</li>
<li><p>K is the kernel size</p>
</li>
</ul>
<p>For a fully connected layer: FLOPs = 2 × Nin × Nout, where:</p>
<ul>
<li><p>Nin is the number of input neurons</p>
</li>
<li><p>Nout is the number of output neurons</p>
</li>
</ul>
<ol>
<li><p><strong>Sum up the total FLOPs:</strong> Add up the FLOPs from all layers to get the total number of operations for one forward pass of the model.</p>
</li>
<li><p><strong>Consider the inference speed:</strong> Determine how many inferences your hardware can perform per second. This is often provided by the hardware manufacturer or can be measured through benchmarking.</p>
</li>
<li><p><strong>Calculate TOPS:</strong> TOPS = (Total FLOPs * Inferences per second) / 10¹² This gives you the Tera Operations Per Second.</p>
</li>
</ol>
<p>Example: Let’s say you have a simple CNN with:</p>
<ul>
<li><p>1 convolutional layer: 3x3 kernel, 3 input channels, 64 output channels, 224x224 input size</p>
</li>
<li><p>1 fully connected layer: 50,176 inputs, 1000 outputs</p>
</li>
<li><p>The hardware can perform 100 inferences per second</p>
</li>
</ul>
<p><strong>Calculations:</strong></p>
<ol>
<li><p>Conv layer FLOPs = 2 × 224 × 224 × 3 × 64 × 3 × 3 = 173,408,256</p>
</li>
<li><p>FC layer FLOPs = 2 × 50,176 × 1,000 = 100,352,000</p>
</li>
<li><p>Total FLOPs = 173,408,256 + 100,352,000 = 273,760,256</p>
</li>
<li><p>TOPS = (273,760,256 * 100) / 10¹² = 0.0274 TOPS</p>
</li>
</ol>
<p>This method provides a rough estimate of TOPS. In practice, the calculation can be more complex due to factors like optimizations, quantization, and the specific hardware architecture being used.</p>
<blockquote>
<p><em>Note: The TOPS calculation assumes 100 inferences per second. If this assumption is incorrect or needs to be adjusted, the final TOPS value would change accordingly.</em></p>
</blockquote>
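<p>The worked example above can be checked with a few lines of Python; the helper names here are just for illustration:</p>

```python
# Reproduce the FLOPs/TOPS estimate from the worked example above.

def conv_flops(h, w, c_in, c_out, k):
    # FLOPs = 2 * H * W * Cin * Cout * K * K
    return 2 * h * w * c_in * c_out * k * k

def fc_flops(n_in, n_out):
    # FLOPs = 2 * Nin * Nout
    return 2 * n_in * n_out

conv = conv_flops(224, 224, 3, 64, 3)     # 173,408,256
fc = fc_flops(50_176, 1000)               # 100,352,000
total = conv + fc                         # 273,760,256

inferences_per_sec = 100                  # assumed, as in the example
tops = total * inferences_per_sec / 1e12
print(f"{tops:.4f} TOPS")                 # 0.0274 TOPS
```

<p>Swapping in your model’s real layer shapes and a measured inference rate turns this rough estimate into a usable comparison number.</p>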
<h1 id="heading-frames-per-second-fps"><strong>Frames per Second (FPS)</strong></h1>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*DiPXOBg9ewlH0Qee9YeI3Q.png" alt /></p>
<p>While TOPS and TOPS/W are important, they don’t always translate directly to real-world performance. This is where application-specific metrics come into play. For vision-related tasks, Frames per Second (FPS) measures how many image frames a chip can process in real-time, which is particularly relevant for applications like video analytics or autonomous driving. Latency, on the other hand, measures the time taken to complete a single inference, which is critical for real-time applications that require quick responses.</p>
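<p>A minimal sketch of how latency and FPS can be measured for any inference callable. The <code>benchmark</code> helper and the dummy workload below are illustrative assumptions, not tied to any specific chip’s tooling:</p>

```python
import time

# Measure per-inference latency and throughput (FPS) for a callable.
# `dummy_infer` stands in for a real model's forward pass.

def benchmark(infer, frames, warmup=10):
    for _ in range(warmup):              # discard cold-start effects
        infer()
    latencies = []
    for _ in range(frames):
        t0 = time.perf_counter()
        infer()
        latencies.append(time.perf_counter() - t0)
    avg = sum(latencies) / len(latencies)
    return avg * 1000, 1.0 / avg         # latency in ms, throughput in FPS

def dummy_infer():                       # stand-in workload
    sum(i * i for i in range(10_000))

latency_ms, fps = benchmark(dummy_infer, frames=50)
print(f"latency {latency_ms:.2f} ms, {fps:.0f} FPS")
```

<p>For real-time video analytics you would also track tail latency (e.g. the slowest 1% of frames), since a good average can hide occasional stalls.</p>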
<h1 id="heading-memory-bandwidth"><strong>Memory Bandwidth</strong></h1>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*FPUH5Yl1_wygxG9GY7HpQg.jpeg" alt /></p>
<p>56 AI One Camera powered by Ambarella CV25</p>
<p>Another crucial aspect of neural network chip efficiency is memory performance. Memory bandwidth can often become a bottleneck in neural network processing, limiting the chip’s ability to fully utilize its computational resources. High memory bandwidth generally allows for better utilization of the chip’s processing power. Additionally, for applications where space is at a premium, area efficiency (measured in TOPS/mm²) becomes an important consideration, relating performance to chip size.</p>
<p>It’s also important to consider the chip’s flexibility and support for different neural network architectures. While not directly related to chip performance, the accuracy achieved on standard benchmarks helps evaluate how well the chip supports various neural network models. Some chips may excel at certain types of networks but perform poorly on others, so it’s crucial to evaluate performance across a range of relevant models and tasks.</p>
<h1 id="heading-conclusion"><strong>Conclusion</strong></h1>
<p>Lastly, for commercial applications, the cost-effectiveness of a chip is a critical factor. Performance per dollar helps compare different solutions in terms of their economic viability. When evaluating SOC neural network chips, it’s essential to consider multiple metrics and how they relate to your specific use case. A chip with high TOPS might not be the best choice if power consumption is a primary concern, while a highly energy-efficient chip might fall short in applications requiring high-speed processing. By taking a comprehensive approach and considering various metrics, developers and businesses can make informed decisions when selecting the most suitable SOC neural network chip for their needs.</p>
]]></content:encoded></item><item><title><![CDATA[Unlocking AI Potential: A Deep Dive into NPU, GPU, TPU and FPGA]]></title><description><![CDATA[We have witnessed significant advancements in AI-specific hardware, ranging from the widespread adoption of GPUs to the emergence of specialized NPUs and TPUs. Each development has brought us closer to realizing the full potential of AI at the edge. ...]]></description><link>https://connectai.blog/unlocking-ai-potential-a-deep-dive-into-npu-gpu-tpu-and-fpga</link><guid isPermaLink="true">https://connectai.blog/unlocking-ai-potential-a-deep-dive-into-npu-gpu-tpu-and-fpga</guid><category><![CDATA[GPU]]></category><category><![CDATA[tpu]]></category><category><![CDATA[AI]]></category><category><![CDATA[hardware]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Harshit Sharma]]></dc:creator><pubDate>Wed, 05 Mar 2025 18:52:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/EiAqej-cGks/upload/7a1677d935e81fe58ce6529944fba00b.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We have witnessed significant advancements in AI-specific hardware, ranging from the widespread adoption of GPUs to the emergence of specialized NPUs and TPUs. Each development has brought us closer to realizing the full potential of AI at the edge. However, I believe we are on the verge of something even more groundbreaking. Intel has introduced its ‘Lunar Lake’ processors with enhanced NPU capabilities. Meanwhile, outside the ‘PC’ realm, Apple recently announced an upgraded NPU in the M4 chip, which powers their latest products.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:600/1*GkE1d2gzRfZCZ5cP0MkAGg.png" alt /></p>
<p>The first Apple Neural Engine that debuted within Apple’s A11 chip in 2017’s iPhone X was powerful enough to support Face ID and Animoji.</p>
<p>As we push the boundaries of what’s possible with artificial intelligence at the network edge, I find myself wondering: what truly transformative breakthroughs lie just beyond the horizon? Perhaps the next leap forward will come from novel architectures that fundamentally reimagine how we process AI workloads. Let’s explore the details of Neural Processing Units (NPUs) and compare them with other AI-specific processors, such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs).</p>
<h1 id="heading-the-next-wave-of-ai-optimized-compute"><strong>The Next Wave of AI-Optimized Compute</strong></h1>
<p>When choosing the SoC for <a target="_blank" href="https://www.56secure.com/ai-cam-one">56 AI One camera</a>, understanding memory utilization and compute efficiency was crucial for handling multiple AI workflows. Want to know how to measure efficiency in AI SoCs? Read my <a target="_blank" href="https://medium.com/@hrsht.sarma/measuring-efficiency-of-soc-neural-network-chips-a-comprehensive-approach-6973e1e46a0a">blog</a> for more details.</p>
<p>In-memory computing tackles the von Neumann bottleneck directly by performing computations within memory units. This approach significantly reduces data movement, resulting in substantial improvements in both speed and energy efficiency for AI workloads. Reconfigurable AI accelerators are also emerging: these chips can dynamically adapt their architecture to optimally suit various AI tasks, providing unparalleled flexibility and efficiency across a wide range of applications.</p>
<p>As we strive for more powerful and efficient AI systems, the distinctions between different types of specialized processors are becoming less clear. The future may not lie in discrete GPUs, TPUs, or NPUs, but in highly integrated, multi-paradigm chips that seamlessly combine various computational approaches to address the most challenging AI problems.</p>
<h1 id="heading-npu-neural-processing-units"><strong>NPU: Neural Processing Units</strong></h1>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*OiNJn76qOQOYpP3-rHnnjQ.jpeg" alt /></p>
<p>NPUs are specialized chips designed specifically for AI tasks, particularly those involving neural networks. Their architecture is optimized to handle the parallelism and efficiency required for AI computations, often outperforming traditional CPUs and even GPUs in certain tasks.</p>
<p><strong>Key Characteristics of NPUs:</strong></p>
<ul>
<li><p><strong>Optimized Parallel Processing:</strong> Designed to handle multiple AI operations simultaneously.</p>
</li>
<li><p><strong>Energy Efficiency:</strong> Lower power consumption compared to GPUs, making them ideal for mobile and edge computing.</p>
</li>
<li><p><strong>Synaptic Weight Mechanism:</strong> This principle enhances learning efficiency by strengthening frequently used pathways, akin to synaptic activity in the human brain.</p>
</li>
</ul>
<p>Some of the NPUs worth checking out: <a target="_blank" href="https://en.wikipedia.org/wiki/Qualcomm_Hexagon">Qualcomm Hexagon</a> and the <a target="_blank" href="https://apple.fandom.com/wiki/Neural_Engine">Apple Neural Engine</a>.</p>
<h1 id="heading-tpu-tensor-processing-unit"><strong>TPU: Tensor Processing Unit</strong></h1>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*PWO3Zz6E5DHqSdeULQriBA.jpeg" alt /></p>
<p>TPU: High-Level Architecture</p>
<p>In mathematics, a <strong>tensor</strong> is an algebraic object that describes a multilinear relationship between sets of algebraic objects related to a vector space. Tensors may map between different objects such as vectors, scalars, and even other tensors.</p>
<blockquote>
<p><em>A tensor may be represented as a (potentially multidimensional) array.</em></p>
</blockquote>
<p>So, depending on the nature of the tensor, it can be represented as an array of n dimensions, where n is 0, 1, 2, 3, and so on. Some of these representations have more familiar names:</p>
<ul>
<li><p>Dimension 0 — scalar</p>
</li>
<li><p>Dimension 1 — vector</p>
</li>
<li><p>Dimension 2 — matrix</p>
</li>
</ul>
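<p>The dimension counts above can be illustrated with plain nested lists standing in for a tensor library’s arrays; the <code>ndim</code> helper is just for illustration:</p>

```python
# Count the dimensions of a nested-list "tensor": 0 for a scalar,
# 1 for a vector, 2 for a matrix, and so on.

def ndim(t):
    d = 0
    while isinstance(t, list):
        t = t[0]
        d += 1
    return d

scalar = 3.0
vector = [1.0, 2.0, 3.0]
matrix = [[1.0, 2.0], [3.0, 4.0]]

print(ndim(scalar), ndim(vector), ndim(matrix))   # 0 1 2
```

<p>Real tensor libraries expose the same idea directly, e.g. an array’s dimension-count attribute, without walking nested lists.</p>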
<p>Why is a Tensor Processing Unit so called? Because it is designed to speed up operations involving tensors. Which operations, precisely? The operations referred to in our original Wikipedia definition, which described a tensor as a “map (multilinear relationship) between different objects such as vectors, scalars, and even other tensors”.</p>
<h1 id="heading-gpu"><strong>GPU</strong></h1>
<p>Graphics Processing Units (GPUs) have become indispensable in the field of artificial intelligence. Originally designed to accelerate graphics rendering, their highly parallel structure makes them ideally suited for handling the extensive computations required in AI workloads. GPUs excel in performing numerous calculations simultaneously, which is crucial for training deep neural networks and processing large datasets. NVIDIA, a leader in GPU technology, has significantly enhanced the capabilities of GPUs for AI through its CUDA platform, which allows developers to exploit the parallel processing power of GPUs for a variety of machine learning tasks. This flexibility makes GPUs an attractive option for researchers and companies looking to perform both AI training and inference.</p>
<p>Their mature ecosystem, supported by robust software frameworks and tools, has facilitated widespread adoption and continuous optimization. This extensive support network helps in achieving higher performance and ease of use for developers and researchers alike. Despite the emergence of more specialized processors like TPUs and NPUs, GPUs continue to be a cornerstone of AI hardware, balancing flexibility, power, and accessibility.</p>
<h1 id="heading-fpga"><strong>FPGA?</strong></h1>
<blockquote>
<p><strong><em>Field-Programmable Gate Array (FPGA)</em></strong></p>
</blockquote>
<p><img src="https://miro.medium.com/v2/resize:fit:600/1*tggdeEAoYno271Sou3djdw.jpeg" alt /></p>
<p>Vaaman, an edge computing board developed by Indian Startup <a target="_blank" href="https://vicharak.in/">Vicharak</a> ( Computer Hardware Startup)</p>
<p>Field-Programmable Gate Arrays (FPGAs) offer a unique and powerful solution in the AI hardware landscape. Unlike GPUs and TPUs, which have fixed architectures, FPGAs can be reprogrammed to optimize performance for specific tasks, providing a high degree of customization. This reconfigurability allows developers to tailor the hardware to their exact needs, achieving significant performance improvements for specialized AI applications. FPGAs can be optimized for low latency, making them particularly suitable for real-time AI processing where quick response times are critical.</p>
<blockquote>
<p><em>Do check out</em> <a target="_blank" href="https://vicharak.in/"><em>Vicharak</em></a><em>; their high-performance edge computing board looks promising.</em></p>
</blockquote>
<p>In addition to their adaptability, FPGAs are known for their potential energy efficiency. By customizing the hardware configuration, it is possible to achieve a balance between power consumption and performance that is difficult to match with fixed-architecture processors. This makes FPGAs an attractive choice for edge computing, where energy efficiency is often as important as computational power. Moreover, their flexibility makes them valuable in the prototyping and development stages, allowing for rapid testing and iteration of new AI algorithms before committing to a specific hardware design. As AI applications continue to diversify, the role of FPGAs in providing customizable, efficient, and high-performance solutions is likely to expand.</p>
<h1 id="heading-summary-of-differences"><strong>Summary of Differences:</strong></h1>
<ul>
<li><p><strong>Functionality:</strong> GPUs are versatile, TPUs are specialized for tensor operations, and NPUs are tailored for neural network tasks.</p>
</li>
<li><p><strong>Parallelism:</strong> All three excel in parallel processing but differ in their optimization focus.</p>
</li>
<li><p><strong>Use Cases:</strong> GPUs dominate in data centers, TPUs in Google’s infrastructure, and NPUs in mobile and edge devices.</p>
</li>
<li><p><strong>Energy Efficiency:</strong> TPUs and NPUs are generally more energy-efficient than GPUs, suitable for their respective applications.</p>
</li>
</ul>
<p><strong>Conclusion:</strong></p>
<p>Choosing the right processor for AI applications depends on the specific requirements of the task. GPUs offer flexibility, TPUs provide specialized acceleration for tensor operations, and NPUs deliver efficient neural network processing. As AI technology advances, so will the hardware, driving innovation and efficiency in this exciting field. We still face many limitations when optimizing AI pipelines for specific NPU/SoC architectures, but we expect development to become more seamless as the industry designs new architectures.</p>
]]></content:encoded></item></channel></rss>