AI Building Blocks

Caption + RAG Visual Search

Generate captions for images, embed captions, search via text RAG.

Assemblers: 2

Implementations: 11

Modalities: 3

Complexity Index: 59

Input Bus

Output Depot

Assembler 1 (5 impls)

Image→Text

Image to Text

Generate natural language descriptions of image content. Enables text-based search over visual content.

GPT-4 Vision, Claude 3.5 Sonnet, LLaVA

Text→Vector

OpenAI text-embedding-3-large, Cohere embed-v3, Voyage AI voyage-3

Convert text into dense vector representations for semantic search, clustering, and retrieval.

Quality Lane

Validation, eval reruns, score gates

Control Lane

Retries, fallbacks, route selection

Monitoring Lane

Cost, latency, saturation watchpoints
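
The Quality Lane's eval reruns and score gates can be illustrated with a small sketch. It assumes an eval is a callable that returns a score between 0 and 1; the function name `score_gate` and its parameters are hypothetical and not part of any listed implementation.

```python
from statistics import mean
from typing import Callable


def score_gate(
    run_eval: Callable[[], float],
    threshold: float,
    reruns: int = 3,
) -> tuple[bool, float]:
    """Run the eval `reruns` times; pass only if the mean score meets the threshold."""
    if reruns < 1:
        raise ValueError("reruns must be at least 1")
    # Rerunning smooths out noise from non-deterministic model outputs.
    scores = [run_eval() for _ in range(reruns)]
    average = mean(scores)
    return average >= threshold, average
```

Averaging over reruns is one possible gating policy; a stricter gate could require every run to clear the threshold.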

Document RAG Pipeline

Extract text from documents, chunk, embed, retrieve, generate with LLM.

Assemblers: 3

Implementations: 17

Modalities: 4

Complexity Index: 85

Input Bus

Output Depot

Assembler 1 (5 impls)

Document→Structured Data

Document to Structured

Extract structured information from documents like PDFs, invoices, forms, and contracts.

Docling (IBM), Unstructured.io, Azure Document Intelligence

Text→Vector

OpenAI text-embedding-3-large, Cohere embed-v3, Voyage AI voyage-3

Convert text into dense vector representations for semantic search, clustering, and retrieval.

Assembler 3 (6 impls)

Text→Text

Text to Text

Transform, generate, or reason about text. The core building block for chatbots, summarization, translation, and more.

GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro

Quality Lane

Validation, eval reruns, score gates

Control Lane

Retries, fallbacks, route selection

Monitoring Lane

Cost, latency, saturation watchpoints
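
The chunk step of the Document RAG Pipeline above can be sketched as a sliding window over words. This is only an illustration under assumptions: the function name `chunk_words` and the `size` and `overlap` parameters are hypothetical, and real pipelines often chunk by tokens, sentences, or document structure instead.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of embedding some words twice.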

Voice Assistant Pipeline

Speech-to-text, process with LLM, text-to-speech response.

Assemblers: 3

Implementations: 16

Modalities: 2

Complexity Index: 68

Input Bus

Output Depot

Assembler 1 (6 impls)

Audio→Text

Audio to Text

Transcribe spoken audio into text. The foundation for voice interfaces, meeting transcription, and audio search.

OpenAI Whisper API, Whisper (local), Deepgram

Text→Text

Text to Text

Transform, generate, or reason about text. The core building block for chatbots, summarization, translation, and more.

GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro

Assembler 3 (4 impls)

Text→Audio

Text to Audio

Convert text to natural-sounding speech. Powers voice assistants, audiobooks, and accessibility features.

ElevenLabs, OpenAI TTS, Coqui XTTS

Quality Lane

Validation, eval reruns, score gates

Control Lane

Retries, fallbacks, route selection

Monitoring Lane

Cost, latency, saturation watchpoints
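
The Control Lane's retries and fallbacks can be sketched as an ordered list of providers that are tried in turn. This assumes each provider is a plain callable taking a prompt and returning text; `call_with_fallback` and its parameters are hypothetical names, not the API of any service listed above.

```python
import time
from typing import Callable, Sequence


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    retries: int = 2,
    backoff_s: float = 0.0,
) -> str:
    """Try each provider in order, retrying each up to `retries` extra times."""
    last_error = None
    for provider in providers:
        for attempt in range(retries + 1):
            try:
                return provider(prompt)
            except Exception as err:  # a real system would catch narrower error types
                last_error = err
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError("all providers failed") from last_error
```

In a voice assistant, the same wrapper could sit around each stage, for example falling back from a hosted speech-to-text service to a local model.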

From Image

15 blocks

Image Understanding (5)

Vector

Image Embedding

Convert images directly to dense vector representations for semantic search, clustering, and similarity matching.

Image Captioning

Generate natural language descriptions of image content. Enables text-based search over visual content.

Visual Question Answering

Answer natural language questions about images. Combines vision and language understanding.

Optical Character Recognition

Detect and read text in images and documents. Core for document intake, receipts, and scene text search.

Chart and Table Understanding

Parse charts, diagrams, and tables into structured data for analysis and QA.

Image Perception (5)

Bounding Boxes

Object Detection

Locate and classify objects in images with bounding boxes. Foundational for autonomous vehicles, surveillance, and robotics.
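
Detected boxes are commonly compared with intersection over union (IoU), for example when matching predictions to ground truth. The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples; the `Box` alias and `iou` function are illustrative names.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```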

Segmentation Mask

Image Segmentation

Classify each pixel in an image. Enables precise object boundaries for medical imaging, autonomous vehicles, and image editing.

Depth Map

Depth Estimation

Predict depth from a single image. Critical for 3D reconstruction, AR/VR, and robotics.

Pose Estimation

Detect human or object keypoints. Enables AR overlays, sports analytics, and motion capture.

Optical Flow

Estimate pixel-wise motion between frames. Useful for video editing, stabilization, and robotics.

Image Transformation (5)

3D Model

Image to 3D

Generate 3D models from single or multiple images. Powers 3D asset creation, VR/AR, and e-commerce.

Video

Image to Video

Animate still images into videos. Bring photos to life with natural motion.

Image Transformation

Transform images: style transfer, inpainting, super-resolution, editing, or generation from image prompts.

Background Removal

Segment foreground and remove or replace backgrounds for product photos and portraits.

Face Anonymization

Blur, mask, or re-synthesize faces to protect privacy in images and video frames.

From Text

18 blocks

Text Retrieval (3)

Vector

Text Embedding

Convert text into dense vector representations for semantic search, clustering, and retrieval.
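
Once text is embedded, semantic search reduces to ranking stored vectors by similarity to the query vector. The sketch below uses cosine similarity over plain Python lists and assumes the vectors already come from an embedding model; `cosine` and `top_k` are illustrative names, and a production system would typically use a vector index instead of a linear scan.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3):
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```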

Cross-Encoder Reranking

Re-score retrieved passages with a cross-encoder to boost search precision.
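
The reranking step can be sketched as scoring each (query, passage) pair jointly and re-sorting. In the sketch below, `rerank` accepts any pair scorer; `overlap_score` is a toy lexical stand-in used only so the example runs without a model, and is not a cross-encoder. Both names are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    passages: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Score every (query, passage) pair and return the top_n by score."""
    scored = [(passage, score_pair(query, passage)) for passage in passages]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in scorer: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0
```

Because the scorer sees the query and passage together, reranking is usually applied only to the small candidate set returned by a cheaper first-stage retriever.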

Hybrid Sparse + Dense Retrieval

Combine lexical (BM25) and dense retrieval with weighted fusion or cascades to improve recall and precision for search and RAG.
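
Weighted fusion can be sketched as normalizing each retriever's scores to a common range and blending them. The sketch assumes both retrievers return a `{doc_id: score}` mapping; min-max normalization and the `alpha` weight are one choice among several, and `min_max` and `hybrid_fuse` are illustrative names.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them all to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_fuse(
    sparse: dict[str, float],
    dense: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Blend normalized scores: alpha weights dense, (1 - alpha) weights sparse."""
    s, d = min_max(sparse), min_max(dense)
    fused = {
        doc_id: alpha * d.get(doc_id, 0.0) + (1 - alpha) * s.get(doc_id, 0.0)
        for doc_id in set(s) | set(d)
    }
    # Sort by fused score, breaking ties by doc_id for a stable order.
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))
```

A document missing from one retriever's results contributes zero from that side, so documents found by both retrievers tend to rise.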

Text to Media (4)

Image Generation

Hallucination Detection

Score or flag generated text for factuality and grounding.
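
As a rough illustration of scoring and flagging, the sketch below measures what fraction of an answer's words also appear in the source context. This lexical overlap is only a crude proxy chosen for the example; real detectors typically rely on entailment models or LLM judges. The names `grounding_score` and `flag_ungrounded` and the 0.6 threshold are assumptions.

```python
import re


def _tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def grounding_score(answer: str, context: str) -> float:
    """Fraction of answer tokens that also occur in the context (1.0 for an empty answer)."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 1.0
    context_tokens = set(_tokens(context))
    return sum(tok in context_tokens for tok in answer_tokens) / len(answer_tokens)


def flag_ungrounded(answer: str, context: str, threshold: float = 0.6) -> bool:
    """Flag the answer when its grounding score falls below the threshold."""
    return grounding_score(answer, context) < threshold
```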

Text Transformation (4)

Machine Translation

Translate text between languages. Essential for global communication, localization, and cross-lingual applications.

Text Summarization

Condense long documents into concise summaries. Essential for news aggregation, research, and document processing.

Question Answering

LlamaIndex, LangChain, Haystack, +5 more

Separate or enhance speech in videos using both audio and lip cues. Improves meeting transcription, TV/movie captioning, and noisy recordings.

AV-Separation (MS3), SpeechSplit + Visual Conditioning, NeMo AV-Diarization

From Opportunity

1 block

Opportunity→Strategy

K-Framework: Tornado Test

A strategic framework for service companies to identify technology tornados and become the dominant Service Gorilla. Based on Geoffrey Moore's 'Inside the Tornado' methodology, adapted for the modern tech services market.

Inside the Tornado, Crossing the Chasm

Common Pipelines

Pre-built combinations of building blocks for common use cases.

Direct Visual Search

Embed images directly with CLIP/SigLIP, search by text or image query.

Image to Vector→

Good for:

Photo library search
E-commerce visual search

Pros:

Real-time indexing
Text-to-image search

Caption + RAG Visual Search

Generate captions for images, embed captions, search via text RAG.

Image to Text→