AI Building Blocks

What can you transform? Start from what you have (images, text, audio) and discover which building blocks turn it into what you need. Focus on production-ready solutions, not research.

Featured

From Image

15 blocks

Image Understanding (5)

Image Perception (5)

Image Transformation (5)

From Text

18 blocks

Text Retrieval (3)

Text to Media (4)

Text Generation (3)

Text Analysis (4)

Text Transformation (4)

From Audio

9 blocks
Audio → Text

Speech Recognition

Transcribe spoken audio into text. The foundation for voice interfaces, meeting transcription, and audio search.

OpenAI Whisper API, Whisper (local), Deepgram, +3 more
View implementations →
Audio → Structured Data

Audio Classification

Classify audio into categories like music genres, environmental sounds, speaker emotions, or speech commands.

Audio Spectrogram Transformer (AST), Wav2Vec2, CLAP, +2 more
View implementations →
Audio → Structured Data

Voice Activity Detection

Detect when speech is present in audio. Essential preprocessing for ASR, diarization, and voice interfaces.

Silero VAD, WebRTC VAD, pyannote VAD, +1 more
View implementations →
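
To make the output of this block concrete, here is a minimal energy-threshold sketch. It is not how Silero VAD or pyannote VAD work (those use neural models); the function name `detect_speech`, the 30 ms frame size, and the 0.02 RMS threshold are assumptions chosen for illustration.

```python
import math

def detect_speech(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return (start_sec, end_sec) spans whose per-frame RMS meets the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    spans, start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        t = i * frame_len / sample_rate
        if rms >= threshold and start is None:
            start = t                      # speech begins at this frame
        elif rms < threshold and start is not None:
            spans.append((start, t))       # speech ended before this frame
            start = None
    if start is not None:                  # signal ended while still in speech
        spans.append((start, n_frames * frame_len / sample_rate))
    return spans

# Synthetic input: 0.3 s silence, 0.3 s of a 440 Hz tone, 0.3 s silence.
sr = 16000
silence = [0.0] * 4800
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(4800)]
spans = detect_speech(silence + tone + silence, sr)
```

The resulting spans are what downstream ASR or diarization would consume, so that only frames likely to contain speech are processed.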
Audio → Audio

Audio Transformation

Transform audio signals: enhance, denoise, separate sources, change voice, or convert music styles.

Demucs, RVC (Retrieval-based Voice Conversion), so-vits-svc, +2 more
View implementations →
Audio → Structured Data

Speaker Diarization

Segment audio by speaker to determine 'who spoke when'. Vital for meetings, call centers, and transcription QA.

pyannote.audio, NVIDIA NeMo Diarization, Resemblyzer
View implementations →
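
Diarization output is typically a list of speaker turns with timestamps. The sketch below shows one way such turns could be combined with transcript segments by picking the speaker with the greatest time overlap. The tuple layouts, the `SPEAKER_00` style labels, and the helper names are assumptions for illustration, not the output format of any specific library.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_transcript(transcript, turns):
    """Attach to each transcript segment the speaker whose turn overlaps it most."""
    labeled = []
    for seg_start, seg_end, text in transcript:
        best = max(turns, key=lambda t: overlap(seg_start, seg_end, t[0], t[1]))
        has_overlap = overlap(seg_start, seg_end, best[0], best[1]) > 0
        labeled.append((best[2] if has_overlap else "unknown", text))
    return labeled

# Hypothetical diarization turns: (start_sec, end_sec, speaker_label)
turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
# Hypothetical ASR segments: (start_sec, end_sec, text)
transcript = [
    (0.2, 3.5, "Shall we start?"),
    (4.1, 8.0, "Yes, go ahead."),
    (10.0, 11.0, "Thanks."),
]
labeled = label_transcript(transcript, turns)
```

Segments that fall outside every turn are marked "unknown" rather than guessed, which is useful when reviewing transcripts.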
Audio → Structured Data

Keyword Spotting

Detect wake words and short commands with low latency and tiny footprints.

Google Speech Commands KWS, openWakeWord, Picovoice Porcupine
View implementations →
Audio → Structured Data

Speech Emotion Recognition

Classify speaker emotion or affective state from voice.

SpeechBrain SER, Wav2Vec2-Emotion, Emo-CLAP
View implementations →
Audio → Audio

Voice Cloning

Replicate a speaker’s voice or convert one voice to another (speech-to-speech).

RVC, so-vits-svc, OpenVoice
View implementations →
Audio → Structured Data

Audio Watermark Detection

Detect or verify watermarks in synthetic or distributed audio.

Audiowmark, AudioSeal Detector, Stable Signature (beta)
View implementations →

From Video

5 blocks

From Document

3 blocks

Common Pipelines

Pre-built combinations of building blocks for common use cases.

Direct Visual Search

Embed images directly with CLIP/SigLIP, search by text or image query.

Good for:
  • Photo library search
  • E-commerce visual search
Pros:
  • Real-time indexing
  • Text-to-image search
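
The search step of this pipeline can be sketched as a nearest-neighbor lookup by cosine similarity. The 3-dimensional vectors below are hand-written stand-ins; in practice both the image vectors and the query vector would come from a CLIP or SigLIP encoder. File names and the `search` helper are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=2):
    """Return the top_k image names ranked by similarity to the query vector."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy index: image name -> embedding (stand-ins for real encoder output).
index = {
    "beach_sunset.jpg": [0.9, 0.1, 0.0],
    "birthday_cake.jpg": [0.0, 0.2, 0.9],
    "mountain_dawn.jpg": [0.7, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for the text embedding of a query
results = search(query_vec, index)
```

Because text and images share one embedding space, the same `search` call works for an image query: embed the query image instead of the text. For large collections a vector index would replace the linear scan.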

Caption + RAG Visual Search

Generate captions for images, embed captions, search via text RAG.

Good for:
  • Detailed scene search
  • Accessibility-first apps
Pros:
  • Human-readable index
  • Can describe complex scenes

Document RAG Pipeline

Extract text from documents, split it into chunks, embed the chunks, retrieve the most relevant ones for a query, and generate an answer with an LLM.

Good for:
  • Enterprise search
  • Legal document QA
Pros:
  • Grounds LLM in your data
  • Citable sources
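
A rough sketch of the chunk, retrieve, and prompt-assembly steps follows. Word-overlap scoring stands in for embedding similarity so the example runs without a model, and the chunk size, overlap, and prompt wording are assumptions rather than recommended settings. The LLM call itself is omitted.

```python
def chunk_words(text, size=8, overlap=2):
    """Split text into windows of `size` words, each sharing `overlap` words with the next."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def score(query, chunk):
    """Count shared lowercase words (a stand-in for embedding similarity)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query, chunks, top_k=2):
    """Return (chunk_id, chunk_text) pairs for the top_k best-scoring chunks."""
    ranked = sorted(enumerate(chunks), key=lambda ic: score(query, ic[1]), reverse=True)
    return ranked[:top_k]

def build_prompt(query, hits):
    """Assemble an LLM prompt whose sources are numbered so answers can cite them."""
    context = "\n".join(f"[{i}] {c}" for i, c in hits)
    return f"Answer using only the sources below and cite them by number.\n{context}\nQuestion: {query}"

document = (
    "The lease begins on the first of March. Rent is due monthly on the fifth day. "
    "Late payment incurs a fee of fifty dollars. The tenant may terminate with sixty days notice."
)
chunks = chunk_words(document)
hits = retrieve("late payment fee", chunks)
prompt = build_prompt("late payment fee", hits)
```

Keeping the chunk id next to each retrieved passage is what makes the sources citable: the generated answer can refer to `[2]` and the application can link back to the original text.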

Voice Assistant Pipeline

Transcribe speech to text, process it with an LLM, and respond with text-to-speech.

Good for:
  • Voice assistants
  • Call center bots
Pros:
  • Natural interaction
  • Hands-free
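
The three stages compose as plain functions, which keeps each block swappable. In the sketch below the `fake_*` functions are placeholders standing in for a real speech recognizer, LLM, and speech synthesizer; their names and behavior are invented for the example.

```python
from typing import Callable

def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """Run one assistant turn: audio in, audio out."""
    text = stt(audio)      # speech-to-text
    reply = llm(text)      # generate a response
    return tts(reply)      # text-to-speech

# Placeholder stages so the pipeline can be exercised without any models.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")

def fake_llm(text: str) -> str:
    return f"You said: {text}"

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

out = run_turn(b"turn on the lights", fake_stt, fake_llm, fake_tts)
```

Passing the stages as arguments means a hosted API and a local model can be exchanged without touching the pipeline code.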

Example: Text Search in Photo Database

You have thousands of photos and want to search them with text queries like "sunset at the beach" or "birthday party with cake". Here are your options:

FASTER

Direct CLIP Embedding

Embed images directly with CLIP/SigLIP. Text queries are embedded in the same space. Simple, real-time capable.

Best for: General visual concepts, fast indexing, product similarity

RICHER

Caption + Text RAG

Generate detailed captions with a VLM, then use text embedding for search. More descriptive, human-readable index.

Best for: Complex scene descriptions, debugging, accessibility requirements
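
To illustrate why a caption index is easy to inspect, here is a small sketch that searches captions and reports which words matched. The captions are hand-written stand-ins for VLM output, and keyword overlap with a tiny stopword list stands in for text-embedding search; all names are hypothetical.

```python
STOPWORDS = {"a", "the", "with", "on", "at", "of"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return set(text.lower().split()) - STOPWORDS

def search_captions(query, captions):
    """Return (image_name, matched_words) pairs, best match first."""
    q = tokenize(query)
    hits = []
    for name, caption in captions.items():
        matched = q & tokenize(caption)
        if matched:
            hits.append((len(matched), name, sorted(matched)))
    hits.sort(key=lambda h: h[0], reverse=True)
    return [(name, matched) for _, name, matched in hits]

# Stand-ins for captions a VLM might generate at indexing time.
captions = {
    "img_001.jpg": "a child blowing out candles on a birthday cake at a party",
    "img_002.jpg": "orange sunset over a sandy beach with gentle waves",
    "img_003.jpg": "two hikers resting beside a mountain lake",
}
results = search_captions("birthday party with cake", captions)
```

Each result carries the words that caused the match, so a surprising hit can be traced to the caption text, which is harder to do with raw image embeddings.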

Missing a building block? Have benchmark results to share?

Contribute Data