STT API Documentation

Real-time speech-to-text API powered by Cohere Transcribe (8x GPU). Supports WebSocket streaming, file upload, batch processing, speaker diarization, emotion detection, inverse text normalization, and speaker enrollment.


Endpoints Overview

Method	Path	Description
GET	`/`	Web UI with live mic transcription
GET	`/docs`	This documentation page
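GET	`/health`	Server status, GPU count, active streams
GET	`/health/deep`	Deep health check that runs a test transcription
GET	`/streams`	List all active WebSocket streams
GET	`/streams/{id}`	Stream detail with last 50 transcriptions
GET	`/client.py`	Download Python streaming client
POST	`/v1/audio/transcriptions`	Transcribe a single audio file
POST	`/v1/audio/transcriptions/batch`	Transcribe multiple files at once
WS	`/ws/transcribe`	Real-time streaming transcription
POST	`/speakers/enroll`	Enroll a speaker from raw PCM audio
POST	`/speakers/enroll/file`	Enroll a speaker from an audio file
GET	`/speakers`	List enrolled speakers
DELETE	`/speakers/{name}`	Remove an enrolled speaker
POST	`/speakers/identify`	Identify a speaker from an audio sample
POST	`/speakers/threshold`	Set the speaker recognition threshold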

1. File Transcription

POST /v1/audio/transcriptions

Upload an audio file and get a transcription back. The endpoint is OpenAI-compatible.

Parameter	Type	Default	Description
`file`	file	required	Audio file (wav, mp3, ogg, webm, mp4, flac, etc.)
`language`	string	`en`	Language: en, pl, fr, de, it, es, pt, el, nl, zh, ja, ko, vi, ar
`punctuation`	bool	`true`	Add punctuation to output
`response_format`	string	`json`	`json`, `verbose_json` (with segments), `text` (plain)
`min_confidence`	float	`0`	Confidence threshold (0-1). Results scoring below it are returned with empty text
`timestamps`	bool	`false`	Enable model-native timestamps
`diarize`	bool	`false`	Enable speaker diarization
`itn`	bool	`false`	Inverse text normalization ("twenty three" → "23")
`detect_emotion`	bool	`false`	Detect emotion: happy, sad, angry, neutral

Examples

# Basic transcription (Polish)
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
  -F "file=@recording.mp3" \
  -F "language=pl"

# With all features
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
  -F "file=@meeting.wav" \
  -F "language=en" \
  -F "diarize=true" \
  -F "detect_emotion=true" \
  -F "itn=true" \
  -F "timestamps=true" \
  -F "response_format=verbose_json"

# Plain text output
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
  -F "file=@speech.ogg" \
  -F "language=pl" \
  -F "response_format=text"

Response (json)

{
  "text": "Pani Przewodnicząca, jeszcze jedno pytanie.",
  "language": "pl",
  "duration": 5.2,
  "processing_time": 0.85,
  "rtf": 0.163,
  "confidence": 0.9234,
  "avg_logprob": -0.08,
  "words": [
    {"word": "Pani", "confidence": 0.95, "avg_logprob": -0.05, "start": 0.0, "end": 0.62},
    {"word": "Przewodnicząca,", "confidence": 0.89, "avg_logprob": -0.12, "start": 0.62, "end": 2.41},
    {"word": "jeszcze", "confidence": 0.94, "avg_logprob": -0.06, "start": 2.41, "end": 3.25},
    {"word": "jedno", "confidence": 0.91, "avg_logprob": -0.09, "start": 3.25, "end": 3.85},
    {"word": "pytanie.", "confidence": 0.93, "avg_logprob": -0.07, "start": 3.85, "end": 5.2}
  ],
  "emotion": "neutral",
  "speakers": [
    {"speaker": "spk0", "text": "Pani Przewodnicząca, jeszcze jedno pytanie."}
  ]
}

Note: emotion and speakers only appear when their respective features are enabled. words always includes per-word confidence and estimated timestamps.
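
A minimal Python sketch of consuming this response. It assumes the JSON shape shown above. The helper names `low_confidence_words` and `real_time_factor` are illustrative, and reading `rtf` as `processing_time / duration` is an inference from the sample values (0.85 / 5.2 ≈ 0.163), not something this page states.

```python
def low_confidence_words(response, threshold=0.9):
    """Return the words whose per-word confidence is below `threshold`."""
    return [w["word"] for w in response.get("words", [])
            if w["confidence"] < threshold]


def real_time_factor(response):
    """Assumed definition: processing time divided by audio duration."""
    return round(response["processing_time"] / response["duration"], 3)


# Trimmed copy of the sample response above (only the fields used here).
sample = {
    "duration": 5.2,
    "processing_time": 0.85,
    "words": [
        {"word": "Pani", "confidence": 0.95},
        {"word": "Przewodnicząca,", "confidence": 0.89},
        {"word": "jeszcze", "confidence": 0.94},
    ],
}

print(low_confidence_words(sample))
print(real_time_factor(sample))
```

Flagging low-confidence words this way can help decide which parts of a transcript deserve manual review.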

2. Batch Transcription

POST /v1/audio/transcriptions/batch

Upload multiple files at once. Same parameters as single transcription, but use files (plural) for multiple file uploads.

# Batch: 3 files
curl -X POST https://stt.mm.mk/v1/audio/transcriptions/batch \
  -F "files=@a.mp3" \
  -F "files=@b.wav" \
  -F "files=@c.ogg" \
  -F "language=pl" \
  -F "diarize=true"

Response

[
  {"filename": "a.mp3", "text": "...", "duration": 12.5, "confidence": 0.92, "words": [...], ...},
  {"filename": "b.wav", "text": "...", "duration": 8.3, "confidence": 0.88, "words": [...], ...},
  {"filename": "c.ogg", "error": "Could not decode audio"}
]
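
Because a batch response mixes successful and failed entries, a client will usually want to separate them. The sketch below assumes failed entries are exactly those carrying an `error` key, as in the sample above; `split_batch` is a hypothetical helper name.

```python
def split_batch(results):
    """Split a batch response into successful entries and a filename -> error map."""
    succeeded = [r for r in results if "error" not in r]
    failed = {r["filename"]: r["error"] for r in results if "error" in r}
    return succeeded, failed


# Shortened stand-in for the response shown above.
batch = [
    {"filename": "a.mp3", "text": "first", "duration": 12.5, "confidence": 0.92},
    {"filename": "b.wav", "text": "second", "duration": 8.3, "confidence": 0.88},
    {"filename": "c.ogg", "error": "Could not decode audio"},
]

ok, errors = split_batch(batch)
for entry in ok:
    print(entry["filename"], entry["text"])
for name, reason in errors.items():
    print(f"{name} failed: {reason}")
```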

3. WebSocket Streaming

WS /ws/transcribe

Real-time streaming transcription. The client sends raw PCM audio and the server returns progressive transcription updates.

Connection URL

wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=kitchen

Query Param	Default	Description
`language`	`en`	Language code
`rate`	`16000`	Sample rate of incoming audio
`stream_id`	`anon`	Identifier for this stream (e.g. room name)
`vad`	`0`	Set to `1` for server-side VAD (for dumb clients like ESP32)
`timestamps`	`0`	Enable timestamps
`diarize`	`0`	Enable speaker diarization
`itn`	`0`	Enable inverse text normalization
`detect_emotion`	`0`	Enable emotion detection
`partials`	`1`	Set to `0` to suppress interim `partial` messages — receive only `final` results (use with `vad=1` for live per-utterance finals)
`incremental`	`0`	Set to `1` for append-only word streaming: each newly-heard word is emitted immediately as a `word` message and never rewritten. Lower quality (no retroactive corrections) but a true live stream. Implies `partials=0`; `final` still marks segment end.
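
One way to assemble the connection URL from these parameters in Python. `build_stream_url` is an illustrative helper, not part of the downloadable client. It encodes boolean flags as `1`/`0` to match the table above.

```python
from urllib.parse import urlencode

BASE_URL = "wss://stt.mm.mk/ws/transcribe"


def build_stream_url(language="en", rate=16000, stream_id="anon", **flags):
    """Build a /ws/transcribe URL; boolean flags become 1 or 0."""
    params = {"language": language, "rate": rate, "stream_id": stream_id}
    for name, value in flags.items():
        params[name] = int(value) if isinstance(value, bool) else value
    return f"{BASE_URL}?{urlencode(params)}"


# Finals only, with server-side VAD, for a simple always-on microphone.
print(build_stream_url("pl", stream_id="kitchen", vad=True, partials=False))
```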

Protocol

Direction	Format	Content
Client → Server	Binary	Raw PCM int16 mono audio at specified sample rate
Client → Server	Binary (empty)	Empty buffer = "end of stream" signal
Server → Client	JSON text	Transcription updates (see below)
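
The client-to-server payload can be produced with the standard library alone. The sketch below converts float samples in the range -1 to 1 into little-endian int16 bytes, as the table above requires. The function names are illustrative, and clipping out-of-range samples is a choice made here rather than a documented requirement.

```python
import struct

END_OF_STREAM = b""  # an empty binary frame signals "end of stream"


def floats_to_pcm16(samples):
    """Convert floats in [-1, 1] to raw PCM int16 little-endian bytes."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_duration_ms(pcm_bytes, rate=16000):
    """Duration of a mono int16 chunk in milliseconds (2 bytes per sample)."""
    return len(pcm_bytes) // 2 * 1000 / rate


chunk = floats_to_pcm16([0.0] * 4096)
print(len(chunk), chunk_duration_ms(chunk))
```

At the default 16 kHz rate, a 4096-sample block is 8192 bytes and covers 256 ms of audio.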

Server Messages

// Partial — live update, text may change as more audio arrives
{"type": "partial", "text": "Pani Przewodnicząca", "duration": 2.5,
 "confidence": 0.85, "words": [...]}

// Final — segment complete, text is locked
{"type": "final", "text": "Pani Przewodnicząca, jeszcze jedno pytanie.",
 "duration": 5.2, "confidence": 0.92, "words": [...],
 "emotion": "neutral", "speakers": [...]}

Sliding window: Without server VAD, the server re-transcribes the growing audio buffer every 0.5s, sending partial updates. After ~10s it sends final and starts a new segment. With vad=1, the server detects speech segments automatically and sends final for each utterance.
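
Since partial text can be rewritten until a `final` arrives, a client typically keeps locked segments separate from the current partial. A minimal sketch of that bookkeeping follows. The `Transcript` class is hypothetical, and it silently ignores any message type other than `partial` and `final`.

```python
import json


class Transcript:
    """Accumulates locked final segments plus the latest (mutable) partial."""

    def __init__(self):
        self.finals = []
        self.partial = ""

    def handle(self, raw):
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "partial":
            # Replace, never append: partial text may change on each update.
            self.partial = msg.get("text", "")
        elif kind == "final":
            if msg.get("text"):
                self.finals.append(msg["text"])
            self.partial = ""  # a new segment starts after a final

    def text(self):
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)


t = Transcript()
t.handle('{"type": "partial", "text": "Pani Przewodnicząca"}')
t.handle('{"type": "final", "text": "Pani Przewodnicząca, jeszcze jedno pytanie."}')
print(t.text())
```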

Two Modes

Mode 1: Smart client (with client-side VAD) — client runs Silero VAD locally, only sends speech audio. Best for Raspberry Pi with Python.

Mode 2: Dumb client (with server-side VAD, vad=1) — client sends ALL audio continuously, server filters silence. Best for ESP32, microcontrollers, or simple scripts.

4. Client Examples

Python (Raspberry Pi / Linux)

# Download the full-featured client with VAD
curl -o stt_client.py https://stt.mm.mk/client.py

# Install dependencies
pip install websocket-client numpy sounddevice torch

# Run with defaults (Polish, local VAD)
python3 stt_client.py --stream-id kitchen --language pl

# All features enabled
python3 stt_client.py -s bedroom -l pl --emotion --diarize --itn --timestamps -v

# List microphone devices
python3 stt_client.py --list-devices

# Use specific mic device
python3 stt_client.py --device 2 -s office

Python (Minimal — no VAD, ~20 lines)

import json

import numpy as np
import sounddevice as sd
import websocket  # from the websocket-client package

ws = websocket.WebSocket()
ws.connect("wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=myroom&vad=1")

def callback(indata, frames, time, status):
    # Runs on the audio thread: float32 in [-1, 1] -> PCM int16
    pcm = (np.clip(indata[:, 0], -1.0, 1.0) * 32767).astype(np.int16)
    ws.send_binary(pcm.tobytes())

with sd.InputStream(samplerate=16000, channels=1, dtype='float32',
                    blocksize=4096, callback=callback):
    print("Streaming... Ctrl+C to stop")
    try:
        while True:
            msg = json.loads(ws.recv())
            # Messages without text (e.g. vad_status) are skipped
            if msg.get("text"):
                prefix = "FINAL" if msg["type"] == "final" else "..."
                print(f"[{prefix}] {msg['text']}")
    except KeyboardInterrupt:
        pass

ws.send_binary(b"")  # empty buffer = end of stream
ws.close()

ESP32 / Arduino (C pseudocode)

// Connect to WebSocket with server-side VAD
// wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=esp-kitchen&vad=1

// 1. Init I2S microphone at 16kHz mono
i2s_config_t cfg = { .sample_rate = 16000, .bits_per_sample = 16, .channel_format = MONO };

// 2. In loop: read I2S → send over WebSocket
int16_t buffer[512];
size_t bytes_read = 0;
while (true) {
    i2s_read(I2S_NUM_0, buffer, sizeof(buffer), &bytes_read, portMAX_DELAY);
    ws.sendBinary((uint8_t*)buffer, bytes_read);  // raw PCM int16

    // 3. Check for incoming text messages
    if (ws.available()) {
        String msg = ws.readString();
        // Parse JSON: {"type":"final","text":"...","confidence":0.92}
        // Display on OLED, send to MQTT, trigger automation, etc.
    }
}

curl (one-shot file)

# Simple
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
  -F "file=@audio.mp3" -F "language=pl"

# Full featured
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
  -F "file=@meeting.wav" -F "language=en" \
  -F "diarize=true" -F "detect_emotion=true" -F "itn=true" \
  -F "response_format=verbose_json"

Node.js WebSocket

const WebSocket = require('ws');
const ws = new WebSocket('wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=node-client&vad=1');

ws.on('message', (data) => {
  const msg = JSON.parse(data);
  if (msg.text) console.log(`[${msg.type}] ${msg.text} (${(msg.confidence*100).toFixed(0)}%)`);
});

// Send PCM int16 audio buffers:
// ws.send(pcmBuffer);  // Buffer of int16 samples at 16kHz mono

5. Monitoring

GET /health

Basic health check: model loaded, GPU count, active streams.

curl https://stt.mm.mk/health
{
  "status": "ok",
  "model": "CohereLabs/cohere-transcribe-03-2026",
  "gpu_count": 8,
  "languages": ["en","fr","de","it","es","pt","el","nl","pl","zh","ja","ko","vi","ar"],
  "active_streams": 3
}

GET /health/deep

Deep health check: runs a 1-second transcription test to verify the full pipeline (model, processor, VAD). Returns 503 if any check fails.

curl https://stt.mm.mk/health/deep

GET /streams

curl https://stt.mm.mk/streams
{
  "kitchen": {
    "connected_since": 3600,
    "language": "pl",
    "features": {"timestamps": false, "diarize": false, "itn": false, "emotion": true, "server_vad": true},
    "last_text": "podaj mi sól",
    "last_text_ago": 12.5,
    "total_segments": 47,
    "total_audio_sec": 285.3
  },
  "office": { ... }
}
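
A monitoring script might use this payload to spot streams that are connected but have gone quiet. The sketch below assumes `last_text_ago` is in seconds and treats a missing value as "never spoke". `quiet_streams` and the 60-second cutoff are illustrative choices, and the `office` entry is made-up sample data.

```python
def quiet_streams(streams, max_silence_sec=60.0):
    """Return the sorted IDs of streams with no text for more than `max_silence_sec`."""
    return sorted(
        stream_id
        for stream_id, info in streams.items()
        if info.get("last_text_ago", float("inf")) > max_silence_sec
    )


streams = {
    "kitchen": {"language": "pl", "last_text_ago": 12.5, "total_segments": 47},
    "office": {"language": "en", "last_text_ago": 300.0, "total_segments": 3},
}

print(quiet_streams(streams))
```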

GET /streams/{stream_id}

curl https://stt.mm.mk/streams/kitchen
{
  "stream_id": "kitchen",
  "connected_since": 3600,
  "total_segments": 47,
  "total_audio_sec": 285.3,
  "transcriptions": [
    {"text": "podaj mi sól", "duration": 2.1, "confidence": 0.94, "emotion": "neutral", "ago": 12.5},
    {"text": "dziękuję", "duration": 1.3, "confidence": 0.97, "emotion": "happy", "ago": 45.2},
    ...
  ]
}

6. Features Reference

Feature	API param	WS param	Description
Timestamps	`timestamps=true`	`timestamps=1`	Model-native timestamps in output
Diarization	`diarize=true`	`diarize=1`	Speaker diarization via pyannote community-1 (runs on separate GPU). Applied on finalized segments only.
ITN	`itn=true`	`itn=1`	"twenty three" → "23"
Emotion	`detect_emotion=true`	`detect_emotion=1`	happy, sad, angry, neutral
Server VAD	N/A	`vad=1`	Server-side voice activity detection (Silero VAD)
Confidence filter	`min_confidence=0.3`	N/A	Filter low-confidence hallucinations
Per-word confidence	always on	always on	`words[].confidence` (0-1)

Server-side VAD (Voice Activity Detection)

When vad=1 is set, the server runs Silero VAD on incoming audio. Only speech segments are passed to the transcription model — silence is discarded. This is ideal for always-on microphones (ESP32, Raspberry Pi) where the client can't run VAD locally.

WS Param	Default	Range	Description
`vad=1`	`0`	0 or 1	Enable server-side VAD
`vad_threshold`	`0.3`	0.1 — 0.9	Speech probability threshold. Lower = catches quieter speech. Try 0.15 for soft voice, 0.5 for noisy environments.
`vad_pad_ms`	`400`	100 — 2000	Milliseconds of silence to keep after speech ends. Higher = fewer split segments.
`vad_min_ms`	`100`	50 — 1000	Minimum speech duration to trigger transcription. Filters out clicks/pops.
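
The ranges in this table can be enforced on the client before connecting. This page does not say how the server treats out-of-range values, so the sketch below simply clamps them. `vad_query` is a hypothetical helper.

```python
from urllib.parse import urlencode

# (min, max) per parameter, copied from the table above.
VAD_RANGES = {
    "vad_threshold": (0.1, 0.9),
    "vad_pad_ms": (100, 2000),
    "vad_min_ms": (50, 1000),
}


def vad_query(**settings):
    """Return a query-string fragment enabling server VAD with clamped settings."""
    params = {"vad": 1}
    for name, value in settings.items():
        low, high = VAD_RANGES[name]  # KeyError for unknown parameter names
        params[name] = min(max(value, low), high)
    return urlencode(params)


print(vad_query(vad_threshold=0.15, vad_pad_ms=600, vad_min_ms=80))
```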

The server also sends real-time VAD status via vad_status messages (see below), showing current speech probability and state.

VAD Status Messages (WebSocket)

When server VAD is enabled, the server periodically sends status updates so the client can display a live VAD meter:

// Sent every ~250ms when VAD is active
{"type": "vad_status", "speech_prob": 0.87, "is_speech": true, "buffered_ms": 1250}
{"type": "vad_status", "speech_prob": 0.02, "is_speech": false, "buffered_ms": 0}

Field	Type	Description
`speech_prob`	float 0-1	Current speech probability from Silero VAD
`is_speech`	bool	Whether VAD considers this as speech (prob >= threshold)
`buffered_ms`	int	Milliseconds of speech audio currently buffered
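
A live VAD meter can be drawn directly from these three fields. The text rendering below is only one possible presentation, and `vad_meter` is an illustrative name.

```python
def vad_meter(msg, width=20):
    """Render a vad_status message as a fixed-width text bar."""
    filled = round(msg["speech_prob"] * width)
    bar = "#" * filled + "-" * (width - filled)
    state = "speech" if msg["is_speech"] else "silence"
    return f"[{bar}] {state} ({msg['buffered_ms']} ms buffered)"


print(vad_meter({"type": "vad_status", "speech_prob": 0.87,
                 "is_speech": True, "buffered_ms": 1250}))
print(vad_meter({"type": "vad_status", "speech_prob": 0.02,
                 "is_speech": False, "buffered_ms": 0}))
```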

Example: Quiet Voice Setup

wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=bedroom&vad=1&vad_threshold=0.15&vad_pad_ms=600&vad_min_ms=80

Hallucination Filtering

The server automatically filters known hallucination phrases ("Thank you.", "Thanks for watching.", etc.) that Whisper-family models tend to produce on silence or noise. It also filters text that is suspiciously short for long audio (fewer than 1 word per 5 s). Use min_confidence to set an explicit threshold.
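
To make the three rules concrete, here is a rough client-side approximation in Python. It is not the server's implementation: the phrase list contains only the two examples quoted above, the word-rate rule is applied to audio of any length, and `looks_like_hallucination` is a hypothetical name.

```python
KNOWN_PHRASES = {"thank you.", "thanks for watching."}  # examples from this page only


def looks_like_hallucination(text, duration_sec, confidence, min_confidence=0.0):
    """Approximate the documented filters: phrase list, word rate, confidence."""
    stripped = text.strip()
    if stripped.lower() in KNOWN_PHRASES:
        return True
    # Fewer than 1 word per 5 seconds of audio.
    if len(stripped.split()) < duration_sec / 5:
        return True
    return confidence < min_confidence


print(looks_like_hallucination("Thank you.", 2.0, 0.9))
print(looks_like_hallucination("podaj mi sól", 2.1, 0.94, min_confidence=0.3))
```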

Audio Format

Property	File upload	WebSocket streaming
Format	Any (wav, mp3, ogg, webm, mp4, flac, ...)	Raw PCM int16 little-endian
Sample rate	Any (auto-resampled to 16kHz)	Specified via `rate` param (default: 16000)
Channels	Any (auto-mixed to mono)	Mono only
Max size	~500MB	Unlimited (streaming)

Supported Languages

en English, fr French, de German, it Italian, es Spanish, pt Portuguese, el Greek, nl Dutch, pl Polish, zh Chinese, ja Japanese, ko Korean, vi Vietnamese, ar Arabic

7. Architecture Notes

Model: Cohere Transcribe 03-2026, 8x GPU with pipeline parallelism
Throughput: ~0.5s inference per 3s audio → ~6 concurrent real-time streams
Inference: Runs in asyncio.to_thread() — WS event loop is never blocked
VAD: Silero VAD on CPU — lightweight, doesn't compete with GPU inference
Sliding window: Without VAD, re-transcribes every 0.5s with growing audio (ElevenLabs-style progressive refinement). Finalizes after ~10s.
With VAD: Transcribes only detected speech segments. Better for always-on microphones.
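
The throughput figure follows from simple arithmetic, sketched below. It assumes inference calls are serialized and that each stream needs one 0.5 s inference per 3 s of audio. Real capacity depends on batching and GPU scheduling, which this page does not describe.

```python
def max_realtime_streams(audio_sec=3.0, inference_sec=0.5):
    """How many streams fit if each needs `inference_sec` of compute per `audio_sec`."""
    return int(audio_sec // inference_sec)


def real_time_factor(audio_sec=3.0, inference_sec=0.5):
    """Compute time per second of audio (lower is faster)."""
    return inference_sec / audio_sec


print(max_realtime_streams())
print(round(real_time_factor(), 3))
```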

8. Speaker Enrollment & Recognition

Enroll speaker voiceprints to identify who is speaking during diarization. Uses speechbrain/spkrec-ecapa-voxceleb embeddings.

POST /speakers/enroll

Enroll a speaker from raw PCM int16 mono audio (3+ seconds). Multiple enrollments improve accuracy.

# Record 5s and enroll
ffmpeg -f alsa -i default -t 5 -ar 16000 -ac 1 -f s16le - | \
  curl -X POST "https://stt.mm.mk/speakers/enroll?name=Pawel&rate=16000" \
    -H "Content-Type: application/octet-stream" --data-binary @-

# Response:
{"status": "enrolled", "name": "Pawel", "duration": 5.0, "samples": 1}
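
The same enrollment call can be prepared from Python with the standard library. The sketch below only builds the request and checks the 3-second minimum locally. `build_enroll_request` is a hypothetical helper, and sending the request with `urllib.request.urlopen(req)` is left to the caller.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_enroll_request(name, pcm_bytes, rate=16000, base="https://stt.mm.mk"):
    """Build a POST /speakers/enroll request for raw PCM int16 mono audio."""
    seconds = len(pcm_bytes) / 2 / rate  # 2 bytes per int16 sample
    if seconds < 3:
        raise ValueError(f"need at least 3 s of audio, got {seconds:.1f} s")
    query = urlencode({"name": name, "rate": rate})
    return Request(
        f"{base}/speakers/enroll?{query}",
        data=pcm_bytes,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )


silence = b"\x00\x00" * 16000 * 5  # 5 s placeholder; use real speech in practice
req = build_enroll_request("Pawel", silence)
print(req.get_method(), req.full_url)
```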

POST /speakers/enroll/file

Enroll from an audio file (WAV, MP3, etc.).

curl -X POST "https://stt.mm.mk/speakers/enroll/file" \
  -F "name=Pawel" -F "file=@voice_sample.wav"

GET /speakers

List all enrolled speakers.

curl https://stt.mm.mk/speakers
{"speakers": [{"name": "Pawel", "samples": 3, "duration": 15.0}], "total": 1}

DELETE /speakers/{name}

Remove an enrolled speaker.

curl -X DELETE https://stt.mm.mk/speakers/Pawel

POST /speakers/identify

Identify a speaker from an audio sample (without diarization).

curl -X POST "https://stt.mm.mk/speakers/identify?rate=16000" \
  -H "Content-Type: application/octet-stream" --data-binary @audio.raw

{"identified": "Pawel", "confidence": 0.87, "threshold": 0.55, "all_scores": {"Pawel": 0.87}}

How it works

Enrollment: Upload 3+ seconds of clear single-speaker audio. Multiple samples are averaged for better accuracy.
Recognition: When diarization is enabled, each speaker segment is compared against enrolled voiceprints using cosine similarity.
Threshold: Default 0.55. Speakers below threshold are shown as SPEAKER_XX. Adjust via POST /speakers/threshold?threshold=0.6.
In WebSocket: When diarization is enabled and speakers are enrolled, identified_as field appears in speaker segments.
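
The matching step described above can be illustrated with plain Python. This is a sketch of averaging plus cosine similarity against the 0.55 default threshold, not the server's code. The two-dimensional vectors stand in for real speaker embeddings, and all function names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def average(vectors):
    """Element-wise mean of several enrollment embeddings."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def identify(embedding, voiceprints, threshold=0.55):
    """Return (best_name or None, all_scores) for one speaker segment."""
    scores = {name: cosine(embedding, vp) for name, vp in voiceprints.items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] < threshold:
        return None, scores
    return best, scores


voiceprints = {
    "Pawel": average([[1.0, 0.0], [0.8, 0.2]]),  # two enrollments, averaged
    "Anna": [0.0, 1.0],
}
name, scores = identify([1.0, 0.1], voiceprints)
print(name, scores)
```

A segment whose best score stays below the threshold returns `None`, which corresponds to the unnamed SPEAKER_XX case.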
