STT API Documentation

Real-time speech-to-text API powered by Cohere Transcribe (8x GPU). Supports streaming WebSocket, file upload, batch processing, speaker diarization, emotion detection, and more.


Endpoints Overview

| Method | Path | Description |
| --- | --- | --- |
| GET | / | Web UI with live mic transcription |
| GET | /docs | This documentation page |
| GET | /health | Server status, GPU count, active streams |
| GET | /streams | List all active WebSocket streams |
| GET | /streams/{id} | Stream detail with last 50 transcriptions |
| GET | /client.py | Download Python streaming client |
| POST | /v1/audio/transcriptions | Transcribe a single audio file |
| POST | /v1/audio/transcriptions/batch | Transcribe multiple files at once |
| WS | /ws/transcribe | Real-time streaming transcription |

1. File Transcription

POST /v1/audio/transcriptions

Upload an audio file and receive its transcription. The endpoint is OpenAI-compatible.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | required | Audio file (wav, mp3, ogg, webm, mp4, flac, etc.) |
| language | string | en | Language: en, pl, fr, de, it, es, pt, el, nl, zh, ja, ko, vi, ar |
| punctuation | bool | true | Add punctuation to output |
| response_format | string | json | json, verbose_json (with segments), text (plain) |
| min_confidence | float | 0 | Confidence threshold, 0-1. Results below it return empty text |
| timestamps | bool | false | Enable model-native timestamps |
| diarize | bool | false | Enable speaker diarization |
| itn | bool | false | Inverse text normalization ("twenty three" → "23") |
| detect_emotion | bool | false | Detect emotion: happy, sad, angry, neutral |

Examples

# Basic transcription (Polish)
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
  -F "file=@recording.mp3" \
  -F "language=pl"

# With all features
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
  -F "file=@meeting.wav" \
  -F "language=en" \
  -F "diarize=true" \
  -F "detect_emotion=true" \
  -F "itn=true" \
  -F "timestamps=true" \
  -F "response_format=verbose_json"

# Plain text output
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
  -F "file=@speech.ogg" \
  -F "language=pl" \
  -F "response_format=text"
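
The same upload can be made from Python. The following is a minimal sketch, assuming the third-party requests package is installed. The helper names form_fields and transcribe_file are hypothetical, and the "true"/"false" encoding of booleans simply mirrors the curl examples above.

```python
from typing import Any, Dict

API_URL = "https://stt.mm.mk/v1/audio/transcriptions"


def form_fields(**options: Any) -> Dict[str, str]:
    """Turn Python values into form strings; bools become 'true' / 'false'."""
    fields = {}
    for name, value in options.items():
        if isinstance(value, bool):
            fields[name] = "true" if value else "false"
        else:
            fields[name] = str(value)
    return fields


def transcribe_file(path: str, **options: Any) -> dict:
    """POST one audio file and return the parsed JSON body.

    Only meaningful for response_format json or verbose_json;
    response_format=text returns plain text instead of JSON.
    """
    import requests  # third-party (assumption); imported lazily

    with open(path, "rb") as audio:
        response = requests.post(
            API_URL,
            files={"file": audio},
            data=form_fields(**options),
            timeout=300,
        )
    response.raise_for_status()
    return response.json()


# Usage (performs a real network request):
# result = transcribe_file("recording.mp3", language="pl", diarize=True)
# print(result["text"])
```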

Response (json)

{
  "text": "Pani Przewodnicząca, jeszcze jedno pytanie.",
  "language": "pl",
  "duration": 5.2,
  "processing_time": 0.85,
  "rtf": 0.163,
  "confidence": 0.9234,
  "avg_logprob": -0.08,
  "words": [
    {"word": "Pani", "confidence": 0.95, "avg_logprob": -0.05, "start": 0.0, "end": 0.62},
    {"word": "Przewodnicząca,", "confidence": 0.89, "avg_logprob": -0.12, "start": 0.62, "end": 2.41},
    {"word": "jeszcze", "confidence": 0.94, "avg_logprob": -0.06, "start": 2.41, "end": 3.25},
    {"word": "jedno", "confidence": 0.91, "avg_logprob": -0.09, "start": 3.25, "end": 3.85},
    {"word": "pytanie.", "confidence": 0.93, "avg_logprob": -0.07, "start": 3.85, "end": 5.2}
  ],
  "emotion": "neutral",
  "speakers": [
    {"speaker": "spk0", "text": "Pani Przewodnicząca, jeszcze jedno pytanie."}
  ]
}

Note: emotion and speakers only appear when their respective features are enabled. words always includes per-word confidence and estimated timestamps.
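
Per-word confidence makes it easy to flag uncertain words on the client. The sketch below works on a trimmed copy of the sample response. The function names and the 0.9 threshold are illustrative choices, and the reading of rtf as processing_time divided by duration is an assumption that happens to match the sample numbers.

```python
def low_confidence_words(response: dict, threshold: float = 0.9) -> list:
    """Return (word, confidence) pairs whose confidence is below threshold."""
    return [
        (w["word"], w["confidence"])
        for w in response.get("words", [])
        if w["confidence"] < threshold
    ]


def real_time_factor(response: dict) -> float:
    """Assumed definition: processing time divided by audio duration."""
    return round(response["processing_time"] / response["duration"], 3)


sample = {
    "duration": 5.2,
    "processing_time": 0.85,
    "words": [
        {"word": "Pani", "confidence": 0.95},
        {"word": "Przewodnicząca,", "confidence": 0.89},
        {"word": "jeszcze", "confidence": 0.94},
        {"word": "jedno", "confidence": 0.91},
        {"word": "pytanie.", "confidence": 0.93},
    ],
}

print(low_confidence_words(sample))  # [('Przewodnicząca,', 0.89)]
print(real_time_factor(sample))      # 0.163
```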

2. Batch Transcription

POST /v1/audio/transcriptions/batch

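Upload multiple files in one request. The endpoint accepts the same parameters as single-file transcription, but the audio goes in a repeated files (plural) field. The response is a JSON array with one entry per file; a file that cannot be processed gets an error entry, as c.ogg does in the example response below.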

# Batch: 3 files
curl -X POST https://stt.mm.mk/v1/audio/transcriptions/batch \
  -F "files=@a.mp3" \
  -F "files=@b.wav" \
  -F "files=@c.ogg" \
  -F "language=pl" \
  -F "diarize=true"

Response

[
  {"filename": "a.mp3", "text": "...", "duration": 12.5, "confidence": 0.92, "words": [...], ...},
  {"filename": "b.wav", "text": "...", "duration": 8.3, "confidence": 0.88, "words": [...], ...},
  {"filename": "c.ogg", "error": "Could not decode audio"}
]
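
Because a batch can partly fail, clients usually need to separate good results from errors. A small sketch, assuming failed entries are the ones carrying an "error" key as in the example above (split_batch is a hypothetical helper name):

```python
def split_batch(results: list) -> tuple:
    """Split a batch response into (successful, failed) lists of entries."""
    ok = [r for r in results if "error" not in r]
    failed = [r for r in results if "error" in r]
    return ok, failed


batch = [
    {"filename": "a.mp3", "text": "first", "duration": 12.5, "confidence": 0.92},
    {"filename": "b.wav", "text": "second", "duration": 8.3, "confidence": 0.88},
    {"filename": "c.ogg", "error": "Could not decode audio"},
]

ok, failed = split_batch(batch)
for entry in failed:
    print(f"{entry['filename']}: {entry['error']}")
```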

3. WebSocket Streaming

WS /ws/transcribe

Real-time streaming transcription. The client sends raw PCM audio, and the server returns progressive transcription updates.

Connection URL

wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=kitchen

| Query param | Default | Description |
| --- | --- | --- |
| language | en | Language code |
| rate | 16000 | Sample rate of incoming audio |
| stream_id | anon | Identifier for this stream (e.g. room name) |
| vad | 0 | Set to 1 for server-side VAD (for dumb clients like ESP32) |
| timestamps | 0 | Enable timestamps |
| diarize | 0 | Enable speaker diarization |
| itn | 0 | Enable inverse text normalization |
| detect_emotion | 0 | Enable emotion detection |
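
Building the connection URL by hand gets error-prone once several flags are involved. The sketch below assembles it with the standard library; build_stream_url is a hypothetical helper, and leaving out flags that are False relies on the documented default of 0.

```python
from urllib.parse import urlencode

BASE_URL = "wss://stt.mm.mk/ws/transcribe"


def build_stream_url(stream_id: str = "anon", language: str = "en",
                     rate: int = 16000, **flags) -> str:
    """Build the WebSocket URL; boolean flags are sent as 1 or left out."""
    params = {"language": language, "rate": rate, "stream_id": stream_id}
    for name, value in flags.items():
        if isinstance(value, bool):
            if value:
                params[name] = 1  # enabled feature flag
        else:
            params[name] = value  # numeric tuning value
    return f"{BASE_URL}?{urlencode(params)}"


print(build_stream_url("kitchen", language="pl"))
print(build_stream_url("esp-kitchen", language="pl", vad=True, diarize=False))
```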

Protocol

| Direction | Format | Content |
| --- | --- | --- |
| Client → Server | Binary | Raw PCM int16 mono audio at specified sample rate |
| Client → Server | Binary (empty) | Empty buffer = "end of stream" signal |
| Server → Client | JSON text | Transcription updates (see below) |
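
Most audio libraries deliver float samples, so the client has to convert them to the wire format first. A dependency-free sketch of that conversion follows, using little-endian int16 as listed in the Audio Format table. The name float_to_pcm16 is hypothetical, and clipping out-of-range input is a choice made here, not something the API requires.

```python
import struct

END_OF_STREAM = b""  # an empty binary frame signals the end of the stream


def float_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to raw little-endian int16 bytes."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))      # clip to avoid int16 overflow
        ints.append(int(round(s * 32767)))
    return struct.pack(f"<{len(ints)}h", *ints)


frame = float_to_pcm16([0.0, 1.0, -1.0, 2.5])
print(len(frame))  # 8 bytes: four samples, two bytes each
```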

Server Messages

// Partial — live update, text may change as more audio arrives
{"type": "partial", "text": "Pani Przewodnicząca", "duration": 2.5,
 "confidence": 0.85, "words": [...]}

// Final — segment complete, text is locked
{"type": "final", "text": "Pani Przewodnicząca, jeszcze jedno pytanie.",
 "duration": 5.2, "confidence": 0.92, "words": [...],
 "emotion": "neutral", "speakers": [...]}

Sliding window: Without server VAD, the server re-transcribes the growing audio buffer every 0.5s and sends partial updates. After ~10s it sends a final message and starts a new segment. With vad=1, the server detects speech segments automatically and sends a final message for each utterance.
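
On the client side, this means a partial message replaces the text shown for the current segment, while a final message commits it. A minimal sketch of that bookkeeping (TranscriptAssembler is a hypothetical class name; message types other than partial and final are ignored):

```python
import json


class TranscriptAssembler:
    """Keeps committed segments plus the latest partial text."""

    def __init__(self):
        self.segments = []  # texts of finalized segments
        self.pending = ""   # latest partial text for the open segment

    def feed(self, raw: str) -> None:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "partial":
            self.pending = msg.get("text", "")  # overwrite, do not append
        elif kind == "final":
            text = msg.get("text", "")
            if text:
                self.segments.append(text)
            self.pending = ""
        # anything else (for example vad_status) is not transcript text

    def text(self) -> str:
        parts = self.segments + ([self.pending] if self.pending else [])
        return " ".join(parts)


assembler = TranscriptAssembler()
assembler.feed('{"type": "partial", "text": "Pani"}')
assembler.feed('{"type": "final", "text": "Pani Przewodnicząca."}')
assembler.feed('{"type": "partial", "text": "jeszcze"}')
print(assembler.text())  # Pani Przewodnicząca. jeszcze
```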

Two Modes

Mode 1: Smart client (client-side VAD). The client runs Silero VAD locally and sends only speech audio. Best for a Raspberry Pi running Python.

Mode 2: Dumb client (server-side VAD, vad=1). The client sends ALL audio continuously and the server filters out silence. Best for ESP32, microcontrollers, or simple scripts.


4. Client Examples

Python (Raspberry Pi / Linux)

# Download the full-featured client with VAD
curl -o stt_client.py https://stt.mm.mk/client.py

# Install dependencies
pip install websocket-client numpy sounddevice torch

# Run with defaults (Polish, local VAD)
python3 stt_client.py --stream-id kitchen --language pl

# All features enabled
python3 stt_client.py -s bedroom -l pl --emotion --diarize --itn --timestamps -v

# List microphone devices
python3 stt_client.py --list-devices

# Use specific mic device
python3 stt_client.py --device 2 -s office

Python (Minimal — no VAD, ~20 lines)

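import json

import numpy as np
import sounddevice as sd
import websocket  # pip install websocket-client

ws = websocket.WebSocket()
ws.connect("wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=myroom&vad=1")

def callback(indata, frames, time, status):
    # Runs on the audio thread: convert float32 [-1, 1] to int16 PCM and send it
    pcm = (np.clip(indata[:, 0], -1.0, 1.0) * 32767).astype(np.int16)
    ws.send_binary(pcm.tobytes())

with sd.InputStream(samplerate=16000, channels=1, dtype='float32',
                    blocksize=4096, callback=callback):
    print("Streaming... Ctrl+C to stop")
    try:
        while True:
            msg = json.loads(ws.recv())
            if msg.get("text"):  # skips vad_status messages and empty results
                prefix = "FINAL" if msg["type"] == "final" else "..."
                print(f"[{prefix}] {msg['text']}")
    except KeyboardInterrupt:
        pass

# The microphone is closed at this point; signal end of stream and disconnect
ws.send_binary(b"")
ws.close()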

ESP32 / Arduino (C pseudocode)

// Connect to WebSocket with server-side VAD
// wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=esp-kitchen&vad=1
// "ws" stands for whichever WebSocket client library you use (pseudocode)

// 1. Init I2S microphone at 16kHz, 16-bit, mono (other config fields omitted)
i2s_config_t cfg = {
    .sample_rate = 16000,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
};

// 2. In loop: read I2S → send over WebSocket
int16_t buffer[512];
size_t bytes_read = 0;
while (true) {
    i2s_read(I2S_NUM_0, buffer, sizeof(buffer), &bytes_read, portMAX_DELAY);
    if (bytes_read > 0) {
        // Never send an empty frame here: it means "end of stream"
        ws.sendBinary((uint8_t*)buffer, bytes_read);  // raw PCM int16
    }

    // 3. Check for incoming text messages
    if (ws.available()) {
        String msg = ws.readString();
        // Parse JSON: {"type":"final","text":"...","confidence":0.92}
        // Display on OLED, send to MQTT, trigger automation, etc.
    }
}

curl (one-shot file)

# Simple
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
  -F "file=@audio.mp3" -F "language=pl"

# Full featured
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
  -F "file=@meeting.wav" -F "language=en" \
  -F "diarize=true" -F "detect_emotion=true" -F "itn=true" \
  -F "response_format=verbose_json"

Node.js WebSocket

const WebSocket = require('ws');
const ws = new WebSocket('wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=node-client&vad=1');

ws.on('message', (data) => {
  const msg = JSON.parse(data);
  if (msg.text) console.log(`[${msg.type}] ${msg.text} (${(msg.confidence*100).toFixed(0)}%)`);
});

// Send PCM int16 audio buffers:
// ws.send(pcmBuffer);  // Buffer of int16 samples at 16kHz mono

5. Monitoring

GET /health

Basic health check: model loaded, GPU count, active streams.

curl https://stt.mm.mk/health
{
  "status": "ok",
  "model": "CohereLabs/cohere-transcribe-03-2026",
  "gpu_count": 8,
  "languages": ["en","fr","de","it","es","pt","el","nl","pl","zh","ja","ko","vi","ar"],
  "active_streams": 3
}

GET /health/deep

Deep health check: runs a 1-second transcription test to verify the full pipeline (model, processor, VAD). Returns 503 if any check fails.

curl https://stt.mm.mk/health/deep

GET /streams

curl https://stt.mm.mk/streams
{
  "kitchen": {
    "connected_since": 3600,
    "language": "pl",
    "features": {"timestamps": false, "diarize": false, "itn": false, "emotion": true, "server_vad": true},
    "last_text": "podaj mi sól",
    "last_text_ago": 12.5,
    "total_segments": 47,
    "total_audio_sec": 285.3
  },
  "office": { ... }
}
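
A monitoring script can use this response to spot streams that are connected but have gone quiet. The sketch below assumes last_text_ago is the number of seconds since the last transcription; the field's unit is not stated above. The helper name idle_streams and the 300-second limit are arbitrary.

```python
def idle_streams(streams: dict, max_silence_sec: float = 300.0) -> list:
    """Return the IDs of streams whose last text is older than the limit."""
    return sorted(
        stream_id
        for stream_id, info in streams.items()
        if info.get("last_text_ago", 0.0) > max_silence_sec
    )


snapshot = {
    "kitchen": {"language": "pl", "last_text_ago": 12.5, "total_segments": 47},
    "office": {"language": "en", "last_text_ago": 1800.0, "total_segments": 3},
}

print(idle_streams(snapshot))  # ['office']
```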

GET /streams/{stream_id}

curl https://stt.mm.mk/streams/kitchen
{
  "stream_id": "kitchen",
  "connected_since": 3600,
  "total_segments": 47,
  "total_audio_sec": 285.3,
  "transcriptions": [
    {"text": "podaj mi sól", "duration": 2.1, "confidence": 0.94, "emotion": "neutral", "ago": 12.5},
    {"text": "dziękuję", "duration": 1.3, "confidence": 0.97, "emotion": "happy", "ago": 45.2},
    ...
  ]
}
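
The per-stream history lends itself to quick summaries. A sketch that computes the average confidence and counts emotions over the returned transcriptions (summarize_stream is a hypothetical helper; entries without an emotion field are skipped in the count):

```python
from collections import Counter


def summarize_stream(detail: dict) -> dict:
    """Summarize the transcriptions list of a /streams/{stream_id} response."""
    items = detail.get("transcriptions", [])
    emotions = Counter(t["emotion"] for t in items if "emotion" in t)
    average = sum(t["confidence"] for t in items) / len(items) if items else 0.0
    return {
        "count": len(items),
        "avg_confidence": round(average, 3),
        "emotions": dict(emotions),
    }


detail = {
    "stream_id": "kitchen",
    "transcriptions": [
        {"text": "podaj mi sól", "confidence": 0.94, "emotion": "neutral"},
        {"text": "dziękuję", "confidence": 0.97, "emotion": "happy"},
    ],
}

print(summarize_stream(detail))
```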

6. Features Reference

| Feature | API param | WS param | Description |
| --- | --- | --- | --- |
| Timestamps | timestamps=true | timestamps=1 | Model-native timestamps in output |
| Diarization | diarize=true | diarize=1 | Speaker diarization via pyannote community-1 (runs on separate GPU). Applied on finalized segments only. |
| ITN | itn=true | itn=1 | "twenty three" → "23" |
| Emotion | detect_emotion=true | detect_emotion=1 | happy, sad, angry, neutral |
| Server VAD | N/A | vad=1 | Server-side voice activity detection (Silero VAD) |
| Confidence filter | min_confidence=0.3 | N/A | Filter low-confidence hallucinations |
| Per-word confidence | always on | always on | words[].confidence (0-1) |

Server-side VAD (Voice Activity Detection)

When vad=1 is set, the server runs Silero VAD on incoming audio. Only speech segments are passed to the transcription model — silence is discarded. This is ideal for always-on microphones (ESP32, Raspberry Pi) where the client can't run VAD locally.

| WS param | Default | Range | Description |
| --- | --- | --- | --- |
| vad | 0 | 0 or 1 | Enable server-side VAD (vad=1) |
| vad_threshold | 0.3 | 0.1 to 0.9 | Speech probability threshold. Lower = catches quieter speech. Try 0.15 for soft voice, 0.5 for noisy environments. |
| vad_pad_ms | 400 | 100 to 2000 | Milliseconds of silence to keep after speech ends. Higher = fewer split segments. |
| vad_min_ms | 100 | 50 to 1000 | Minimum speech duration to trigger transcription. Filters out clicks/pops. |

The server also sends real-time VAD status via vad_status messages (see below), showing current speech probability and state.

VAD Status Messages (WebSocket)

When server VAD is enabled, the server periodically sends status updates so the client can display a live VAD meter:

// Sent every ~250ms when VAD is active
{"type": "vad_status", "speech_prob": 0.87, "is_speech": true, "buffered_ms": 1250}
{"type": "vad_status", "speech_prob": 0.02, "is_speech": false, "buffered_ms": 0}

| Field | Type | Description |
| --- | --- | --- |
| speech_prob | float, 0-1 | Current speech probability from Silero VAD |
| is_speech | bool | Whether VAD considers this as speech (prob >= threshold) |
| buffered_ms | int | Milliseconds of speech audio currently buffered |
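
A text-mode VAD meter takes only a few lines. The sketch below turns one vad_status message into a bar; vad_meter is a hypothetical helper, and the bar width and characters are arbitrary display choices.

```python
import json


def vad_meter(raw: str, width: int = 20):
    """Render a vad_status message as a text bar; return None for other types."""
    msg = json.loads(raw)
    if msg.get("type") != "vad_status":
        return None
    filled = round(msg["speech_prob"] * width)
    bar = "#" * filled + "-" * (width - filled)
    state = "speech" if msg["is_speech"] else "silence"
    return f"[{bar}] {state} {msg['buffered_ms']} ms"


print(vad_meter('{"type": "vad_status", "speech_prob": 0.87, '
                '"is_speech": true, "buffered_ms": 1250}'))
```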

Example: Quiet Voice Setup

wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=bedroom&vad=1&vad_threshold=0.15&vad_pad_ms=600&vad_min_ms=80
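
Checking tuning values against the documented ranges before connecting avoids guessing why a setting had no effect. A sketch that validates them and produces the VAD part of the query string (vad_query is a hypothetical helper; how the server treats out-of-range values is not documented, so this sketch rejects them on the client):

```python
from urllib.parse import urlencode

# Ranges copied from the parameter table above
VAD_RANGES = {
    "vad_threshold": (0.1, 0.9),
    "vad_pad_ms": (100, 2000),
    "vad_min_ms": (50, 1000),
}


def vad_query(**overrides) -> str:
    """Validate VAD tuning values and return them as a query-string fragment."""
    params = {"vad": 1}
    for name, value in overrides.items():
        if name not in VAD_RANGES:
            raise KeyError(f"unknown VAD parameter: {name}")
        low, high = VAD_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
        params[name] = value
    return urlencode(params)


print(vad_query(vad_threshold=0.15, vad_pad_ms=600, vad_min_ms=80))
# vad=1&vad_threshold=0.15&vad_pad_ms=600&vad_min_ms=80
```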

Hallucination Filtering

The server automatically filters known hallucination phrases ("Thank you.", "Thanks for watching.", etc.) that Whisper-family models produce on silence or noise. It also filters text that is suspiciously short for the length of the audio (fewer than 1 word per 5s). Use min_confidence to set an explicit confidence threshold.
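
The three rules can be pictured roughly as follows. This is only an illustration of the described behaviour, not the server's actual code: the phrase list contains just the two examples quoted above, and the function name and rule order are assumptions.

```python
KNOWN_PHRASES = {"thank you.", "thanks for watching."}  # examples from the text


def filter_text(text: str, duration_sec: float, confidence: float,
                min_confidence: float = 0.0) -> str:
    """Return text unchanged, or an empty string if a filter rule matches."""
    stripped = text.strip()
    if stripped.lower() in KNOWN_PHRASES:
        return ""  # known hallucination phrase
    word_count = len(stripped.split())
    if word_count * 5.0 < duration_sec:
        return ""  # fewer than 1 word per 5 seconds of audio
    if confidence < min_confidence:
        return ""  # below the explicit confidence threshold
    return text


print(repr(filter_text("Thank you.", 3.0, 0.9)))
print(repr(filter_text("podaj mi sól", 2.1, 0.94)))
```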

Audio Format

| Property | File upload | WebSocket streaming |
| --- | --- | --- |
| Format | Any (wav, mp3, ogg, webm, mp4, flac, ...) | Raw PCM int16 little-endian |
| Sample rate | Any (auto-resampled to 16kHz) | Specified via rate param (default: 16000) |
| Channels | Any (auto-mixed to mono) | Mono only |
| Max size | ~500MB | Unlimited (streaming) |

Supported Languages

en English, fr French, de German, it Italian, es Spanish, pt Portuguese, el Greek, nl Dutch, pl Polish, zh Chinese, ja Japanese, ko Korean, vi Vietnamese, ar Arabic


7. Architecture Notes



8. Speaker Enrollment & Recognition

Enroll speaker voiceprints to identify who is speaking during diarization. Uses speechbrain/spkrec-ecapa-voxceleb embeddings.

POST /speakers/enroll

Enroll a speaker from raw PCM int16 mono audio (3+ seconds). Multiple enrollments improve accuracy.

# Record 5s and enroll
ffmpeg -f alsa -i default -t 5 -ar 16000 -ac 1 -f s16le - | \
  curl -X POST "https://stt.mm.mk/speakers/enroll?name=Pawel&rate=16000" \
    -H "Content-Type: application/octet-stream" --data-binary @-

# Response:
{"status": "enrolled", "name": "Pawel", "duration": 5.0, "samples": 1}
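
Since enrollment expects at least 3 seconds of raw int16 mono audio, the length can be checked locally before uploading. A sketch, assuming 2 bytes per sample and a single channel (check_enrollment_audio is a hypothetical helper, and whether the server rejects shorter clips is not stated above):

```python
BYTES_PER_SAMPLE = 2  # int16, mono


def pcm_duration_sec(pcm: bytes, rate: int = 16000) -> float:
    """Duration in seconds of raw int16 mono PCM at the given sample rate."""
    return len(pcm) / (BYTES_PER_SAMPLE * rate)


def check_enrollment_audio(pcm: bytes, rate: int = 16000,
                           min_sec: float = 3.0) -> float:
    """Return the clip duration, or raise ValueError if the clip is unusable."""
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("byte count is odd; not whole int16 samples")
    duration = pcm_duration_sec(pcm, rate)
    if duration < min_sec:
        raise ValueError(f"clip is {duration:.1f}s; need at least {min_sec:.1f}s")
    return duration


five_seconds = bytes(16000 * BYTES_PER_SAMPLE * 5)  # 5s of silence at 16kHz
print(check_enrollment_audio(five_seconds))  # 5.0
```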

POST /speakers/enroll/file

Enroll from an audio file (WAV, MP3, etc.).

curl -X POST "https://stt.mm.mk/speakers/enroll/file" \
  -F "name=Pawel" -F "file=@voice_sample.wav"

GET /speakers

List all enrolled speakers.

curl https://stt.mm.mk/speakers
{"speakers": [{"name": "Pawel", "samples": 3, "duration": 15.0}], "total": 1}

DELETE /speakers/{name}

Remove an enrolled speaker.

curl -X DELETE https://stt.mm.mk/speakers/Pawel

POST /speakers/identify

Identify a speaker from an audio sample (without diarization).

curl -X POST "https://stt.mm.mk/speakers/identify?rate=16000" \
  -H "Content-Type: application/octet-stream" --data-binary @audio.raw

{"identified": "Pawel", "confidence": 0.87, "threshold": 0.55, "all_scores": {"Pawel": 0.87}}
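
A client will usually want a single name or "unknown" from this response. The sketch below assumes a match counts only when confidence reaches threshold; what the server puts in identified for a weak match is not documented here, so the comparison is repeated on the client. Both helper names are hypothetical.

```python
def resolve_speaker(result: dict):
    """Return the speaker name if the score clears the threshold, else None."""
    name = result.get("identified")
    if name and result["confidence"] >= result["threshold"]:
        return name
    return None


def ranked_scores(result: dict) -> list:
    """Return (name, score) pairs from all_scores, best match first."""
    scores = result.get("all_scores", {})
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


response = {
    "identified": "Pawel",
    "confidence": 0.87,
    "threshold": 0.55,
    "all_scores": {"Pawel": 0.87},
}

print(resolve_speaker(response))  # Pawel
```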

How it works
