Real-time speech-to-text API powered by Cohere Transcribe (8x GPU). Supports streaming WebSocket, file upload, batch processing, speaker diarization, emotion detection, and more.
| Method | Path | Description |
|---|---|---|
| GET | / | Web UI with live mic transcription |
| GET | /docs | This documentation page |
| GET | /health | Server status, GPU count, active streams |
| GET | /streams | List all active WebSocket streams |
| GET | /streams/{id} | Stream detail with last 50 transcriptions |
| GET | /client.py | Download Python streaming client |
| POST | /v1/audio/transcriptions | Transcribe a single audio file |
| POST | /v1/audio/transcriptions/batch | Transcribe multiple files at once |
| WS | /ws/transcribe | Real-time streaming transcription |
Upload an audio file and get a transcription back. OpenAI-compatible endpoint.
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file (wav, mp3, ogg, webm, mp4, flac, etc.) |
| language | string | en | Language: en, pl, fr, de, it, es, pt, el, nl, zh, ja, ko, vi, ar |
| punctuation | bool | true | Add punctuation to output |
| response_format | string | json | json, verbose_json (with segments), text (plain) |
| min_confidence | float | 0 | Confidence threshold 0-1. Below this → empty text |
| timestamps | bool | false | Enable model-native timestamps |
| diarize | bool | false | Enable speaker diarization |
| itn | bool | false | Inverse text normalization ("twenty three" → "23") |
| detect_emotion | bool | false | Detect emotion: happy, sad, angry, neutral |
# Basic transcription (Polish)
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
-F "file=@recording.mp3" \
-F "language=pl"
# With all features
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
-F "file=@meeting.wav" \
-F "language=en" \
-F "diarize=true" \
-F "detect_emotion=true" \
-F "itn=true" \
-F "timestamps=true" \
-F "response_format=verbose_json"
# Plain text output
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
-F "file=@speech.ogg" \
-F "language=pl" \
-F "response_format=text"
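The same upload can be sketched in Python. This is only an illustration: `build_form_fields` and `transcribe_file` are hypothetical helper names, the upload assumes the third-party `requests` package, and sending booleans as the strings "true"/"false" is inferred from the curl examples above rather than from a stated contract.

```python
from typing import Dict

API_URL = "https://stt.mm.mk/v1/audio/transcriptions"


def build_form_fields(language: str = "en", **options) -> Dict[str, str]:
    """Convert Python values into string form fields for the endpoint."""
    fields = {"language": language}
    for name, value in options.items():
        if isinstance(value, bool):
            # Assumption: booleans travel as "true"/"false", as in the curl calls.
            fields[name] = "true" if value else "false"
        else:
            fields[name] = str(value)
    return fields


def transcribe_file(path: str, **options):
    """Upload one audio file; returns parsed JSON, or text for response_format=text."""
    import requests  # third-party, imported lazily; assumed to be installed

    fields = build_form_fields(**options)
    with open(path, "rb") as audio:
        resp = requests.post(API_URL, files={"file": audio}, data=fields, timeout=300)
    resp.raise_for_status()
    return resp.text if fields.get("response_format") == "text" else resp.json()


fields = build_form_fields("pl", diarize=True, min_confidence=0.3)
print(fields)
```

A call such as `transcribe_file("meeting.wav", language="en", diarize=True, response_format="verbose_json")` would then mirror the "all features" curl example.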
{
"text": "Pani Przewodnicząca, jeszcze jedno pytanie.",
"language": "pl",
"duration": 5.2,
"processing_time": 0.85,
"rtf": 0.163,
"confidence": 0.9234,
"avg_logprob": -0.08,
"words": [
{"word": "Pani", "confidence": 0.95, "avg_logprob": -0.05, "start": 0.0, "end": 0.62},
{"word": "Przewodnicząca,", "confidence": 0.89, "avg_logprob": -0.12, "start": 0.62, "end": 2.41},
{"word": "jeszcze", "confidence": 0.94, "avg_logprob": -0.06, "start": 2.41, "end": 3.25},
{"word": "jedno", "confidence": 0.91, "avg_logprob": -0.09, "start": 3.25, "end": 3.85},
{"word": "pytanie.", "confidence": 0.93, "avg_logprob": -0.07, "start": 3.85, "end": 5.2}
],
"emotion": "neutral",
"speakers": [
{"speaker": "spk0", "text": "Pani Przewodnicząca, jeszcze jedno pytanie."}
]
}
emotion and speakers only appear when their respective features are enabled. words always includes per-word confidence and estimated timestamps.
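Per-word confidence makes it easy to flag words worth reviewing. The sketch below assumes a response shaped like the example above; `low_confidence_words` is a hypothetical helper and the 0.9 cut-off is arbitrary.

```python
from typing import Any, Dict, List, Tuple


def low_confidence_words(
    response: Dict[str, Any], threshold: float = 0.9
) -> List[Tuple[str, float, float, float]]:
    """Return (word, confidence, start, end) for every word below the threshold."""
    flagged = []
    for w in response.get("words", []):
        if w["confidence"] < threshold:
            flagged.append((w["word"], w["confidence"], w["start"], w["end"]))
    return flagged


# Trimmed copy of the example response shown above.
sample = {
    "text": "Pani Przewodnicząca, jeszcze",
    "words": [
        {"word": "Pani", "confidence": 0.95, "start": 0.0, "end": 0.62},
        {"word": "Przewodnicząca,", "confidence": 0.89, "start": 0.62, "end": 2.41},
        {"word": "jeszcze", "confidence": 0.94, "start": 2.41, "end": 3.25},
    ],
}

flagged = low_confidence_words(sample, threshold=0.9)
print(flagged)
```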
Upload multiple files at once. Same parameters as single transcription, but use files (plural) for multiple file uploads.
# Batch: 3 files
curl -X POST https://stt.mm.mk/v1/audio/transcriptions/batch \
-F "files=@a.mp3" \
-F "files=@b.wav" \
-F "files=@c.ogg" \
-F "language=pl" \
-F "diarize=true"
[
{"filename": "a.mp3", "text": "...", "duration": 12.5, "confidence": 0.92, "words": [...], ...},
{"filename": "b.wav", "text": "...", "duration": 8.3, "confidence": 0.88, "words": [...], ...},
{"filename": "c.ogg", "error": "Could not decode audio"}
]
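Because a batch response mixes successful items with per-file errors, a client will usually want to separate the two. A minimal sketch, assuming failed items carry an `error` key as in the example above (`split_batch_results` is a hypothetical name):

```python
from typing import Any, Dict, List, Tuple

Item = Dict[str, Any]


def split_batch_results(results: List[Item]) -> Tuple[List[Item], List[Item]]:
    """Partition batch items into (succeeded, failed) by presence of 'error'."""
    succeeded, failed = [], []
    for item in results:
        (failed if "error" in item else succeeded).append(item)
    return succeeded, failed


# Shortened stand-in for the batch response shown above.
batch = [
    {"filename": "a.mp3", "text": "...", "duration": 12.5, "confidence": 0.92},
    {"filename": "b.wav", "text": "...", "duration": 8.3, "confidence": 0.88},
    {"filename": "c.ogg", "error": "Could not decode audio"},
]

ok, bad = split_batch_results(batch)
for item in bad:
    print(f"{item['filename']}: {item['error']}")
```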
Real-time streaming transcription. Client sends raw PCM audio, server returns progressive transcription updates.
wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=kitchen
| Query Param | Default | Description |
|---|---|---|
| language | en | Language code |
| rate | 16000 | Sample rate of incoming audio |
| stream_id | anon | Identifier for this stream (e.g. room name) |
| vad | 0 | Set to 1 for server-side VAD (for dumb clients like ESP32) |
| timestamps | 0 | Enable timestamps |
| diarize | 0 | Enable speaker diarization |
| itn | 0 | Enable inverse text normalization |
| detect_emotion | 0 | Enable emotion detection |
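The connection URL can be assembled from the query parameters in the table above. A small sketch using only the standard library; `build_ws_url` is a hypothetical helper, and omitting disabled flags (rather than sending `0`) is a choice made here, relying on the documented defaults.

```python
from urllib.parse import urlencode

WS_BASE = "wss://stt.mm.mk/ws/transcribe"


def build_ws_url(
    language: str = "en", rate: int = 16000, stream_id: str = "anon", **flags: bool
) -> str:
    """Build the streaming URL; feature flags set to True are sent as 1."""
    params = {"language": language, "rate": rate, "stream_id": stream_id}
    for name, enabled in flags.items():
        if enabled:
            params[name] = 1  # disabled flags are left out and fall back to 0
    return f"{WS_BASE}?{urlencode(params)}"


url = build_ws_url("pl", stream_id="kitchen", vad=True, diarize=True, itn=False)
print(url)
```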
| Direction | Format | Content |
|---|---|---|
| Client → Server | Binary | Raw PCM int16 mono audio at specified sample rate |
| Client → Server | Binary (empty) | Empty buffer = "end of stream" signal |
| Server → Client | JSON text | Transcription updates (see below) |
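For clients that capture audio as floating-point samples, the binary frames described above can be produced with the standard library alone. This sketch assumes samples in the range [-1, 1] and little-endian int16 output; `floats_to_pcm16` is a hypothetical helper name.

```python
import struct
from typing import Iterable


def floats_to_pcm16(samples: Iterable[float]) -> bytes:
    """Clip samples to [-1, 1], scale to int16, and pack little-endian."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


# An empty binary frame is the documented "end of stream" signal.
END_OF_STREAM = b""

chunk = floats_to_pcm16([0.0, 1.0, -1.0, 2.0])  # 2.0 is clipped to 1.0
print(len(chunk), "bytes")
```

Each sample occupies two bytes, so a 4096-sample block at 16 kHz is 8192 bytes and covers 256 ms of audio.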
// Partial — live update, text may change as more audio arrives
{"type": "partial", "text": "Pani Przewodnicząca", "duration": 2.5,
"confidence": 0.85, "words": [...]}
// Final — segment complete, text is locked
{"type": "final", "text": "Pani Przewodnicząca, jeszcze jedno pytanie.",
"duration": 5.2, "confidence": 0.92, "words": [...],
"emotion": "neutral", "speakers": [...]}
By default (vad=0), the server sends partial updates as audio arrives. After ~10s it sends final and starts a new segment. With vad=1, the server detects speech segments automatically and sends final for each utterance.
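On the client side, the partial/final distinction usually means keeping locked text separate from the line that is still changing. One possible sketch, assuming only the `type` and `text` fields from the messages above; `TranscriptBuffer` is a hypothetical class, not part of the downloadable client.

```python
import json
from typing import List


class TranscriptBuffer:
    """Keeps finalized segments and the current, still-changing partial."""

    def __init__(self) -> None:
        self.finals: List[str] = []
        self.partial = ""

    def feed(self, raw: str) -> None:
        """Consume one JSON text message received from the server."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "partial":
            self.partial = msg.get("text", "")  # replaces the previous partial
        elif kind == "final":
            if msg.get("text"):
                self.finals.append(msg["text"])  # locked, will not change
            self.partial = ""
        # Any other message type (for example vad_status) is ignored here.

    def render(self) -> str:
        """Return the finalized text followed by the live partial, if any."""
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)


buf = TranscriptBuffer()
buf.feed('{"type": "partial", "text": "Pani"}')
buf.feed('{"type": "final", "text": "Pani Przewodnicząca."}')
buf.feed('{"type": "partial", "text": "jeszcze"}')
print(buf.render())
```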
Mode 1: Smart client (with client-side VAD) — client runs Silero VAD locally, only sends speech audio. Best for Raspberry Pi with Python.
Mode 2: Dumb client (with server-side VAD, vad=1) — client sends ALL audio continuously, server filters silence. Best for ESP32, microcontrollers, or simple scripts.
# Download the full-featured client with VAD
curl -o stt_client.py https://stt.mm.mk/client.py
# Install dependencies
pip install websocket-client numpy sounddevice torch
# Run with defaults (Polish, local VAD)
python3 stt_client.py --stream-id kitchen --language pl
# All features enabled
python3 stt_client.py -s bedroom -l pl --emotion --diarize --itn --timestamps -v
# List microphone devices
python3 stt_client.py --list-devices
# Use specific mic device
python3 stt_client.py --device 2 -s office
import websocket, sounddevice as sd, numpy as np, json

ws = websocket.WebSocket()
ws.connect("wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=myroom&vad=1")

def callback(indata, frames, time, status):
    # Convert float32 samples to int16 PCM and send them as one binary frame
    pcm = (indata[:, 0] * 32767).astype(np.int16)
    ws.send_binary(pcm.tobytes())

with sd.InputStream(samplerate=16000, channels=1, dtype='float32',
                    blocksize=4096, callback=callback):
    print("Streaming... Ctrl+C to stop")
    while True:
        msg = json.loads(ws.recv())
        # vad_status messages carry no "text" and are skipped here
        if msg.get("text"):
            prefix = "FINAL" if msg["type"] == "final" else "..."
            print(f"[{prefix}] {msg['text']}")
// Connect to WebSocket with server-side VAD
// wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=esp-kitchen&vad=1
// 1. Init I2S microphone at 16kHz mono (other config fields omitted; assumes the mic is on the left channel)
i2s_config_t cfg = { .sample_rate = 16000, .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT, .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT };
// 2. In loop: read I2S → send over WebSocket
while (true) {
    int16_t buffer[512];
    size_t bytes_read = 0;
    i2s_read(I2S_NUM_0, buffer, sizeof(buffer), &bytes_read, portMAX_DELAY);
    ws.sendBinary((uint8_t*)buffer, bytes_read); // raw PCM int16
    // 3. Check for incoming text messages
    if (ws.available()) {
        String msg = ws.readString();
        // Parse JSON: {"type":"final","text":"...","confidence":0.92}
        // Display on OLED, send to MQTT, trigger automation, etc.
    }
}
# Simple
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
-F "file=@audio.mp3" -F "language=pl"
# Full featured
curl -X POST https://stt.mm.mk/v1/audio/transcriptions \
-F "file=@meeting.wav" -F "language=en" \
-F "diarize=true" -F "detect_emotion=true" -F "itn=true" \
-F "response_format=verbose_json"
const WebSocket = require('ws');
const ws = new WebSocket('wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=node-client&vad=1');
ws.on('message', (data) => {
  const msg = JSON.parse(data);
  if (msg.text) console.log(`[${msg.type}] ${msg.text} (${(msg.confidence*100).toFixed(0)}%)`);
});
// Send PCM int16 audio buffers:
// ws.send(pcmBuffer); // Buffer of int16 samples at 16kHz mono
Basic health check — model loaded, GPU count, active streams.
Deep health check — runs a 1-second transcription test to verify the full pipeline (model, processor, VAD). Returns 503 if any check fails.
curl https://stt.mm.mk/health/deep
curl https://stt.mm.mk/health
{
"status": "ok",
"model": "CohereLabs/cohere-transcribe-03-2026",
"gpu_count": 8,
"languages": ["en","fr","de","it","es","pt","el","nl","pl","zh","ja","ko","vi","ar"],
"active_streams": 3
}
curl https://stt.mm.mk/streams
{
"kitchen": {
"connected_since": 3600,
"language": "pl",
"features": {"timestamps": false, "diarize": false, "itn": false, "emotion": true, "server_vad": true},
"last_text": "podaj mi sól",
"last_text_ago": 12.5,
"total_segments": 47,
"total_audio_sec": 285.3
},
"office": { ... }
}
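The `/streams` listing lends itself to simple monitoring, for example spotting microphones that have gone quiet. The sketch below assumes `last_text_ago` is the number of seconds since the last transcription, which the field name suggests but the page does not state; `quiet_streams` is a hypothetical helper and the "hallway" entry is made up for illustration.

```python
from typing import Any, Dict, List


def quiet_streams(
    streams: Dict[str, Dict[str, Any]], max_silence_sec: float
) -> List[str]:
    """Return ids of streams whose last text is older than max_silence_sec."""
    return sorted(
        stream_id
        for stream_id, info in streams.items()
        # Streams without the field are treated as quiet.
        if info.get("last_text_ago", float("inf")) > max_silence_sec
    )


sample_streams = {
    "kitchen": {"last_text": "podaj mi sól", "last_text_ago": 12.5, "total_segments": 47},
    "hallway": {"last_text": "", "last_text_ago": 900.0, "total_segments": 2},  # made-up
}

quiet = quiet_streams(sample_streams, max_silence_sec=60)
print(quiet)
```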
curl https://stt.mm.mk/streams/kitchen
{
"stream_id": "kitchen",
"connected_since": 3600,
"total_segments": 47,
"total_audio_sec": 285.3,
"transcriptions": [
{"text": "podaj mi sól", "duration": 2.1, "confidence": 0.94, "emotion": "neutral", "ago": 12.5},
{"text": "dziękuję", "duration": 1.3, "confidence": 0.97, "emotion": "happy", "ago": 45.2},
...
]
}
| Feature | API param | WS param | Description |
|---|---|---|---|
| Timestamps | timestamps=true | timestamps=1 | Model-native timestamps in output |
| Diarization | diarize=true | diarize=1 | Speaker diarization via pyannote community-1 (runs on separate GPU). Applied on finalized segments only. |
| ITN | itn=true | itn=1 | "twenty three" → "23" |
| Emotion | detect_emotion=true | detect_emotion=1 | happy, sad, angry, neutral |
| Server VAD | N/A | vad=1 | Server-side voice activity detection (Silero VAD) |
| Confidence filter | min_confidence=0.3 | N/A | Filter low-confidence hallucinations |
| Per-word confidence | always on | always on | words[].confidence (0-1) |
When vad=1 is set, the server runs Silero VAD on incoming audio. Only speech segments are passed to the transcription model — silence is discarded. This is ideal for always-on microphones (ESP32, Raspberry Pi) where the client can't run VAD locally.
| WS Param | Default | Range | Description |
|---|---|---|---|
| vad | 0 | 0 or 1 | Enable server-side VAD |
| vad_threshold | 0.3 | 0.1 to 0.9 | Speech probability threshold. Lower = catches quieter speech. Try 0.15 for soft voice, 0.5 for noisy environments. |
| vad_pad_ms | 400 | 100 to 2000 | Milliseconds of silence to keep after speech ends. Higher = fewer split segments. |
| vad_min_ms | 100 | 50 to 1000 | Minimum speech duration to trigger transcription. Filters out clicks/pops. |
The server also sends real-time VAD status via vad_status messages (see below), showing current speech probability and state.
When server VAD is enabled, the server periodically sends status updates so the client can display a live VAD meter:
// Sent every ~250ms when VAD is active
{"type": "vad_status", "speech_prob": 0.87, "is_speech": true, "buffered_ms": 1250}
{"type": "vad_status", "speech_prob": 0.02, "is_speech": false, "buffered_ms": 0}
| Field | Type | Description |
|---|---|---|
| speech_prob | float 0-1 | Current speech probability from Silero VAD |
| is_speech | bool | Whether VAD considers this as speech (prob >= threshold) |
| buffered_ms | int | Milliseconds of speech audio currently buffered |
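A live VAD meter can be as simple as a text bar driven by these three fields. A rough sketch; `render_vad_meter` is a hypothetical helper and the bar layout is arbitrary.

```python
from typing import Any, Dict


def render_vad_meter(status: Dict[str, Any], width: int = 20) -> str:
    """Render one vad_status message as a fixed-width text bar."""
    prob = max(0.0, min(1.0, float(status.get("speech_prob", 0.0))))
    filled = round(prob * width)
    bar = "#" * filled + "-" * (width - filled)
    label = "SPEECH" if status.get("is_speech") else "silence"
    buffered = status.get("buffered_ms", 0)
    return f"[{bar}] {prob:.2f} {label} ({buffered} ms buffered)"


speaking = {"type": "vad_status", "speech_prob": 0.87, "is_speech": True, "buffered_ms": 1250}
idle = {"type": "vad_status", "speech_prob": 0.02, "is_speech": False, "buffered_ms": 0}
print(render_vad_meter(speaking, width=10))
print(render_vad_meter(idle, width=10))
```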
wss://stt.mm.mk/ws/transcribe?language=pl&rate=16000&stream_id=bedroom&vad=1&vad_threshold=0.15&vad_pad_ms=600&vad_min_ms=80
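The tuning parameters can be checked against the documented ranges before connecting. This is a sketch under one assumption: the page does not say whether the server rejects or clamps out-of-range values, so the hypothetical `vad_query` helper simply refuses them on the client.

```python
from urllib.parse import urlencode

# Ranges copied from the VAD parameter table above.
VAD_RANGES = {
    "vad_threshold": (0.1, 0.9),
    "vad_pad_ms": (100, 2000),
    "vad_min_ms": (50, 1000),
}


def vad_query(**tuning: float) -> str:
    """Return a query-string fragment enabling server VAD with validated tuning."""
    params = {"vad": 1}
    for name, value in tuning.items():
        if name not in VAD_RANGES:
            raise ValueError(f"unknown VAD parameter: {name}")
        low, high = VAD_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
        params[name] = value
    return urlencode(params)


query = vad_query(vad_threshold=0.15, vad_pad_ms=600, vad_min_ms=80)
print(query)
```

The resulting fragment can be appended to the other query parameters (`language`, `rate`, `stream_id`) with `&`.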
The server automatically filters known hallucination phrases ("Thank you.", "Thanks for watching.", etc.) that Whisper-family models produce on silence or noise. It also filters text that is suspiciously short for the length of the audio (fewer than 1 word per 5s). Use min_confidence to set an explicit confidence threshold.
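To make the two heuristics concrete, here is a client-side approximation. It is not the server's implementation: the phrase list contains only the two phrases named above, and the exact matching rules (case-insensitive, whole-text match) are assumptions.

```python
# Only the two phrases quoted in the text; the server's full list is not documented here.
KNOWN_PHRASES = {"thank you.", "thanks for watching."}


def looks_like_hallucination(text: str, duration_sec: float) -> bool:
    """Approximate the filter: known phrase, or under 1 word per 5 s of audio."""
    cleaned = text.strip()
    if cleaned.lower() in KNOWN_PHRASES:
        return True
    word_count = len(cleaned.split())
    # "Fewer than 1 word per 5 s" is a rate below 0.2 words per second.
    return duration_sec > 0 and word_count / duration_sec < 1 / 5


print(looks_like_hallucination("Thank you.", 1.0))
print(looks_like_hallucination("ok", 12.0))
print(looks_like_hallucination("podaj mi sól", 2.1))
```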
| Property | File upload | WebSocket streaming |
|---|---|---|
| Format | Any (wav, mp3, ogg, webm, mp4, flac, ...) | Raw PCM int16 little-endian |
| Sample rate | Any (auto-resampled to 16kHz) | Specified via rate param (default: 16000) |
| Channels | Any (auto-mixed to mono) | Mono only |
| Max size | ~500MB | Unlimited (streaming) |
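Since the WebSocket path accepts only raw mono int16 PCM, a stereo WAV has to be unpacked and downmixed before streaming. A standard-library sketch, assuming a 16-bit PCM WAV and simple channel averaging; `wav_to_mono_pcm16` is a hypothetical helper, and no resampling is done, so the returned rate should be passed as the `rate` query parameter.

```python
import array
import io
import sys
import wave
from typing import Tuple


def wav_to_mono_pcm16(wav_bytes: bytes) -> Tuple[bytes, int]:
    """Return (little-endian int16 mono PCM, sample rate) from a 16-bit WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM WAV is handled in this sketch")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if channels > 1:
        # Average the interleaved channels of each frame.
        samples = array.array(
            "h",
            (
                sum(samples[i : i + channels]) // channels
                for i in range(0, len(samples), channels)
            ),
        )
    if sys.byteorder == "big":
        samples.byteswap()  # back to little-endian for the wire
    return samples.tobytes(), rate
```

The returned bytes can be sliced into chunks and sent as binary frames, followed by an empty frame to signal end of stream.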
en English, fr French, de German, it Italian, es Spanish, pt Portuguese, el Greek, nl Dutch, pl Polish, zh Chinese, ja Japanese, ko Korean, vi Vietnamese, ar Arabic
asyncio.to_thread(): the WS event loop is never blocked
Enroll speaker voiceprints to identify who is speaking during diarization. Uses speechbrain/spkrec-ecapa-voxceleb embeddings.
Enroll a speaker from raw PCM int16 mono audio (3+ seconds). Multiple enrollments improve accuracy.
# Record 5s and enroll
ffmpeg -f alsa -i default -t 5 -ar 16000 -ac 1 -f s16le - | \
curl -X POST "https://stt.mm.mk/speakers/enroll?name=Pawel&rate=16000" \
-H "Content-Type: application/octet-stream" --data-binary @-
# Response:
{"status": "enrolled", "name": "Pawel", "duration": 5.0, "samples": 1}
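Because enrollment expects at least 3 seconds of raw int16 mono audio, it can help to check the buffer length before posting. A small sketch; the helper names are hypothetical, and the 3-second minimum is taken from the description above.

```python
BYTES_PER_SAMPLE = 2  # int16 mono


def pcm16_duration_sec(num_bytes: int, rate: int = 16000) -> float:
    """Duration in seconds of a raw int16 mono buffer of num_bytes bytes."""
    return num_bytes / (BYTES_PER_SAMPLE * rate)


def long_enough_for_enrollment(
    pcm: bytes, rate: int = 16000, min_sec: float = 3.0
) -> bool:
    """True if the buffer holds at least min_sec seconds of audio."""
    return pcm16_duration_sec(len(pcm), rate) >= min_sec


five_seconds = bytes(5 * 16000 * BYTES_PER_SAMPLE)  # silent placeholder audio
print(pcm16_duration_sec(len(five_seconds)))
print(long_enough_for_enrollment(five_seconds))
```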
Enroll from an audio file (WAV, MP3, etc.).
curl -X POST "https://stt.mm.mk/speakers/enroll/file" \
-F "name=Pawel" -F "file=@voice_sample.wav"
List all enrolled speakers.
curl https://stt.mm.mk/speakers
{"speakers": [{"name": "Pawel", "samples": 3, "duration": 15.0}], "total": 1}
Remove an enrolled speaker.
curl -X DELETE https://stt.mm.mk/speakers/Pawel
Identify a speaker from an audio sample (without diarization).
curl -X POST "https://stt.mm.mk/speakers/identify?rate=16000" \
-H "Content-Type: application/octet-stream" --data-binary @audio.raw
{"identified": "Pawel", "confidence": 0.87, "threshold": 0.55, "all_scores": {"Pawel": 0.87}}
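A client can re-derive the decision from `all_scores` and `threshold`, for instance to apply a stricter local cut-off. This sketch assumes a match counts when the best score is greater than or equal to the threshold, which the page does not state explicitly; `best_match` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def best_match(
    result: Dict[str, Any], threshold: Optional[float] = None
) -> Optional[str]:
    """Return the top-scoring speaker name, or None if it misses the threshold."""
    scores = result.get("all_scores", {})
    if not scores:
        return None
    cutoff = result.get("threshold", 0.0) if threshold is None else threshold
    name, score = max(scores.items(), key=lambda kv: kv[1])
    # Assumption: scores equal to the threshold count as a match.
    return name if score >= cutoff else None


response = {
    "identified": "Pawel",
    "confidence": 0.87,
    "threshold": 0.55,
    "all_scores": {"Pawel": 0.87},
}
print(best_match(response))
print(best_match(response, threshold=0.9))
```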
SPEAKER_XX. Adjust via POST /speakers/threshold?threshold=0.6.
identified_as field appears in speaker segments.