local-tts-workflow

# Local TTS Workflow Use this skill to debug the **actual speech pipeline** and to prepare text so the model reads it sanely. Do **not** hardcode `127.0.0.1` blindly. Read the active OpenClaw config first and use the current `messages.tts.openai.baseUrl` as the source of truth. Current known deployment in this workspace: `http://127.0.0.1:8000/v1`. Current local model-path fallback worth remembering: if the server did not pull a model by registry name, it may be loading directly from a local path such as `./models/qwen3-tts-0.6b-mlx`. When exact route shape matters, the local OpenAPI document is available at: - `http://localhost:8000/openapi.json` Use this OpenAPI doc as a schema/reference source to compare this local `mlx-audio` server against OpenAI’s API. Do not treat it as a health check. ## Core rule: normalize numbers before synthesis If text is meant to be **spoken aloud**, do **not** leave Arabic numerals in the final TTS input. Convert them into words first. Examples: - Chinese output: write `一二三`, not `123` - English output: write `one two three`, not `123` This rule matters because the TTS model can go weird or read digits badly when fed raw numerals. When preparing spoken text, normalize: - dates - times - counts - version-like strings if they will be read aloud - mixed Chinese/English numeric snippets If preserving exact machine-readable formatting matters, keep one copy for display and a separate normalized copy for TTS. ## Workflow ### 1. Verify the server before touching OpenClaw Read `~/.openclaw/openclaw.json` first and extract: - `messages.tts.provider` - `messages.tts.openai.baseUrl` - `messages.tts.openai.model` - `messages.tts.openai.voice` Check the basics against the **actual configured host**: ```bash curl http://127.0.0.1:8000/health curl http://127.0.0.1:8000/v1/models ``` Confirm that the intended TTS model exists. If the model does not appear by pulled registry name, do not assume TTS is broken — this server may be loading a local-path model such as `./models/qwen3-tts-0.6b-mlx`. If the server is task-gated, ensure TTS is enabled: ```bash MLX_AUDIO_SERVER_TASKS=tts uv run python server.py ``` ## 2. Prove the raw TTS endpoint works Always isolate the server from the client stack. Minimal non-streaming test: ```bash curl http://127.0.0.1:8000/v1/audio/speech \ -X POST \ -H 'Content-Type: application/json' \ -d '{ "model": "/models/lj-qwen3-tts/", "voice": "lj", "input": "你好，这是一次性返回完整音频的测试。", "response_format": "wav", "stream": false }' \ --output sample.wav ``` Basic streaming test: ```bash curl http://127.0.0.1:8000/v1/audio/speech \ -H 'Content-Type: application/json' \ -X POST \ -d '{ "model": "/models/lj-qwen3-tts/", "voice": "lj", "input": "你好，这是实时流式语音合成测试。", "response_format": "wav", "stream": true, "streaming_interval": 2.0 }' \ | ffplay -i - ``` If direct `curl` works but OpenClaw does not, the bug is probably in the **TTS integration or provider selection layer**, not the TTS backend. ## 3. Distinguish server failure from integration failure Use this rule: - **Direct curl fails** → fix the local TTS server first - **Direct curl works, but OpenClaw sounds wrong or falls back** → inspect OpenClaw provider selection, fallback, and request shape - **OpenClaw sends requests but voice/mode is wrong** → inspect fields like `model`, `voice`, `instructions`, `ref_audio`, `ref_text`, and streaming flags ## 4. Know the four TTS modes Use the right request shape for the right model type. ### Base speaker Use built-in speaker playback. Typical shape: - model type: `base` - no full `ref_audio + ref_text` - `voice.id` means built-in speaker name ### Base clone Use clone-style synthesis. Typical shape: - model type: `base` - must provide both `ref_audio` and `ref_text`, or supply a consent voice identity that resolves to both Hard rule: **do not attempt clone with only `ref_audio`**. ### CustomVoice Use a model with prebuilt custom speakers. Typical shape: - model type: `custom_voice` - `voice` may be accepted either as a plain string or as `{"id":"..."}` depending on the server - for this workspace, `lj-qwen3-tts` / `/models/lj-qwen3-tts/` must use speaker/voice `lj` - do not send clone payloads ### VoiceDesign Use style-description-driven synthesis. Typical shape: - model type: `voice_design` - must provide `instructions` - do not send `voice`, `ref_audio`, or `ref_text` ## 5. Treat streaming as a real transport choice This server supports real incremental generation, not fake post-hoc slicing. Important behavior: - Current OpenAPI says `stream` defaults to `false` - `response_format` defaults to `mp3` - `streaming_interval` defaults to `2.0` - Required fields are only `model` and `input` - Extra optional fields exposed by this local server include `instruct`, `voice`, `speed`, `gender`, `pitch`, `lang_code`, `ref_audio`, `ref_text`, `temperature`, `top_p`, `top_k`, `repetition_penalty`, `response_format`, `stream`, `streaming_interval`, `max_tokens`, and `verbose` Do not assume OpenAI parity on names or defaults — check the local OpenAPI schema first. ## 6. Use consent uploads properly For consent-based clone flows, upload voice material through `/v1/audio/voice_consents`. Use `ref_text` with the recording. That is not optional in spirit, even if a workflow tries to pretend otherwise. If later synthesis depends on stored consent voices, verify that the saved identity actually maps to both: - reference audio - reference text ## 7. OpenClaw-specific debugging pattern When OpenClaw TTS appears broken: 1. Confirm `messages.tts` points at the actual configured endpoint in `openclaw.json` 2. Confirm the intended model exists in `/v1/models` or is otherwise accepted by the server; if not, check whether it is a local-path-backed deployment such as `./models/qwen3-tts-0.6b-mlx` 3. Confirm the selected provider is really the OpenAI-compatible path and not Microsoft fallback 4. Test direct `curl` with the same effective model/voice/mode assumptions 5. Inspect whether OpenClaw is falling back to another provider 6. If using `[[tts:...]]`, verify whether single-reply override keys (`model`, `voice`, maybe `provider`) are enabled and are being honored 7. If needed, compare raw request shape with a dump proxy If OpenClaw reaches the server successfully, the next question is usually **which mode did it actually request**. ## 8. Preferred test ladder Use this order: 1. `GET /health` 2. `GET /v1/models` 3. direct non-streaming TTS test 4. direct streaming TTS test 5. consent upload test if clone is involved 6. OpenAI client compatibility test if relevant 7. OpenClaw integration test 8. dump-proxy / log inspection only if still ambiguous ## 9. Common conclusions ### Server good, integration bad Typical signs: - manual `curl` returns playable audio - OpenClaw output sounds like fallback voice or wrong mode - provider selection is inconsistent Conclusion: fix integration, not inference. ### Text normalization bug Typical signs: - synthesis succeeds technically - numbers are read awkwardly, skipped, or glitched Conclusion: normalize the spoken text first. Do not blame the transport layer for a prompt-content problem. ### Mode mismatch Typical signs: - clone request sent to CustomVoice - VoiceDesign called without `instructions` - only `ref_audio` present for Base clone Conclusion: wrong request semantics for the chosen model type. ## 10. Use the reference doc when exact fields matter Read `references/tts-api.md` when you need exact behavior for: - `/v1/audio/speech` - `/v1/audio/voice_consents` - streaming vs non-streaming - `stream_format="audio"` vs `stream_format="event"` - mode selection and response headers - consent storage semantics - exact model/request mismatch errors Do **not** assume generic OpenAI TTS docs fully match this local server. ## Resources ### references/ - `references/tts-api.md` — exact local API behavior, streaming semantics, mode rules, consent upload flow, and common error conditions

local-tts-workflow

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

local-tts-workflow