moark-tts

# Text-to-Speech (TTS)

This skill supports Gitee AI TTS plus CosyVoice voice feature extraction workflows.

It supports fifteen user-facing model choices for TTS:

- `audiofly`
- `chattts`
- `cosyvoice2`
- `cosyvoice3`
- `cosyvoice-300m`
- `fish-speech-1.2-sft`
- `index-tts-1.5`
- `index-tts-2`
- `glm-tts`
- `megatts3`
- `moss-ttsd-v0.5`
- `qwen-tts`
- `spark-tts-0.5b`
- `step-audio-tts-3b`
- `vibevoice-large`

When the user does not specify a model, ask them to choose one. After the model is chosen, only ask for parameters that are relevant to that model.

## Usage

Use the bundled script to generate speech.

```bash
python {baseDir}/scripts/perform_tts.py --model cosyvoice2 --text "你好，我是模力方舟。" --voice alloy --api-key YOUR_API_KEY
```

For CosyVoice-300M voice feature extraction (voice cloning prep), use:

```bash
python {baseDir}/scripts/perform_voice_feature_extraction.py --model FunAudioLLM-CosyVoice-300M --prompt "提供用于声纹提取的提示文本" --file-url "https://example.com/sample.mp3" --api-key YOUR_API_KEY
```

## Options

- `--model` required: `audiofly`, `chattts`, `cosyvoice2`, `cosyvoice3`, `cosyvoice-300m`, `fish-speech-1.2-sft`, `index-tts-1.5`, `index-tts-2`, `glm-tts`, `megatts3`, `moss-ttsd-v0.5`, `qwen-tts`, `spark-tts-0.5b`, `step-audio-tts-3b`, or `vibevoice-large`
- `--text` required in general: text to synthesize. For Qwen3-TTS multi-input mode (`--qwen-inputs-json`), `--text` is optional
- `--mode` optional: `auto`, `sync`, or `async`
- `--prompt` optional: model-specific style prompt such as ChatTTS tags
- `--prompt-text` optional: reference transcript for style-conditioned models
- `--prompt-audio-url` optional: reference audio URL for style-conditioned models
- `--qwen-inputs-json` optional: structured Qwen3-TTS `inputs` JSON (array/object). Supports mixed built-in and custom voice items
- `--speaker` optional: Qwen3-TTS built-in speaker for single input (`Vivian`, `Serena`, `Uncle_Fu`, `Dylan`, `Eric`, `Ryan`, `Aiden`, `Ono_Anna`, `Sohee`)
- `--language` optional: Qwen3-TTS language for single input (`Chinese` or `English`)
- `--instruction` optional: Qwen3-TTS style instruction for single input
- `--prompt-audio-urls` optional: `vibevoice-large` reference audio; supports one URL or JSON array string such as `["https://a.wav","https://b.wav"]`
- `--emo-audio-prompt-url` optional: emotion reference audio URL for IndexTTS-2
- `--emo-alpha` optional: emotion mixing weight for IndexTTS-2 audio emotion control
- `--emo-text` optional: emotion control text for IndexTTS-2
- `--use-emo-text` optional: enable or disable `emo_text` for IndexTTS-2 (`true`/`false`)
- `--prompt-wav-url` optional: reference prompt WAV URL for CosyVoice2 or CosyVoice3
- `--voice-url` optional: reference voice audio URL for ChatTTS or fish-speech-1.2-sft cloning
- `--instruct-text` optional: model-specific instruction text such as CosyVoice2 or CosyVoice3 speaking style guidance
- `--seed` optional: model-specific seed value such as CosyVoice2 or CosyVoice3
- `--audio-mode` optional: `single` or `role` for `moss-ttsd-v0.5` (required when mode cannot be inferred from fields)
- `--prompt-audio-single-url` optional: single-speaker reference audio URL for `moss-ttsd-v0.5` single mode
- `--prompt-text-single` optional: single-speaker reference transcript for `moss-ttsd-v0.5` single mode
- `--prompt-audio-1-url` optional: speaker-1 reference audio URL for `moss-ttsd-v0.5` role mode
- `--prompt-text-1` optional: speaker-1 reference transcript for `moss-ttsd-v0.5` role mode
- `--prompt-audio-2-url` optional: speaker-2 reference audio URL for `moss-ttsd-v0.5` role mode
- `--prompt-text-2` optional: speaker-2 reference transcript for `moss-ttsd-v0.5` role mode
- `--use-normalize` optional: enable or disable `use_normalize` for `moss-ttsd-v0.5` (`true`/`false`)
- `--prompt-language` optional: prompt language hint for models such as MegaTTS3
- `--intelligibility-weight` optional: pronunciation intelligibility weight for models such as MegaTTS3
- `--similarity-weight` optional: timbre similarity weight for models such as MegaTTS3
- `--temperature` optional: model-specific sampling temperature
- `--top-p` optional: model-specific top-p sampling value
- `--top-k` optional: model-specific top-k sampling value
- `--gender` optional: async TTS gender hint
- `--pitch` optional: async TTS pitch hint
- `--speed` optional: async TTS speed hint (for example CosyVoice3, Spark-TTS-0.5B, or Qwen3-TTS)
- `--num-inference-steps` optional: AudioFly generation step count
- `--guidance-scale` optional: AudioFly classifier-free guidance scale
- `--output-format` optional: AudioFly or Qwen3-TTS output format such as `mp3` or `wav`
- `--voice` optional: OpenAI-compatible voice field when supported by the target model
- `--extra-body-json` optional: JSON object for explicitly requested undocumented fields
- `--response-data-format` optional: `url` or `blob` for sync TTS
- `--output` optional: output file path when sync TTS returns binary audio
- `--failover-enabled` optional: request header `X-Failover-Enabled`, defaults to `true`
- `perform_voice_feature_extraction.py` options: `--prompt`, `--file-url` (URL only), `--model` (default `FunAudioLLM-CosyVoice-300M`), `--failover-enabled`, `--output`, `--api-key`

## Workflow

1. Determine whether the user wants speech synthesis or CosyVoice voice-feature extraction.
2. For speech synthesis: ask the user to choose one of `audiofly`, `chattts`, `cosyvoice2`, `cosyvoice3`, `cosyvoice-300m`, `fish-speech-1.2-sft`, `index-tts-1.5`, `index-tts-2`, `glm-tts`, `megatts3`, `moss-ttsd-v0.5`, `qwen-tts`, `spark-tts-0.5b`, `step-audio-tts-3b`, or `vibevoice-large` if not specified.
3. For speech synthesis: read [references/models.md](./references/models.md), gather missing model-specific params, and execute `perform_tts.py`.
4. For voice-feature extraction: execute `perform_voice_feature_extraction.py` with `--prompt` and URL-only `--file-url`.
5. Parse script output.
6. For TTS output, prioritize `AUDIO_URL:` then `AUDIO_FILE:` then `TTS_RESULT:`.
7. For voice feature output, prioritize `VOICE_URL:` (if present), otherwise return `VOICE_FEATURE_FILE:` and summarize `VOICE_FEATURE_RESULT:`.

## Notes

- Keep the answer language consistent with the user's language.
- This script is standard-library only and is intended to run directly with `python`; do not require `uv` for `moark-tts`.
- If `GITEEAI_API_KEY` is missing, remind the user to provide `--api-key`.
- By default, all TTS requests send `X-Failover-Enabled: true`. Only set `--failover-enabled false` when the user explicitly needs to disable failover.
- `audiofly` is mapped to the official model name `AudioFly`. Use async mode only. When the user shows an OpenAI SDK example that puts `num_inference_steps`, `guidance_scale`, or `output_format` under `extra_body`, map them to `--num-inference-steps`, `--guidance-scale`, and `--output-format`.
- `chattts` is mapped to the official model name `ChatTTS`. When the user shows an OpenAI SDK example that puts `prompt`, `temperature`, `top_P`, `top_K`, or `voice_url` under `extra_body`, map them to `--prompt`, `--temperature`, `--top-p`, `--top-k`, and `--voice-url`.
- `cosyvoice2` is mapped to the official model name `CosyVoice2`. When the user shows an OpenAI SDK example that puts `prompt_wav_url`, `prompt_text`, `instruct_text`, or `seed` under `extra_body`, map them to `--prompt-wav-url`, `--prompt-text`, `--instruct-text`, and `--seed`.
- `cosyvoice3` is mapped to the official model name `CosyVoice3`. Use async mode only. When the user shows an OpenAI SDK example that puts `prompt_wav_url`, `prompt_text`, `instruct_text`, `speed`, or `seed` under `extra_body`, map them to `--prompt-wav-url`, `--prompt-text`, `--instruct-text`, `--speed`, and `--seed`.
- `cosyvoice-300m` is mapped to `FunAudioLLM-CosyVoice-300M` for sync `/audio/speech`. Map OpenAI `extra_body.voice_url` to `--voice-url`.
- CosyVoice voice-feature extraction uses `/audio/voice-feature-extraction` and is handled by `perform_voice_feature_extraction.py`; `--file-url` must be an http(s) URL (no local file path support).
- `fish-speech-1.2-sft` uses sync `/audio/speech`. When the user shows an OpenAI SDK example that puts `voice_url` under `extra_body`, map it to `--voice-url`.
- `index-tts-1.5` currently uses the sync `/audio/speech` endpoint. When the user shows an OpenAI SDK example that puts `prompt_audio_url` under `extra_body`, map it to the script's `--prompt-audio-url`.
- `index-tts-2` supports four emotion-control patterns: sync/async + audio-emotion/text-emotion. Map `emo_audio_prompt_url` + `emo_alpha` to `--emo-audio-prompt-url` + `--emo-alpha`; map `emo_text` + `use_emo_text` to `--emo-text` + `--use-emo-text`. In auto mode it defaults to sync; when user asks async, force `--mode async`.
- `megatts3` is mapped to the official model name `MegaTTS3`. When the user shows an OpenAI SDK example that puts `prompt_language`, `intelligibility_weight`, or `similarity_weight` under `extra_body`, map them to `--prompt-language`, `--intelligibility-weight`, and `--similarity-weight`.
- `step-audio-tts-3b` is mapped to the official model name `Step-Audio-TTS-3B`. When the user shows an OpenAI SDK example that puts `prompt_audio_url` and `prompt_text` under `extra_body`, map them to `--prompt-audio-url` and `--prompt-text`.
- `spark-tts-0.5b` is mapped to the official model name `Spark-TTS-0.5B`. Use async mode only. For plain synthesis, just pass text. For voice cloning, map `prompt_audio_url` and `prompt_text` to `--prompt-audio-url` and `--prompt-text`; `gender`/`pitch`/`speed` can be passed when explicitly requested.
- `qwen-tts` is mapped to the official model name `Qwen3-TTS`. Use async mode only. Prefer structured `inputs` items:
  - Built-in speaker item: `prompt` + `speaker` + optional `language` (`Chinese`/`English`) + optional `instruction`.
  - Custom voice item: `prompt` + `prompt_audio_url` + `prompt_text` + optional `language` + optional `instruction`.
  - Use `--qwen-inputs-json` for multiple items in one request; use `--speaker`/`--language`/`--instruction` for single-item mode.
  - Built-in speakers: `Vivian`, `Serena`, `Uncle_Fu`, `Dylan`, `Eric`, `Ryan`, `Aiden`, `Ono_Anna`, `Sohee`.
- `moss-ttsd-v0.5` is mapped to the official model name `MOSS-TTSD-v0.5`. Use async mode only. Map single mode fields `prompt_audio_single_url` + `prompt_text_single` to `--prompt-audio-single-url` + `--prompt-text-single`, and role mode fields (`prompt_audio_1_url`, `prompt_text_1`, `prompt_audio_2_url`, `prompt_text_2`) to the matching CLI options. Pass `audio_mode` through `--audio-mode` and `use_normalize` through `--use-normalize`.
- `vibevoice-large` is mapped to the official model name `VibeVoice-Large`. Use async mode only. Map `prompt_audio_urls` to `--prompt-audio-urls`, and accept both a single URL string and a JSON array string. When the user provides only `prompt_audio_url`, map it into `prompt_audio_urls` automatically for compatibility.
- `glm-tts` currently exposes only the basic sync request in the official OpenAPI spec.
- Do not invent model parameters. If a field is not documented for that model, only pass it when the user explicitly asked for it and use `--extra-body-json`.
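
The `extra_body` field-to-flag mapping described in the notes and the output-marker priority from the workflow can be sketched in Python as below. This is a minimal illustration, not code from the bundled scripts: the helper names `extra_body_to_cli_args` and `pick_tts_result` are hypothetical, the flag rule (lowercase, underscores to hyphens) is inferred from the documented pairs such as `top_P` to `--top-p`, and the assumption that each marker starts its own line of script output is not confirmed by the source. Model-specific exceptions, such as mapping a lone `prompt_audio_url` into `prompt_audio_urls` for `vibevoice-large`, are not handled here.

```python
# Marker priority for TTS output: AUDIO_URL, then AUDIO_FILE, then TTS_RESULT.
TTS_MARKERS = ("AUDIO_URL:", "AUDIO_FILE:", "TTS_RESULT:")


def extra_body_to_cli_args(extra_body):
    """Convert documented `extra_body` fields into CLI flags.

    Hypothetical helper: lowercases the key and swaps underscores for
    hyphens, so `top_P` becomes `--top-p`. Booleans become `true`/`false`.
    """
    args = []
    for key, value in extra_body.items():
        flag = "--" + key.lower().replace("_", "-")
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.extend([flag, str(value)])
    return args


def pick_tts_result(stdout):
    """Return (marker, value) for the highest-priority marker, or None.

    Assumes each marker begins a line and its value follows on that line.
    """
    found = {}
    for line in stdout.splitlines():
        line = line.strip()
        for marker in TTS_MARKERS:
            # Keep only the first occurrence of each marker.
            if line.startswith(marker) and marker not in found:
                found[marker] = line[len(marker):].strip()
    for marker in TTS_MARKERS:
        if marker in found:
            return marker, found[marker]
    return None
```

The returned argument list can be appended to a `perform_tts.py` command line. Only pass fields that are documented for the chosen model; anything else belongs in `--extra-body-json` and only when the user explicitly asks for it.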
