After generating over 220 songs on Suno across three months, a clear pattern emerged: structured, JSON-formatted style prompts, despite being logically precise and human-readable, consistently produced worse musical output than equivalent natural language descriptions.
The intuition behind JSON prompts is reasonable. By organizing musical parameters into key-value pairs — separating genre, tempo, drum characteristics, chord voicings — a creator gains a sense of systematic control. The prompt feels engineered. Exact.
But the music said otherwise. Tracks generated from JSON-style prompts tended to sound more rigid, less emotionally coherent, and occasionally showed signs of prompt "leakage" — where structural syntax tokens appeared to bleed into lyrical generation. Natural language prompts, while less visually organized, consistently produced warmer, more stylistically accurate results.
Core question: Is this a coincidence, a user error, or a fundamental property of how Suno V5 processes input? This experiment was designed to find out.
The experiment used a single musical concept — a 90s R&B / Neo-Soul fusion track at 92–98 BPM with jazz chord voicings and a laid-back groove — and expressed it in two fundamentally different prompt formats. All other variables (lyrics, generation seed type, model version) were held constant.
Twenty-five songs were generated per format. Each was evaluated on five dimensions: Style Accuracy, Vocal Naturalness, Rhythmic Coherence, Mix Clarity, and Prompt Leakage Rate (whether style tokens appeared in lyrics).
```
// Structured, key-value format
{
  "genre": "90s R&B / Neo-Soul Fusion",
  "tempo": "92-98 BPM",
  "groove": "swinged mid-tempo pocket, deep laid-back rhythm",
  "drums": {
    "clap": "Reverse clap placed before main clap hit",
    "snare": "Soft layered snare with analog warmth",
    "kick": "Rounded, sub-heavy but controlled",
    "hi_hat": "Loose swing groove, subtle triplet bounce"
  },
  "chords": {
    "progression_style": "Extended jazz voicings (9th, 11th, 13th)",
    "movement": "Descending inversion transitions",
    "texture": "Warm Rhodes / Electric piano dominant"
  }
}
```
```
// Flowing, descriptive format
dreamy plug nb, atmospheric pads,
bell pluck melody, soft 808,
sparse drums, melodic rap,
romantic late night energy,
weightless vibe,
90s R&B Neo-Soul, laid-back
swing groove, warm Rhodes chords
with extended jazz voicings,
reverse clap, sub-heavy kick,
loose hi-hat triplets,
92 BPM, analog warmth
```
Scoring was conducted blindly — songs were evaluated without knowledge of which format produced them. Each dimension was rated 1–10. Prompt leakage was measured as a binary event per song.
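As a concrete illustration, the per-format averages and relative deltas reported below can be computed with a short aggregation script. The records here are hypothetical toy values, not the experiment's raw data, and the field names are my own:

```python
from statistics import mean

# Hypothetical per-song records: (prompt_format, dimension scores, leakage flag).
# Real experiment: n=25 per format, five dimensions, blind 1-10 ratings.
songs = [
    ("json", {"style": 5, "vocal": 6}, True),
    ("json", {"style": 6, "vocal": 6}, False),
    ("nl",   {"style": 8, "vocal": 8}, False),
    ("nl",   {"style": 8, "vocal": 7}, False),
]

def summarize(fmt):
    """Average each rated dimension and compute the leakage rate for one format."""
    rows = [s for s in songs if s[0] == fmt]
    avgs = {dim: mean(r[1][dim] for r in rows) for dim in rows[0][1]}
    leak_rate = sum(1 for r in rows if r[2]) / len(rows)
    return avgs, leak_rate

json_avgs, json_leak = summarize("json")
nl_avgs, nl_leak = summarize("nl")

# Relative improvement of natural language over JSON, as in the delta column
style_delta_pct = (nl_avgs["style"] - json_avgs["style"]) / json_avgs["style"] * 100
```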
| Metric | JSON (avg) | Natural Language (avg) | Relative Change (NL vs JSON) | Notable Pattern |
|---|---|---|---|---|
| Style Accuracy | 5.8 / 10 | 7.9 / 10 | +36% | JSON over-specifies; model ignores sub-parameters |
| Vocal Naturalness | 6.1 / 10 | 7.6 / 10 | +25% | JSON vocals sounded more "robotic" and compressed |
| Rhythmic Coherence | 5.4 / 10 | 7.4 / 10 | +37% | Swing/groove feel nearly absent in JSON outputs |
| Mix Clarity | 6.6 / 10 | 7.8 / 10 | +18% | Smallest gap — mix quality less prompt-dependent |
| Prompt Leakage Rate | 36% | 4% | −89% | Syntax tokens ("key", "drums", "{") appearing in lyrics |
Unexpected finding: The two JSON songs that outperformed the natural language average were both in genres with strong structural rigidity (electronic, drill). This suggests JSON format may have niche utility in highly structured genres where "feel" matters less than technical precision.
Suno's underlying model was trained on vast corpora of human-written music descriptions, reviews, genre tags, and production notes — all of which are expressed in natural language. JSON syntax, by contrast, represents a vanishingly small fraction of that training distribution.
When a JSON-formatted prompt is submitted, the model must first allocate attention to parsing structural syntax (brackets, colons, quotation marks, nesting) before reaching semantic content. This creates upstream attention cost — computational resources spent on format interpretation rather than musical intent.
Natural language prompts arrive in a format that maps directly to the model's learned representations, enabling faster and more accurate semantic activation.
Suno V5 does not maintain a hard architectural wall between style prompt processing and lyric generation. The prompt leakage phenomenon — where tokens like "drums", "kick", or even structural characters appear in generated lyrics — suggests that style and lyric generation share overlapping attention space.
JSON prompts introduce high-frequency structural tokens (curly braces, colons, quotes) that don't exist in natural music description. These tokens have no learned musical meaning, but occupy token positions in the input sequence. The model, trained to generate coherent continuations, occasionally surfaces these tokens in unexpected output positions — the generation boundary becomes "leaky."
Natural language prompts contain no such structural noise. Every token carries semantic musical weight.
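A leakage check of the kind measured here can be sketched as a simple substring scan. The token list below is an illustrative assumption, not the experiment's exact criterion, and a bare word like "drums" can of course appear in legitimate lyrics, so a real detector would need context:

```python
# Structural and key tokens from a JSON style prompt that should not
# surface in generated lyrics. Illustrative set only.
STRUCTURAL_TOKENS = ["{", "}", '":', "hi_hat", "progression_style", "drums"]

def has_leakage(lyrics: str) -> bool:
    """Flag a lyric as leaked if any structural token appears in it."""
    low = lyrics.lower()
    return any(tok in low for tok in STRUCTURAL_TOKENS)

clean = has_leakage("Late night, your love keeps me warm")
leaked = has_leakage('she said { "drums": keep it soft }')
```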
Music — particularly feel-driven genres like Neo-Soul — emerges from the interaction between specified elements, not from each element in isolation. A "loose swing groove" is not the sum of hi-hat + kick + snare specifications. It is an emergent property of their relationship, timing, and interaction.
JSON prompts force the model to process each parameter independently, as discrete objects. Natural language prompts allow parameters to bleed into each other — "swinged mid-tempo pocket with deep laid-back rhythm" activates a holistic groove concept, not a component checklist.
This matters most in genres where feel is the product. In highly technical genres (EDM, drill), component-level specification may actually be advantageous — which explains the exception cases in our data.
These findings point to specific, addressable architectural and UX decisions for the next model generation. The following recommendations are ordered by estimated implementation complexity.
Before the style prompt reaches the core generation model, a lightweight pre-processing layer should normalize structured input (JSON, YAML, markdown lists) into natural language equivalents. This would allow power users to write structured prompts for clarity and organization, while the model receives input in its native format. The normalization layer could itself be a small fine-tuned LLM.
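A minimal sketch of such a layer, written here as a deterministic flattener rather than the proposed fine-tuned LLM (the function name and behavior are illustrative assumptions):

```python
import json

def normalize_style_prompt(structured: str) -> str:
    """Flatten a JSON style prompt into a comma-separated prose prompt,
    keeping semantic values and discarding structural syntax and keys."""
    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                yield from walk(value)
        elif isinstance(node, list):
            for item in node:
                yield from walk(item)
        else:
            yield str(node)
    return ", ".join(walk(json.loads(structured)))

prompt = '{"genre": "90s R&B / Neo-Soul", "drums": {"kick": "sub-heavy kick"}}'
normalize_style_prompt(prompt)  # → "90s R&B / Neo-Soul, sub-heavy kick"
```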
The 36% leakage rate in JSON prompts — and the 4% residual in natural language — suggests insufficient isolation between style conditioning and lyric generation. V6 should explore stronger prompt boundary tokens, separate attention heads for style vs. lyric processing, or a two-stage generation architecture where style conditioning is finalized before lyric generation begins.
If structured prompts are a meaningful use case for Suno's power user base, the training data should reflect this. Adding a fine-tuning dataset of JSON/structured prompts paired with their natural language equivalents — and their corresponding high-quality outputs — would teach the model to extract musical intent from structured formats as effectively as from prose.
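One possible shape for such a training pair — the schema and field names are entirely hypothetical:

```python
# Hypothetical fine-tuning record: a structured prompt, its prose
# equivalent, and a pointer to a high-quality reference output.
record = {
    "structured_prompt": '{"genre": "90s R&B / Neo-Soul", "tempo": "92 BPM"}',
    "natural_prompt": "90s R&B Neo-Soul, warm Rhodes chords, 92 BPM",
    "reference_output": "song-id-placeholder",  # assumed identifier field
}
```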
Given the data showing JSON may outperform natural language in highly structured genres (electronic, drill), V6 could implement a genre-detection system that recommends an optimal prompt strategy based on musical context. Feel-driven genres would be nudged toward holistic natural language; structure-driven genres could leverage more precise parameter specification.
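A first cut at such routing could be a simple keyword heuristic. The genre lists below are illustrative assumptions for this sketch, not Suno's taxonomy:

```python
# Nudge feel-driven genres toward prose, structure-driven toward parameters.
STRUCTURE_DRIVEN = {"edm", "drill", "techno", "house"}
FEEL_DRIVEN = {"neo-soul", "r&b", "soul", "jazz", "gospel"}

def recommend_prompt_strategy(genre: str) -> str:
    g = genre.lower()
    if any(k in g for k in STRUCTURE_DRIVEN):
        return "structured"  # parameter-level specification may help
    if any(k in g for k in FEEL_DRIVEN):
        return "natural"     # holistic prose captures groove better
    return "natural"         # default, per the experiment's overall result

recommend_prompt_strategy("90s R&B / Neo-Soul")  # → "natural"
recommend_prompt_strategy("UK drill")            # → "structured"
```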
The deepest finding in this research is that musical feel is an emergent property — it cannot be fully specified component by component. V6's training data curation should prioritize examples that demonstrate groove, feel, and emotional coherence as holistic targets, not as aggregations of technical parameters. This is a training philosophy shift, not just an architecture change.
This experiment has real limitations. Evaluation was conducted by a single researcher without blind inter-rater reliability testing. Sample size (n=50) is sufficient for directional findings but not for statistical significance claims. The musical concept used (90s R&B / Neo-Soul) may not generalize to all genres.
What this research offers is not proof, but a structured, reproducible observation from a user with deep platform experience — the kind of signal that internal teams may not be able to generate themselves due to proximity bias and optimization tunnel vision.
The author has generated 220+ songs on Suno over three months, including daily high-volume generation (30+ songs/day) with systematic quality filtering. The patterns described here have been observed consistently across hundreds of additional informal tests beyond this formal experiment.
To the Suno team: This research was written in the spirit of genuine collaboration. The platform has enabled a new kind of musical creativity that didn't exist before. These findings are offered as an attempt to help push it further — from impressive to extraordinary. The path from V5 to V6 runs directly through understanding how users actually prompt, not just how they ideally should.