After generating over 220 songs on Suno across three months, a clear pattern emerged: structured, JSON-formatted style prompts, despite being logically precise and human-readable, consistently produced worse musical output than equivalent natural language descriptions.
The intuition behind JSON prompts is reasonable. By organizing musical parameters into key-value pairs — separating genre, tempo, drum characteristics, chord voicings — a creator gains a sense of systematic control. The prompt feels engineered. Exact.
But the music said otherwise. Tracks generated from JSON-style prompts tended to sound more rigid, less emotionally coherent, and occasionally showed signs of prompt "leakage" — where structural syntax tokens appeared to bleed into lyrical generation. Natural language prompts, while less visually organized, consistently produced warmer, more stylistically accurate results.
Core question: Is this a coincidence, a user error, or a fundamental property of how Suno V5 processes input? This experiment was designed to find out.
The experiment used a single musical concept — a 90s R&B / Neo-Soul fusion track at 92–98 BPM with jazz chord voicings and a laid-back groove — and expressed it in two fundamentally different prompt formats. All other variables (lyrics, generation seed type, model version) were held constant.
Twenty-five songs were generated per format. Each was evaluated on five dimensions: Style Accuracy, Vocal Naturalness, Rhythmic Coherence, Mix Clarity, and Prompt Leakage Rate (whether style tokens appeared in lyrics).
```
// Structured, key-value format
{
  "genre": "90s R&B / Neo-Soul Fusion",
  "tempo": "92-98 BPM",
  "groove": "swinged mid-tempo pocket, deep laid-back rhythm",
  "drums": {
    "clap": "Reverse clap placed before main clap hit",
    "snare": "Soft layered snare with analog warmth",
    "kick": "Rounded, sub-heavy but controlled",
    "hi_hat": "Loose swing groove, subtle triplet bounce"
  },
  "chords": {
    "progression_style": "Extended jazz voicings (9th, 11th, 13th)",
    "movement": "Descending inversion transitions",
    "texture": "Warm Rhodes / Electric piano dominant"
  }
}
```
```
// Flowing, descriptive format
dreamy plug nb, atmospheric pads,
bell pluck melody, soft 808,
sparse drums, melodic rap,
romantic late night energy,
weightless vibe,
90s R&B Neo-Soul, laid-back
swing groove, warm Rhodes chords
with extended jazz voicings,
reverse clap, sub-heavy kick,
loose hi-hat triplets,
92 BPM, analog warmth
```
Scoring was conducted blindly — songs were evaluated without knowledge of which format produced them. Each dimension was rated 1–10. Prompt leakage was measured as a binary event per song.
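As a concrete illustration, the per-format averages and relative deltas reported below can be computed with a short aggregation script. The records here are hypothetical toy values, not the experiment's raw data, and the field names are my own:

```python
from statistics import mean

# Hypothetical per-song records: (prompt_format, dimension scores, leakage flag).
# Real experiment: n=25 per format, five dimensions, blind 1-10 ratings.
songs = [
    ("json", {"style": 5, "vocal": 6}, True),
    ("json", {"style": 6, "vocal": 6}, False),
    ("nl",   {"style": 8, "vocal": 8}, False),
    ("nl",   {"style": 8, "vocal": 7}, False),
]

def summarize(fmt):
    """Average each rated dimension and compute the leakage rate for one format."""
    rows = [s for s in songs if s[0] == fmt]
    avgs = {dim: mean(r[1][dim] for r in rows) for dim in rows[0][1]}
    leak_rate = sum(1 for r in rows if r[2]) / len(rows)
    return avgs, leak_rate

json_avgs, json_leak = summarize("json")
nl_avgs, nl_leak = summarize("nl")

# Relative improvement of natural language over JSON, as in the delta column
style_delta_pct = (nl_avgs["style"] - json_avgs["style"]) / json_avgs["style"] * 100
```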
| Metric | JSON (avg) | Natural Language (avg) | Relative Change (NL vs JSON) | Notable Pattern |
|---|---|---|---|---|
| Style Accuracy | 5.8 / 10 | 7.9 / 10 | +36% | JSON over-specifies; model ignores sub-parameters |
| Vocal Naturalness | 6.1 / 10 | 7.6 / 10 | +25% | JSON vocals sounded more "robotic" and compressed |
| Rhythmic Coherence | 5.4 / 10 | 7.4 / 10 | +37% | Swing/groove feel nearly absent in JSON outputs |
| Mix Clarity | 6.6 / 10 | 7.8 / 10 | +18% | Smallest gap — mix quality less prompt-dependent |
| Prompt Leakage Rate | 36% | 4% | −89% | Syntax tokens ("key", "drums", "{") appearing in lyrics |
Unexpected finding: The two JSON songs that outperformed the natural language average were both in genres with strong structural rigidity (electronic, drill). This suggests JSON format may have niche utility in highly structured genres where "feel" matters less than technical precision.
Suno's underlying model was trained on vast corpora of human-written music descriptions, reviews, genre tags, and production notes — all of which are expressed in natural language. JSON syntax, by contrast, represents a vanishingly small fraction of that training distribution.
When a JSON-formatted prompt is submitted, the model must first allocate attention to parsing structural syntax (brackets, colons, quotation marks, nesting) before reaching semantic content. This creates upstream attention cost — computational resources spent on format interpretation rather than musical intent.
Natural language prompts arrive in a format that maps directly to the model's learned representations, enabling faster and more accurate semantic activation.
Suno V5 does not maintain a hard architectural wall between style prompt processing and lyric generation. The prompt leakage phenomenon — where tokens like "drums", "kick", or even structural characters appear in generated lyrics — suggests that style and lyric generation share overlapping attention space.
JSON prompts introduce high-frequency structural tokens (curly braces, colons, quotes) that don't exist in natural music description. These tokens have no learned musical meaning, but occupy token positions in the input sequence. The model, trained to generate coherent continuations, occasionally surfaces these tokens in unexpected output positions — the generation boundary becomes "leaky."
Natural language prompts contain no such structural noise. Every token carries semantic musical weight.
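A leakage check of the kind measured here can be sketched as a simple substring scan. The token list below is an illustrative assumption, not the experiment's exact criterion, and a bare word like "drums" can of course appear in legitimate lyrics, so a real detector would need context:

```python
# Structural and key tokens from a JSON style prompt that should not
# surface in generated lyrics. Illustrative set only.
STRUCTURAL_TOKENS = ["{", "}", '":', "hi_hat", "progression_style", "drums"]

def has_leakage(lyrics: str) -> bool:
    """Flag a lyric as leaked if any structural token appears in it."""
    low = lyrics.lower()
    return any(tok in low for tok in STRUCTURAL_TOKENS)

clean = has_leakage("Late night, your love keeps me warm")
leaked = has_leakage('she said { "drums": keep it soft }')
```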
Music — particularly feel-driven genres like Neo-Soul — emerges from the interaction between specified elements, not from each element in isolation. A "loose swing groove" is not the sum of hi-hat + kick + snare specifications. It is an emergent property of their relationship, timing, and interaction.
JSON prompts force the model to process each parameter independently, as discrete objects. Natural language prompts allow parameters to bleed into each other — "swinged mid-tempo pocket with deep laid-back rhythm" activates a holistic groove concept, not a component checklist.
This matters most in genres where feel is the product. In highly technical genres (EDM, drill), component-level specification may actually be advantageous — which explains the exception cases in our data.
These findings point to specific, addressable architectural and UX decisions for the next model generation. The following recommendations are ordered by estimated implementation complexity.
Before the style prompt reaches the core generation model, a lightweight pre-processing layer should normalize structured input (JSON, YAML, markdown lists) into natural language equivalents. This would allow power users to write structured prompts for clarity and organization, while the model receives input in its native format. The normalization layer could itself be a small fine-tuned LLM.
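A minimal sketch of such a layer, written here as a deterministic flattener rather than the proposed fine-tuned LLM (the function name and behavior are illustrative assumptions):

```python
import json

def normalize_style_prompt(structured: str) -> str:
    """Flatten a JSON style prompt into a comma-separated prose prompt,
    keeping semantic values and discarding structural syntax and keys."""
    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                yield from walk(value)
        elif isinstance(node, list):
            for item in node:
                yield from walk(item)
        else:
            yield str(node)
    return ", ".join(walk(json.loads(structured)))

prompt = '{"genre": "90s R&B / Neo-Soul", "drums": {"kick": "sub-heavy kick"}}'
normalize_style_prompt(prompt)  # → "90s R&B / Neo-Soul, sub-heavy kick"
```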
The 36% leakage rate in JSON prompts — and the 4% residual in natural language — suggests insufficient isolation between style conditioning and lyric generation. V6 should explore stronger prompt boundary tokens, separate attention heads for style vs. lyric processing, or a two-stage generation architecture where style conditioning is finalized before lyric generation begins.
If structured prompts are a meaningful use case for Suno's power user base, the training data should reflect this. Adding a fine-tuning dataset of JSON/structured prompts paired with their natural language equivalents — and their corresponding high-quality outputs — would teach the model to extract musical intent from structured formats as effectively as from prose.
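One possible shape for such a training pair — the schema and field names are entirely hypothetical:

```python
# Hypothetical fine-tuning record: a structured prompt, its prose
# equivalent, and a pointer to a high-quality reference output.
record = {
    "structured_prompt": '{"genre": "90s R&B / Neo-Soul", "tempo": "92 BPM"}',
    "natural_prompt": "90s R&B Neo-Soul, warm Rhodes chords, 92 BPM",
    "reference_output": "song-id-placeholder",  # assumed identifier field
}
```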
Given the data showing JSON may outperform natural language in highly structured genres (electronic, drill), V6 could implement a genre-detection system that recommends an optimal prompt strategy based on musical context. Feel-driven genres would be nudged toward holistic natural language; structure-driven genres could leverage more precise parameter specification.
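A first cut at such routing could be a simple keyword heuristic. The genre lists below are illustrative assumptions for this sketch, not Suno's taxonomy:

```python
# Nudge feel-driven genres toward prose, structure-driven toward parameters.
STRUCTURE_DRIVEN = {"edm", "drill", "techno", "house"}
FEEL_DRIVEN = {"neo-soul", "r&b", "soul", "jazz", "gospel"}

def recommend_prompt_strategy(genre: str) -> str:
    g = genre.lower()
    if any(k in g for k in STRUCTURE_DRIVEN):
        return "structured"  # parameter-level specification may help
    if any(k in g for k in FEEL_DRIVEN):
        return "natural"     # holistic prose captures groove better
    return "natural"         # default, per the experiment's overall result

recommend_prompt_strategy("90s R&B / Neo-Soul")  # → "natural"
recommend_prompt_strategy("UK drill")            # → "structured"
```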
The deepest finding in this research is that musical feel is an emergent property — it cannot be fully specified component by component. V6's training data curation should prioritize examples that demonstrate groove, feel, and emotional coherence as holistic targets, not as aggregations of technical parameters. This is a training philosophy shift, not just an architecture change.
This experiment has real limitations. Evaluation was conducted by a single researcher without blind inter-rater reliability testing. Sample size (n=50) is sufficient for directional findings but not for statistical significance claims. The musical concept used (90s R&B / Neo-Soul) may not generalize to all genres.
What this research offers is not proof, but a structured, reproducible observation from a user with deep platform experience — the kind of signal that internal teams may not be able to generate themselves due to proximity bias and optimization tunnel vision.
The author has generated 220+ songs on Suno over three months, including daily high-volume generation (30+ songs/day) with systematic quality filtering. The patterns described here have been observed consistently across hundreds of additional informal tests beyond this formal experiment.
To the Suno team: This research was written in the spirit of genuine collaboration. The platform has enabled a new kind of musical creativity that didn't exist before. These findings are offered as an attempt to help push it further — from impressive to extraordinary. The path from V5 to V6 runs directly through understanding how users actually prompt, not just how they ideally should.