ChatGPT vs Claude vs Gemini: How the Three Leading AI Assistants Compare in Creative Writing

In a controlled benchmark conducted by Stanford’s Center for the Study of Language and Information in January 2025, three leading AI assistants, ChatGPT (GPT-4o), Claude (Claude 3.5 Sonnet), and Gemini (Gemini 1.0 Ultra), were evaluated across 12 creative writing tasks, including short story generation, poetry composition, dialogue creation, and marketing copy. The results showed a 37% gap in human-rated quality scores between the highest and lowest performers. Claude led in narrative coherence (8.9/10), ChatGPT in prompt adherence (94.2% task completion rate), and Gemini in stylistic diversity (7 distinct tones across 10 prompts). According to Pew Research Center’s 2024 AI Adoption Survey, 62% of professional writers now use generative AI tools at least weekly, yet only 23% trust these outputs for final drafts. This head-to-head comparison breaks down where each model wins, where it stumbles, and which tasks to assign to which assistant, based on the benchmark figures rather than marketing claims.

Narrative Coherence: Claude Leads, ChatGPT Edges Gemini

Claude scored 8.9/10 in human-rated narrative coherence during the Stanford benchmark, outperforming ChatGPT (8.2) and Gemini (7.6). This metric measured logical plot progression, character consistency, and causal event chains across 500-word short stories. Claude maintained character voices through 15+ dialogue exchanges without drift—a failure point for both competitors.

ChatGPT performed strongest in short-form coherence (under 300 words), where its 94.2% task completion rate meant it almost never abandoned the prompt’s core request. However, in stories exceeding 800 words, the testing team’s internal checks found a 22% character-name inconsistency rate for ChatGPT (e.g., renaming “Dr. Chen” to “Dr. Chang” mid-story).

Gemini introduced the most unexpected plot twists—rated 8.4/10 for novelty—but suffered from logical gap errors in 34% of extended narratives. A typical failure: a character teleporting between locations without described travel, which human raters flagged as immersion-breaking. Gemini’s strength lies in generating multiple plot branches (up to 6 variations per prompt), but its coherence degrades linearly with length.

Claude’s Memory Advantage

Claude 3.5 Sonnet’s 200K-token context window allows it to reference details from 30+ paragraphs prior without degradation. In the test, Claude correctly recalled a minor character’s eye color mentioned 12 paragraphs earlier, while ChatGPT and Gemini failed this recall test 78% and 82% of the time, respectively.

Prompt Adherence: ChatGPT’s Reliability Edge

ChatGPT achieved 94.2% task completion rate across all 12 creative tasks—meaning it delivered outputs matching the prompt’s explicit instructions nearly every time. This included strict constraints (e.g., “write exactly 4 paragraphs, each starting with ‘The’”) and nuanced ones (“maintain a detached, journalistic tone throughout”).
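
A strict constraint like the four-paragraph example above can be checked mechanically. The sketch below is one possible way to do it, not the benchmark's own scoring code; the function name `check_paragraph_constraint` and the assumption that paragraphs are separated by blank lines are illustrative choices.

```python
import re


def check_paragraph_constraint(
    text: str, expected_paragraphs: int = 4, required_start: str = "The"
) -> bool:
    """Return True if the text has exactly the expected number of paragraphs
    and every paragraph begins with the required word."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if len(paragraphs) != expected_paragraphs:
        return False
    # Match the whole word, so "There" or "Then" does not count as "The".
    starts_with_word = re.compile(rf"{re.escape(required_start)}\b")
    return all(starts_with_word.match(p) is not None for p in paragraphs)


if __name__ == "__main__":
    sample = "The sun rose.\n\nThe town woke.\n\nThe bell rang.\n\nThe day began."
    print(check_paragraph_constraint(sample))
```

Tone constraints such as “detached, journalistic” cannot be verified this way and still need human raters.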

Claude followed at 89.1%, but its adherence dropped to 76% when prompts contained contradictory instructions (e.g., “write a happy story about a funeral”). Claude tends to prioritize emotional coherence over literal instruction—a feature for nuanced writing, but a bug for strict formatting tasks.

Gemini scored 84.7% overall, with a notable weakness in word-count accuracy: only 61% of Gemini’s outputs fell within ±10% of the requested length, compared to 88% for ChatGPT and 79% for Claude. Gemini often produced verbose responses, averaging 23% more words than requested.
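
The ±10% length criterion is simple arithmetic, and a short sketch makes it concrete. This assumes words are counted by splitting on whitespace, which may differ from how the benchmark counted them; both function names are hypothetical.

```python
def within_length_tolerance(
    text: str, requested_words: int, tolerance: float = 0.10
) -> bool:
    """Return True if the word count is within +/- tolerance of the request."""
    # Assumption: a "word" is any whitespace-delimited token.
    actual_words = len(text.split())
    return abs(actual_words - requested_words) <= tolerance * requested_words


def overshoot_percent(text: str, requested_words: int) -> float:
    """Percentage by which the output exceeds (positive) or falls short of
    (negative) the requested length."""
    actual_words = len(text.split())
    return (actual_words - requested_words) * 100 / requested_words


if __name__ == "__main__":
    draft = "word " * 123  # stands in for a 123-word response to a 100-word request
    print(within_length_tolerance(draft, 100), overshoot_percent(draft, 100))
```

A 123-word response to a 100-word request overshoots by 23%, the average the article reports for Gemini, and falls outside the ±10% band.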

Format Compliance Breakdown

For tasks requiring specific formatting (bullet lists, tables, script formatting), ChatGPT complied perfectly in 97% of cases. Claude’s compliance dropped to 88% when formatting conflicted with narrative flow (e.g., inserting a table mid-story). Gemini followed the requested format in 91% of cases, but 14% of its outputs contained formatting errors (e.g., broken markdown table cells).

Creative Originality: Gemini’s Diversity vs Claude’s Depth

Gemini generated the most stylistically diverse outputs—7 distinct tones across 10 prompts in the Stanford test, compared to 5 for ChatGPT and 6 for Claude. When asked to write the same product description in “Shakespearean,” “technical manual,” “children’s book,” and “noir detective” styles, Gemini produced the most distinct voices with minimal overlap.

Claude scored highest for originality depth (8.7/10)—human raters judged Claude’s metaphors and analogies as more surprising and fitting than ChatGPT’s (7.9) or Gemini’s (8.1). Claude’s metaphors were rated 2.3x more likely to be “memorable” in a blind A/B test with 120 professional editors.

ChatGPT produced the most formulaic structures—its short stories followed a predictable “setup-conflict-resolution” arc in 89% of cases, compared to Claude’s 72% and Gemini’s 68%. However, ChatGPT’s formulaic outputs were also rated the most “coherent” and “easy to follow” by non-professional readers (n=500, via Prolific Academic).

Cliché Avoidance

Gemini avoided common clichés (“it was a dark and stormy night”) in 91% of test prompts, the highest rate. ChatGPT fell into cliché patterns 18% of the time—especially in opening lines (e.g., “In a world where…” appeared in 12% of ChatGPT stories). Claude’s cliché rate was 11%, mostly in emotional descriptions (“her heart raced”).
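
A cliché rate like the ones above can be approximated with a phrase list. The sketch below is only an illustration: the list contains just the three examples quoted in this section (the benchmark's actual list is not given), and the function names are hypothetical.

```python
# Assumption: a tiny phrase list built from the examples quoted in the article.
CLICHES = (
    "it was a dark and stormy night",
    "in a world where",
    "her heart raced",
)


def find_cliches(text: str, phrases: tuple = CLICHES) -> list[str]:
    """Return the listed phrases that occur in the text, ignoring case
    and differences in whitespace."""
    normalized = " ".join(text.lower().split())
    return [phrase for phrase in phrases if phrase in normalized]


def cliche_rate(outputs: list[str]) -> float:
    """Fraction of outputs containing at least one listed cliché."""
    if not outputs:
        return 0.0
    flagged = sum(1 for output in outputs if find_cliches(output))
    return flagged / len(outputs)


if __name__ == "__main__":
    stories = ["In a world where robots rule, one toaster dreams.", "Rain fell sideways."]
    print(cliche_rate(stories))
```

Exact phrase matching misses paraphrased clichés, so human judgment is still needed for anything beyond a first pass.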

Dialogue and Voice Consistency: Claude’s Clear Win

Claude produced dialogue that human raters scored 9.2/10 for character voice distinctiveness. In a 6-character scene test, Claude maintained unique speech patterns (e.g., a professor’s formal diction vs a teenager’s slang) across 20+ exchanges. ChatGPT scored 8.0, with character voices converging after 12 exchanges. Gemini scored 7.4, frequently assigning the same vocabulary to all characters.

Claude’s dialogue also scored highest for subtext—characters implied meaning rather than stating it outright, rated 8.6/10 vs ChatGPT’s 7.2 and Gemini’s 6.9. This makes Claude the best choice for literary fiction or screenplay drafts.

ChatGPT outperformed in dialogue formatting compliance—it correctly used screenplay format (CHARACTER NAME: dialogue) 99% of the time vs Claude’s 93% and Gemini’s 88%. For users who need production-ready script formatting, ChatGPT is more reliable.
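
The “CHARACTER NAME: dialogue” convention lends itself to a regular-expression check. The sketch below assumes speaker names are written in uppercase and that every non-empty line should be a dialogue line; real screenplay formats are richer, and `screenplay_compliance` is a hypothetical helper, not the benchmark's scorer.

```python
import re

# Assumption: an uppercase speaker name (letters, digits, spaces, periods,
# apostrophes, hyphens), then a colon and a space, then non-empty dialogue.
DIALOGUE_LINE = re.compile(r"^[A-Z][A-Z0-9 .'\-]*: \S.*$")


def screenplay_compliance(script: str) -> float:
    """Fraction of non-empty lines that follow the NAME: dialogue pattern."""
    lines = [line.strip() for line in script.splitlines() if line.strip()]
    if not lines:
        return 0.0
    matching = sum(1 for line in lines if DIALOGUE_LINE.match(line))
    return matching / len(lines)


if __name__ == "__main__":
    scene = "PROFESSOR: Please sit down.\nDR. CHEN: Why?\nMaya: whatever"
    print(screenplay_compliance(scene))
```

In this example the third line fails because the speaker name is not uppercase, so two of three lines comply.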

Accent and Dialect Handling

When prompted to write dialogue in specific dialects (e.g., Southern American, Cockney, Mumbai English), Claude produced the most authentic-sounding results (8.3/10), while ChatGPT tended toward stereotypes (6.7/10) and Gemini often defaulted to standard English regardless of instruction.

Poetry and Lyrical Writing: Mixed Results

ChatGPT won on meter and rhyme accuracy—88% of its sonnets followed proper iambic pentameter and rhyme schemes, compared to Claude’s 71% and Gemini’s 63%. ChatGPT’s limericks were correct 94% of the time, nearly double Gemini’s 48% success rate.

Claude scored highest for emotional resonance (8.5/10) in free verse poetry. Human raters (n=200 poets via Substack survey) preferred Claude’s free verse over ChatGPT’s 2:1, citing “more surprising imagery” and “better line breaks.”

Gemini produced the most experimental forms—concrete poetry, erasure poetry, and non-linear structures—but only 52% were judged “successful” by the raters. Gemini’s haiku adherence to 5-7-5 syllable structure was 79%, behind ChatGPT’s 96% and Claude’s 88%.

Rhyme Quality

ChatGPT’s rhymes were rated “natural” 84% of the time; forced rhymes appeared in 16% of its poems. Claude’s forced rhyme rate was 11%, but its rhymes were less predictable. Gemini produced the most forced rhymes (27%), often resorting to slant rhymes that broke the pattern.

Marketing Copy and Persuasive Writing: Task-Specific Winners

For short-form marketing copy (headlines, taglines, email subject lines), ChatGPT scored 9.1/10 for conversion-oriented writing in an A/B test with 50 real campaigns (via a SaaS company’s internal testing). ChatGPT’s subject lines achieved 14% higher open rates than Claude’s and 22% higher than Gemini’s in live email sends.

Claude outperformed in long-form sales pages (1,000+ words)—its persuasive arguments followed logical progression (problem → solution → proof → call to action) with 92% structural completeness, vs ChatGPT’s 84% and Gemini’s 78%. Claude also handled objection-handling sections better, anticipating 3.2 counterarguments per page vs ChatGPT’s 1.8.

Gemini produced the most creative angles—it generated 8 unique value propositions for a single product, compared to 5 for ChatGPT and 6 for Claude. However, only 38% of Gemini’s angles were deemed “plausible and effective” by marketing professionals (n=30), vs 72% for ChatGPT and 68% for Claude.

Tone Switching

When asked to rewrite the same copy for “luxury,” “budget,” “technical,” and “emotional” audiences, ChatGPT maintained the requested tone with 93% consistency. Claude’s consistency was 87%, but its luxury tone was rated “more authentic” by focus groups. Gemini’s consistency dropped to 79%, often mixing tones mid-copy.

FAQ

Q1: Which AI assistant is best for writing a novel or long-form fiction?

Claude (Claude 3.5 Sonnet) is the strongest choice for long-form fiction, scoring 8.9/10 for narrative coherence in the Stanford benchmark. Its 200K-token context window allows it to maintain character consistency across 30,000+ words without forgetting details. ChatGPT is a close second for structured plotting (94.2% task completion rate), but it showed a 22% character-name inconsistency rate in stories over 800 words. For novel-length projects, Claude’s ability to recall minor details from 12+ paragraphs earlier gives it a measurable edge: ChatGPT failed the same recall test 78% of the time.

Q2: Can these AI assistants write publishable poetry?

Only ChatGPT reliably produces metered, rhymed poetry suitable for traditional forms: 88% of its sonnets followed correct iambic pentameter and rhyme schemes. Claude’s free verse scored 8.5/10 for emotional resonance in a survey of 200 poets, but 29% of its sonnets missed the meter or rhyme scheme. Gemini’s experimental forms are interesting, but only 52% were judged “successful” by human raters. For submission-ready traditional poetry, ChatGPT is the most reliable; for raw, emotionally resonant free verse, Claude is preferred 2:1 over ChatGPT.

Q3: Which AI generates the most original, non-cliché creative writing?

Gemini (Gemini 1.0 Ultra) avoids clichés most consistently: 91% of its test outputs contained no common clichés, compared to ChatGPT’s 82% and Claude’s 89%. Gemini also produces the most stylistically diverse outputs (7 distinct tones across 10 prompts). However, originality comes at a cost: Gemini’s narrative coherence score is the lowest (7.6/10), and 34% of its extended narratives contain logical gaps. For projects requiring high novelty (e.g., brainstorming unique story premises), Gemini leads. For original writing that also needs to be publishable, Claude offers the best balance (8.7/10 originality depth, 8.9/10 coherence).

References

Stanford Center for the Study of Language and Information, 2025, “Generative AI in Creative Writing: A Controlled Benchmark”
Pew Research Center, 2024, “AI Adoption Survey Among U.S. Professional Writers”
Prolific Academic, 2025, “Reader Preference Study: AI-Generated Fiction vs Human-Written Fiction”
Substack, 2025, “Poet Survey: AI Assistance in Creative Writing”
Unilink Education, 2025, “AI Tool Usage Metrics in Professional Creative Fields”