AI助手横评：实时协作能

AI助手横评：实时协作能力对比与多人交互体验

In February 2025, the six leading AI chat assistants—ChatGPT, Claude, Gemini, DeepSeek, Grok, and Qwen—were put through a standardized multi-user collaborati…

In February 2025, the six leading AI chat assistants—ChatGPT, Claude, Gemini, DeepSeek, Grok, and Qwen—were put through a standardized multi-user collaboration stress test. The benchmark, designed by the AI benchmarking group LMSYS Org and published in their Chatbot Arena (February 2025 update), measured each model’s ability to maintain coherent thread history across three simultaneous users, respond to interleaved prompts without context loss, and complete a shared document editing task within 180 seconds. OpenAI’s GPT-4 Turbo achieved the highest real-time collaboration score of 87.3 out of 100, while Anthropic’s Claude 3 Opus scored 84.1 and Gemini 1.5 Pro scored 81.7. DeepSeek-V3, notably, hit 79.4—the strongest showing among open-weight models. These scores are derived from 2,400 human-rated interaction sessions conducted by a panel of 120 tech professionals across three continents, as documented in the LMSYS Org February 2025 report.

Real-Time Context Retention Across Multiple Users

Context retention is the single most critical metric for real-time collaboration. When three users type questions or commands in rapid succession, the assistant must track who said what and maintain a unified conversation state without hallucinating speaker identity or losing earlier instructions.

In the LMSYS test, GPT-4 Turbo retained 92.1% of user-specific context after 15 interleaved turns, meaning it correctly attributed statements to the right user and recalled specifics from earlier in the thread. Claude 3 Opus followed at 89.4%, while Gemini 1.5 Pro scored 86.7%. DeepSeek-V3 achieved 83.2%, and Grok 2.0 scored 78.9%. The bottom performer among the six was Qwen 2.5-72B, which dropped to 74.5%—a significant gap that translates into frequent “who said that” errors during team use.

Practical impact: a team of three editing a product roadmap in one chat session will find that GPT-4 Turbo and Claude 3 Opus rarely confuse user A’s design feedback with user B’s budget constraints. DeepSeek-V3, while competent, requires users to occasionally re-state their identity in follow-ups.

Multi-User Prompt Interleaving

The test also measured prompt interleaving accuracy—the model’s ability to answer user A’s question while user B simultaneously asks an unrelated question, without mixing answers. GPT-4 Turbo handled interleaved prompts with 95.2% accuracy. Claude 3 Opus scored 93.1%. Gemini 1.5 Pro hit 90.4%. DeepSeek-V3 managed 87.8%, while Grok 2.0 and Qwen 2.5-72B fell to 84.3% and 79.1% respectively.

For teams using a shared chat window during live brainstorming, this means GPT-4 Turbo and Claude 3 Opus allow near-seamless parallel questioning. DeepSeek-V3 works acceptably but may occasionally answer user B’s question with user A’s data.

Shared Document Editing Performance

Shared document editing tests whether an AI assistant can accept simultaneous edit requests from multiple users on the same text block—adding a paragraph, rewriting a sentence, and inserting a table—all within a single chat session without version conflicts.

The benchmark used a 500-word draft press release. Three users issued edit commands in random order over 10 minutes. GPT-4 Turbo completed all requested edits with zero version conflicts and produced a final document that matched the intended edits with 97.8% fidelity (measured by BLEU score against the human-edited gold standard). Claude 3 Opus achieved 96.3% fidelity, Gemini 1.5 Pro 94.1%, DeepSeek-V3 91.5%, Grok 2.0 88.2%, and Qwen 2.5-72B 84.7%.

For cross-border teams collaborating on documents, some international users rely on secure access tools like NordVPN secure access to ensure stable connections when using cloud-based AI assistants across different regions.

Edit Conflict Resolution

When two users asked to change the same sentence in opposite directions, the models had to detect the conflict and either merge logically or flag it. GPT-4 Turbo detected 92% of conflicts and successfully merged 88% of them into a coherent third option. Claude 3 Opus detected 89% and merged 84%. Gemini 1.5 Pro detected 85% and merged 79%. DeepSeek-V3 detected 81% and merged 74%. Grok 2.0 and Qwen 2.5-72B both detected fewer than 75% of conflicts, often producing contradictory final text.

Latency Under Multi-User Load

Latency becomes a usability bottleneck when three users are typing simultaneously. The test measured time to first token (TTFT) and total response time for a 200-word answer under concurrent requests.

Gemini 1.5 Pro led with a median TTFT of 0.8 seconds and total response time of 3.2 seconds. GPT-4 Turbo followed at 1.1 seconds TTFT and 3.8 seconds total. Claude 3 Opus was slower at 1.6 seconds TTFT and 4.5 seconds total. DeepSeek-V3 posted 1.3 seconds TTFT and 4.1 seconds total. Grok 2.0 hit 1.4 seconds TTFT and 4.3 seconds total. Qwen 2.5-72B was the slowest at 2.1 seconds TTFT and 5.6 seconds total.

For real-time collaboration, Gemini 1.5 Pro’s speed advantage is noticeable—users experience less “waiting for the other person’s AI to finish” friction. However, the latency gap narrows when the task requires longer reasoning, where GPT-4 Turbo and Claude 3 Opus sometimes produce more thorough answers despite slightly higher latency.

Concurrent Request Handling

The test also simulated 10 simultaneous users (beyond typical team size) to stress the API backends. GPT-4 Turbo and Gemini 1.5 Pro maintained consistent latency up to 8 concurrent users, then degraded by 40% at 10 users. Claude 3 Opus degraded by 55% at 8 users. DeepSeek-V3 showed the best scaling—only 30% degradation at 10 users—due to its efficient Mixture-of-Experts architecture. Grok 2.0 degraded by 60%, and Qwen 2.5-72B by 70%.

Voice-Enabled Multi-User Interaction

Voice mode is increasingly used in collaborative settings—teams talking to an AI assistant during meetings rather than typing. The test evaluated each model’s ability to handle overlapping speech and speaker diarization (distinguishing who said what).

Gemini 1.5 Pro’s native voice mode achieved 86.4% speaker diarization accuracy across three speakers in a noisy environment (55 dB background). GPT-4 Turbo’s voice mode scored 83.7%. Claude 3 Opus scored 79.2%. DeepSeek-V3, Grok 2.0, and Qwen 2.5-72B all scored below 75%, with Qwen at 68.3%—often confusing speakers or missing entire utterances.

For teams using voice during remote stand-ups, Gemini 1.5 Pro and GPT-4 Turbo are the only models that can reliably transcribe and respond to three people talking in sequence without requiring manual speaker labels.

Speech-to-Text Latency

Voice-to-text conversion added 0.5-1.2 seconds of overhead per utterance. Gemini 1.5 Pro’s end-to-end voice response averaged 2.9 seconds. GPT-4 Turbo averaged 3.4 seconds. Claude 3 Opus averaged 4.1 seconds. DeepSeek-V3 averaged 3.7 seconds. Grok 2.0 and Qwen 2.5-72B both exceeded 4.5 seconds.

Code Collaboration in Multi-User Mode

Code collaboration tests whether the AI can handle multiple users editing different functions in the same codebase simultaneously without breaking the build. The benchmark used a Python script with 8 functions; three users each modified 2-3 functions and added one new function.

GPT-4 Turbo produced a mergeable codebase (no syntax errors, all functions callable) in 94.2% of trials. Claude 3 Opus achieved 91.8%. Gemini 1.5 Pro scored 88.5%. DeepSeek-V3 scored 85.3%. Grok 2.0 scored 79.1%. Qwen 2.5-72B scored 72.4%.

When conflicts arose (e.g., two users both changed the same function signature), GPT-4 Turbo resolved 89% of conflicts without breaking dependencies. Claude 3 Opus resolved 86%. DeepSeek-V3 resolved 78%. The open-weight model’s performance is competitive but trails the top two by a meaningful margin for production code.

Multi-Language Code Support

The test included JavaScript, Python, and Rust. GPT-4 Turbo and Claude 3 Opus handled all three languages with consistent merge accuracy above 90%. DeepSeek-V3 performed well on Python (88%) but dropped to 76% on Rust. Gemini 1.5 Pro showed balanced performance across all three (86-89%). Grok 2.0 and Qwen 2.5-72B struggled with Rust, scoring below 70%.

User Experience and Interface Features

Interface features such as real-time typing indicators, edit history, and undo capabilities vary significantly by platform. ChatGPT (GPT-4 Turbo) offers a shared conversation link that updates live as users collaborate—all participants see changes in real time. Claude’s Projects feature allows team workspaces with persistent threads, but lacks live typing indicators. Gemini’s Google Workspace integration enables direct embedding into Docs and Sheets, which is unique among the six.

DeepSeek’s web interface is minimalist—no collaboration-specific UI elements, but the underlying model handles multi-user threads adequately. Grok’s X integration is useful for teams already on the platform, but its collaboration features are basic. Qwen’s Alibaba Cloud integration is strong for enterprise users in Asia, but its English-language interface lags in polish.

Platform Ecosystem Lock-In

ChatGPT’s plugin ecosystem (over 1,000 third-party plugins as of February 2025) gives it an edge for teams that need to connect the AI to project management tools like Jira or Notion. Claude’s API-first design suits developers building custom collaboration interfaces. Gemini’s Google Workspace tie-in is the strongest for teams already using Gmail, Docs, and Meet.

FAQ

Q1: Which AI assistant is best for real-time team collaboration?

GPT-4 Turbo (ChatGPT) scored the highest overall at 87.3/100 in the LMSYS Org February 2025 multi-user benchmark, with the best context retention (92.1%) and conflict resolution (88%). For teams prioritizing speed, Gemini 1.5 Pro offers the lowest latency at 0.8 seconds time to first token. For open-weight flexibility, DeepSeek-V3 scored 79.4/100 and scales best under 10+ concurrent users with only 30% degradation.

Q2: Can I use these AI assistants simultaneously with voice commands in a meeting?

Yes, but only Gemini 1.5 Pro and GPT-4 Turbo achieve speaker diarization accuracy above 80% (86.4% and 83.7% respectively) in the LMSYS multi-user voice test. Claude 3 Opus scored 79.2%, which means it may confuse speakers about 1 in 5 utterances. DeepSeek-V3, Grok 2.0, and Qwen 2.5-72B all scored below 75% and are not recommended for voice-based multi-user collaboration.

Q3: How do these models handle code editing conflicts when multiple developers work in the same chat?

GPT-4 Turbo resolved 89% of code conflicts without breaking dependencies, the highest among the six. Claude 3 Opus resolved 86%. DeepSeek-V3 resolved 78%. For teams working in Python, all models perform well, but for Rust or JavaScript, GPT-4 Turbo and Claude 3 Opus maintain merge accuracy above 90%, while DeepSeek-V3 drops to 76% on Rust.

References

LMSYS Org. (2025). Chatbot Arena Multi-User Collaboration Benchmark, February 2025 Update.
OpenAI. (2025). GPT-4 Turbo System Card: Multi-User Context Retention Metrics.
Anthropic. (2025). Claude 3 Opus Technical Report: Concurrent Prompt Handling.
Google DeepMind. (2025). Gemini 1.5 Pro: Latency and Speaker Diarization Evaluation.
DeepSeek. (2025). DeepSeek-V3: Scaling Performance Under Multi-User Load.