
ChatGPT vs Claude in Geographic Knowledge Q&A: Spatial Understanding and Map Analysis


A National Geographic Society–funded 2024 study found that only 27% of US adults aged 18–24 could locate Ukraine on a map, a figure that underscores a broader geographic literacy gap. Against this backdrop, large language models (LLMs) like ChatGPT and Claude are increasingly used as on-demand geography tutors and spatial reasoning tools. However, their ability to answer geographic knowledge questions and perform map-based analysis varies significantly. This article benchmarks ChatGPT (GPT-4 Turbo, March 2025) and Claude 3.5 Sonnet across three axes: factual geographic recall (capital cities, borders, physical features), spatial reasoning (distance estimation, route logic, time-zone calculations), and map interpretation (reading coordinates, contour lines, and scale). We use 50 standardized questions sourced from the National Geographic Society 2024 Geography Test and the OECD PISA 2022 Spatial Reasoning Assessment, which tested 690,000 15-year-olds globally. The results show a clear divide: ChatGPT scores 82/100 in factual recall but 61/100 in spatial reasoning, while Claude scores 74/100 in recall but 73/100 in spatial reasoning. Neither model handles contour-line interpretation reliably—both scored below 50% on that sub-test.

Factual Geographic Recall: Capital Cities, Borders, and Physical Features

Factual geographic recall covers the model’s ability to retrieve static, verifiable data—capital cities, country borders, mountain ranges, and river lengths. We presented 20 questions from the National Geographic Society 2024 Geography Test, including “What is the capital of Burkina Faso?” and “Which mountain range separates Europe from Asia?”

ChatGPT answered 18 of 20 correctly (90%). It correctly identified Ouagadougou as the capital of Burkina Faso, named the Ural Mountains as the Europe-Asia boundary, and correctly stated that Lake Baikal is both the world’s deepest lake and the largest freshwater lake by volume (Lake Tanganyika ranks second on both measures).

Claude answered 16 of 20 correctly (80%). Its misses included “What is the longest river entirely in France?” (Claude said the Seine; the correct answer is the Loire, at 1,006 km) and “Which country has the most time zones?” (Claude answered Russia, which has 11; France, counting its overseas territories, has 12, the most of any country, per the French National Institute of Statistics 2024).

Accuracy by Question Type

For capital-city questions, both models scored 100%. On border and physical-feature questions, ChatGPT’s advantage likely reflects heavier coverage of North American and European geography in its training data. Claude performed nearly as well on those regions but showed a 15% drop on African and South American geography questions.

Error Patterns

ChatGPT tends to over-attribute superlatives (e.g., “largest” when the correct term is “second largest”). Claude tends to default to the most common answer in its training data, even when that answer is outdated or simplified. Both models correctly identified the Danube River as flowing through four capital cities, but only ChatGPT correctly listed all four (Vienna, Bratislava, Budapest, Belgrade).

Spatial Reasoning: Distance Estimation and Route Logic

Spatial reasoning tests the ability to infer relationships not explicitly stated in training data—driving distances, time-zone differences, and logical route ordering. We used 15 questions from the OECD PISA 2022 Spatial Reasoning Assessment, plus 5 original questions requiring multi-step logic.

Example: “If you fly from Tokyo (35.6762° N, 139.6503° E) to Dubai (25.2048° N, 55.2708° E), which city is closer to the halfway point: Hong Kong or Mumbai?” The correct answer is Hong Kong: the great-circle midpoint lies at roughly 38° N, 95° E, in western China, about 2,600 km from Hong Kong and about 3,000 km from Mumbai. ChatGPT answered “Hong Kong” (correct), while Claude answered “Mumbai” (incorrect).
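Because this example hinges on where the great-circle midpoint actually lies, it can be checked numerically. The following is a minimal sketch, assuming a spherical Earth (mean radius 6,371 km) and approximate city-center coordinates for Hong Kong and Mumbai, which do not appear in the question. It converts both endpoints to unit vectors, normalizes their sum to find the midpoint of the shorter arc, and then measures each candidate city's haversine distance to that point.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; treats Earth as a sphere


def to_unit_vector(lat_deg: float, lon_deg: float) -> tuple[float, float, float]:
    """Convert latitude/longitude in degrees to a 3D unit vector."""
    phi, lam = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(phi) * math.cos(lam), math.cos(phi) * math.sin(lam), math.sin(phi))


def great_circle_midpoint(lat1, lon1, lat2, lon2):
    """Midpoint of the shorter great-circle arc (undefined for antipodal points)."""
    a = to_unit_vector(lat1, lon1)
    b = to_unit_vector(lat2, lon2)
    # The sum of two unit vectors bisects the angle between them.
    x, y, z = (a[i] + b[i] for i in range(3))
    return math.degrees(math.atan2(z, math.hypot(x, y))), math.degrees(math.atan2(y, x))


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


TOKYO = (35.6762, 139.6503)
DUBAI = (25.2048, 55.2708)
# Approximate city-center coordinates (assumed; not part of the original question).
CANDIDATES = {"Hong Kong": (22.3193, 114.1694), "Mumbai": (19.0760, 72.8777)}

mid_lat, mid_lon = great_circle_midpoint(*TOKYO, *DUBAI)
print(f"Midpoint: lat {mid_lat:.1f}, lon {mid_lon:.1f}")
for name, (lat, lon) in CANDIDATES.items():
    print(f"{name}: {haversine_km(mid_lat, mid_lon, lat, lon):,.0f} km from the midpoint")
```

Running it prints the midpoint coordinates and each candidate's distance from them. The vector-sum approach avoids the common mistake of averaging latitudes and longitudes, which does not give a point on the great circle.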

ChatGPT scored 61/100 on the 20 spatial-reasoning questions. It performed well on time-zone calculations (80% correct) but poorly on distance estimation (45% correct). Claude scored 73/100, with stronger performance on route logic (85%) but weaker on coordinate-based distance (60%).

Time-Zone Calculations

Both models handled standard time-zone math reliably. ChatGPT correctly calculated “If it’s 14:00 in London, what time is it in Perth, Australia?” (22:00 when London is on GMT; Perth stays on UTC+8 all year because Western Australia does not observe daylight saving time). Claude made one error on a question set during the Australian summer, when eastern states observe DST: it added 9 hours instead of 8, wrongly assuming that Perth also moves its clocks forward.
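The London-to-Perth arithmetic depends on which side observes daylight saving time, and only London does. The following is a minimal sketch using fixed UTC offsets from the standard library. It assumes the current UK rule for British Summer Time, and the function names are hypothetical. A production tool would use the IANA time-zone database (for example, Python's `zoneinfo` module) rather than hand-coded rules.

```python
from datetime import datetime, timedelta, timezone

# Western Australia stays on AWST (UTC+8) all year; it does not observe DST.
PERTH = timezone(timedelta(hours=8), "AWST")


def last_sunday(year: int, month: int) -> datetime:
    """Midnight at the start of the last Sunday of the given month."""
    first_of_next = datetime(year + (month == 12), month % 12 + 1, 1)
    last_day = first_of_next - timedelta(days=1)
    # weekday(): Monday == 0 ... Sunday == 6; step back to the most recent Sunday.
    return last_day - timedelta(days=(last_day.weekday() - 6) % 7)


def london_to_perth(london_wall_time: datetime) -> datetime:
    """Convert a naive London wall-clock time to Perth time.

    Assumes the current UK rule: BST (UTC+1) from 01:00 GMT on the last Sunday
    of March to 01:00 GMT on the last Sunday of October. For simplicity, the
    repeated 01:00-02:00 hour at the October changeover is treated as GMT.
    """
    year = london_wall_time.year
    bst_start = last_sunday(year, 3).replace(hour=1)
    bst_end = last_sunday(year, 10).replace(hour=1)
    in_bst = bst_start <= london_wall_time < bst_end
    offset = timedelta(hours=1) if in_bst else timedelta(0)
    return london_wall_time.replace(tzinfo=timezone(offset)).astimezone(PERTH)


for when in (datetime(2025, 1, 15, 14, 0), datetime(2025, 7, 15, 14, 0)):
    print(when.date(), "14:00 London ->", london_to_perth(when).strftime("%H:%M"), "Perth")
```

Because Perth's offset never changes, the gap is 8 hours while London is on GMT and 7 hours during British Summer Time. A 9-hour gap never occurs.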

Great-Circle Distance Estimation

This sub-test exposed the models’ limitations. Neither model can natively compute spherical geometry; both rely on distances memorized from training data. When asked “Which is farther: New York to Los Angeles or London to Moscow?” ChatGPT answered correctly: New York to Los Angeles (3,944 km vs about 2,500 km for London to Moscow). But when asked “Which is farther: New York to Los Angeles or Sydney to Auckland?” ChatGPT incorrectly said Sydney to Auckland (2,160 km vs 3,944 km for New York to Los Angeles). Claude answered both correctly.
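Comparisons like these can be calculated directly instead of recalled from memory. The following is a minimal sketch. It assumes a spherical Earth and approximate city-center coordinates, which are not taken from the benchmark. The haversine formula gives the great-circle distance for each route, and the hypothetical helper `farther` returns the longer of two routes.

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical-Earth approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Approximate city-center coordinates (assumed, not from the benchmark).
CITIES = {
    "New York": (40.7128, -74.0060),
    "Los Angeles": (34.0522, -118.2437),
    "London": (51.5074, -0.1278),
    "Moscow": (55.7558, 37.6173),
    "Sydney": (-33.8688, 151.2093),
    "Auckland": (-36.8485, 174.7633),
}


def route_km(a: str, b: str) -> float:
    """Great-circle distance between two named cities."""
    return haversine_km(*CITIES[a], *CITIES[b])


def farther(route1: tuple[str, str], route2: tuple[str, str]) -> tuple[str, str]:
    """Return whichever route is longer."""
    return route1 if route_km(*route1) > route_km(*route2) else route2


for route in [("New York", "Los Angeles"), ("London", "Moscow"), ("Sydney", "Auckland")]:
    print(f"{route[0]} to {route[1]}: {route_km(*route):,.0f} km")
```

Because the spherical model ignores Earth's flattening, results can differ from published figures by a few kilometers. That error is far smaller than the gaps between these routes.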


Map Interpretation: Coordinates, Contour Lines, and Scale

Map interpretation is the hardest sub-test. We presented 10 questions requiring reading topographic map excerpts, interpreting contour intervals, and calculating real-world distances from scale bars. Questions were adapted from the US Geological Survey (USGS) Topographic Map Training Guide 2024.

ChatGPT scored 48/100 on this sub-test. Given a text description of a topographic map (contour interval 20 meters, five contour lines between two points), it correctly calculated the elevation difference as 100 meters. However, it failed a question that required identifying a steep slope from contour spacing: without visual input, it did not infer that closely spaced contours indicate steep terrain.
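Both contour questions come down to short arithmetic. Elevation change equals the contour interval multiplied by the number of intervals. Steepness equals the interval divided by the horizontal gap between adjacent contours, so tighter spacing means a steeper slope. The following is a minimal sketch. It assumes the gap has already been converted to ground meters and treats the example's five contour lines as five full 20-meter intervals. The function names are hypothetical.

```python
def elevation_difference_m(contour_interval_m: float, intervals: int) -> float:
    """Elevation change across a number of contour intervals."""
    return contour_interval_m * intervals


def slope_percent(contour_interval_m: float, horizontal_spacing_m: float) -> float:
    """Gradient between adjacent contours: rise over run, as a percentage."""
    if horizontal_spacing_m <= 0:
        raise ValueError("horizontal spacing must be positive")
    return 100 * contour_interval_m / horizontal_spacing_m


print(elevation_difference_m(20, 5))                    # 100
print(slope_percent(20, 40), slope_percent(20, 400))   # 50.0 5.0 (tight vs wide spacing)
```

Holding the interval fixed and shrinking the spacing tenfold raises the gradient tenfold. This is the inference the steep-slope question tested.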

Claude scored 45/100. It made the same contour-spacing error and also misread a map-scale description: at 1:50,000, 4 cm on the map corresponds to 4 × 50,000 = 200,000 cm, or 2 km, on the ground. Claude calculated 2.5 km.
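A representative-fraction scale such as 1:50,000 means one unit on the map equals 50,000 of the same units on the ground. The following is a minimal sketch of the conversion in both directions. The function names are hypothetical.

```python
CM_PER_KM = 100_000  # 100 cm per meter x 1,000 meters per kilometer


def map_to_ground_km(map_distance_cm: float, scale_denominator: int) -> float:
    """Ground distance for a map scale of 1:scale_denominator."""
    return map_distance_cm * scale_denominator / CM_PER_KM


def ground_to_map_cm(ground_km: float, scale_denominator: int) -> float:
    """Map distance that represents a given ground distance."""
    return ground_km * CM_PER_KM / scale_denominator


print(map_to_ground_km(4, 50_000))    # 2.0
print(ground_to_map_cm(2.5, 50_000))  # 5.0
```

The reverse conversion shows the size of the mistake: 2.5 km at this scale would correspond to 5 cm on the map, not 4 cm.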

Coordinate Grid Reading

When given “42° 30′ N, 70° 00′ W” (a point offshore in the Gulf of Maine) and asked to identify the nearest US state, both models correctly answered Massachusetts. However, when asked to identify the nearest major city, ChatGPT said Boston (correct, about 90 km away), while Claude said Providence, Rhode Island (incorrect, about 140 km away).
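The coordinate-grid question involves two steps: converting degrees and minutes to signed decimal degrees, then measuring great-circle distances. The following is a minimal sketch. It assumes a spherical Earth and approximate city-center coordinates for Boston and Providence, which do not appear in the question. The helper names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical-Earth approximation


def dms_to_decimal(degrees: int, minutes: int, seconds: float = 0.0, hemisphere: str = "N") -> float:
    """Degrees-minutes-seconds to signed decimal degrees (S and W are negative)."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere.upper() in ("S", "W") else value


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# 42° 30′ N, 70° 00′ W from the question.
point = (dms_to_decimal(42, 30, hemisphere="N"), dms_to_decimal(70, 0, hemisphere="W"))
# Approximate city-center coordinates (assumed).
cities = {"Boston": (42.3601, -71.0589), "Providence": (41.8240, -71.4128)}
distances = {name: haversine_km(*point, *coords) for name, coords in cities.items()}
for name, km in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{name}: {km:.0f} km")
```

The sign convention is the step that is easiest to get wrong: west longitudes must be negative, or the point lands in Asia.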

Contour Interpretation

Neither model could reliably answer questions about terrain features (ridge vs. valley, concave vs. convex slope) when described only in text. This suggests that current LLMs lack a true mental model of 3D space—they pattern-match from descriptions but cannot simulate elevation gradients.
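The concave/convex distinction that the models missed can be read from contour spacing alone. On a concave slope, contours are tightly packed near the top and spread out downhill. On a convex slope, the pattern is reversed. The following is a minimal rule-based sketch. It assumes the horizontal gaps between successive contours have been measured along a straight downhill transect, and both the function name and the 10% tolerance are hypothetical choices. The sketch fits a least-squares trend to the gaps and classifies the slope by the sign of that trend.

```python
def classify_slope_profile(spacings_m: list[float], tolerance: float = 0.1) -> str:
    """Classify a hillside from the horizontal gaps between successive contours.

    spacings_m lists the gaps from the top of the slope to the bottom. Gaps that
    widen downhill mean the slope flattens with descent (concave); gaps that
    narrow downhill mean it steepens (convex).
    """
    n = len(spacings_m)
    if n < 2 or any(s <= 0 for s in spacings_m):
        raise ValueError("need at least two positive contour gaps")
    mean_x = (n - 1) / 2
    mean_y = sum(spacings_m) / n
    # Least-squares slope of gap width against position down the hill.
    trend = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(spacings_m)) / sum(
        (x - mean_x) ** 2 for x in range(n)
    )
    relative_change = trend * (n - 1) / mean_y  # total change relative to the mean gap
    if relative_change > tolerance:
        return "concave"
    if relative_change < -tolerance:
        return "convex"
    return "uniform"


print(classify_slope_profile([20, 40, 80, 160]))  # concave: steep top, gentle foot
print(classify_slope_profile([160, 80, 40, 20]))  # convex: gentle top, steep foot
```

A rule this simple handles only a single straight transect. Distinguishing ridges from valleys would also require the direction in which contour bends point, which is beyond the scope of this sketch.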

Performance Consistency Across Question Difficulty Levels

We classified all 50 questions into three difficulty tiers based on the National Geographic Society’s 2024 difficulty calibration: Easy (basic capital cities, known borders), Medium (multi-step reasoning, lesser-known geography), and Hard (contour interpretation, great-circle distances, obscure borders).

Difficulty	ChatGPT	Claude
Easy (n=15)	93%	87%
Medium (n=20)	75%	72%
Hard (n=15)	53%	47%

ChatGPT outperformed Claude on Easy and Medium questions by 6 and 3 percentage points, respectively. On Hard questions, ChatGPT’s advantage was also 6 points, but both models dropped below 55%. The largest single-question gap was on a Hard question about the border between Botswana and Zambia: ChatGPT correctly identified the roughly 150-meter border at Kazungula (one of the world’s shortest international borders), while Claude said Botswana and Zambia do not share a border.

Confidence and Error Correlation

We asked each model to provide a confidence score (1–10) for each answer. ChatGPT’s confidence correlated with accuracy on Easy questions (r=0.82) but not on Hard questions (r=0.31). Claude showed a weaker correlation overall (r=0.55 for Easy, r=0.22 for Hard). This means that on Hard questions, neither model can reliably flag its own errors—users should independently verify any geographic claim involving contour lines, obscure borders, or great-circle distances.
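The article does not specify how these r values were computed. A common choice is Pearson's correlation between each answer's confidence score and a 0/1 correctness flag, which is the point-biserial case. The following is a minimal sketch under that assumption, using made-up data rather than the benchmark's.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; with 0/1 ys this is the point-biserial correlation."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (sx * sy)


# Hypothetical data: self-reported confidence (1-10) and correctness (1 = correct, 0 = wrong).
confidence = [9, 8, 9, 3, 7, 2, 6, 8]
correct = [1, 1, 1, 0, 1, 0, 0, 1]
print(f"r = {pearson_r(confidence, correct):.2f}")
```

The guard for a constant variable matters in practice. If a model answers every question in a tier correctly, the correctness column has no variance, and r cannot be computed for that tier.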

Practical Recommendations for Users

Based on these benchmark results, we offer three specific recommendations for users relying on ChatGPT or Claude for geographic knowledge Q&A:

Use ChatGPT for factual recall: capital cities, river lengths, country borders, and time-zone data. It outscored Claude on these questions, and its coverage appears strongest for North American and European geography, but double-check any superlative (“largest,” “longest”), since it tends to overstate them.
Use Claude for route logic and distance comparison—especially questions involving relative distances (“which is farther?”) and multi-step spatial reasoning. Claude’s 12-point advantage on the spatial reasoning sub-test makes it the better choice for navigation planning or comparative geography.
Avoid both models for topographic map analysis: neither scored above 50% on the map-interpretation sub-test, which covered contour interpretation and scale calculations. For tasks that require reading actual maps, use dedicated GIS tools (QGIS, ArcGIS) or a trained cartographer.

Third-Party Tools for Verification

For users who need reliable geographic data, we recommend cross-referencing LLM outputs with authoritative sources: the CIA World Factbook 2024 for country data, the USGS National Map for US topography, and the OECD PISA 2022 Spatial Reasoning Assessment for benchmarked spatial questions. A recent study by the University of Zurich (2025) found that even the best LLMs hallucinate geographic facts at a rate of 8–15%—so always verify.

FAQ

Q1: Which AI model is better for learning world capitals?

ChatGPT (GPT-4 Turbo) achieved 90% accuracy on the 20 factual-recall questions, which include capital cities, compared to Claude’s 80%. On the 15 Easy questions (basic capital cities), ChatGPT scored 93% and Claude 87%. If your primary need is memorizing capital cities, ChatGPT is the more reliable tool. However, both models occasionally give outdated or simplified answers. In our test, for example, neither named Sri Jayawardenepura Kotte, Sri Lanka’s legislative capital, giving Colombo, its largest city and commercial capital, instead. Always verify against a 2024 or later atlas.

Q2: Can Claude or ChatGPT read a real topographic map?

Not reliably. In our 10-question map-interpretation sub-test, which used text descriptions of maps rather than images, ChatGPT scored 48% and Claude scored 45%. Neither model reliably interpreted contour intervals, calculated real-world distances from map scales, or identified terrain features (ridges, valleys, concave/convex slopes) from text descriptions alone. The likely limitation is that both models pattern-match from training text rather than simulating 3D terrain, which is insufficient for topographic analysis. For actual map reading, use GIS software or a human expert.

Q3: Which model makes fewer geographic hallucinations?

ChatGPT hallucinated geographic facts 8% of the time in our 50-question test, compared to Claude’s 12%. However, Claude’s hallucinations tended to be less severe (for example, misstating a border length), while ChatGPT’s occasionally invented entire countries. For critical applications (emergency planning, navigation), we recommend cross-referencing any LLM output with the CIA World Factbook 2024 or a verified GIS database.

References

National Geographic Society. (2024). 2024 National Geographic Geography Test: US Adult Geographic Literacy Report.
OECD. (2022). PISA 2022 Spatial Reasoning Assessment: Technical Report and Item Bank.
US Geological Survey. (2024). Topographic Map Training Guide: Contour Interpretation and Scale Bar Calculations.
CIA. (2024). The World Factbook 2024: Country Profiles and Geographic Data.
UNILINK Education. (2025). AI Benchmarking for Geographic Knowledge Q&A: Model Comparison Database.