
Gemini 2.5 Pro Joins Nexar’s AI Challenge — And Breaks Ahead

Last month, we asked: Can AI models truly grasp real-world driving incidents? Our benchmark, built on Nexar's sensor-enriched dashcam videos, drew an overwhelming response. Now it features Gemini 2.5 Pro, a "thinking model" from Google DeepMind. Read on for fresh insights into its performance.

Last month, we released our first benchmarking post to answer a key question: Can today’s AI models understand the complexity of real-world driving incidents?

Using a curated set of sensor-enriched dashcam videos selected from Nexar’s extensive dataset of over 60 million videos—including rare collisions, near-misses, and edge-case scenarios—we exposed where current models excel and where they fall short. The response was overwhelming: researchers, developers, and safety experts around the world reached out to tell us how much they appreciated the transparency and depth of our benchmark.

Today, we’re excited to announce that we’ve expanded our benchmark to include the new Gemini 2.5 Pro.

Released in March 2025, Gemini 2.5 Pro is Google DeepMind’s most advanced model to date. According to Google's description:

“Gemini 2.5 is a thinking model, designed to tackle increasingly complex problems. Our first 2.5 model, Gemini 2.5 Pro Experimental, leads common benchmarks by meaningful margins and showcases strong reasoning and code capabilities.”

In this blog, we evaluate the new Gemini model in the context of the AV domain.

Benchmark Performance 

Our updated benchmark includes Gemini 2.5 Pro alongside its predecessors, Gemini 2.0 Pro and Gemini 1.5 Pro. We re-ran the full suite of tests on Nexar's real-world video dataset and report the new model's performance below.

Our benchmark comprises six questions: four focus on the general domain (weather conditions, lighting conditions, location, and zone), and two are specific to the AV domain (main events and vehicle density). We measure an F1 score for each category, as illustrated below.

Figure: F1 score comparison across categories. Gemini 2.5 Pro shows sharp gains only in complex tasks.

Our main observations:

  • Gemini 2.5 Pro shows a dramatic jump in the main-event category, marking a breakthrough in real-world incident detection. This improvement builds on the significant jump already observed between Gemini 1.5 and 2.0. 
  • Slight gains were observed for the vehicle density and weather conditions categories. 
  • The results for the remaining general-domain questions (lighting conditions, location, and zone) remain roughly the same, possibly because they are approaching the performance ceiling.
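
For readers who want to reproduce this kind of per-category evaluation on their own predictions, here is a minimal sketch of F1 scoring. The category names and labels are hypothetical stand-ins, and we assume macro-averaged F1; the actual evaluation pipeline is Nexar's own.

```python
from sklearn.metrics import f1_score

# Hypothetical ground truth and model answers for two benchmark
# categories; the real benchmark runs over Nexar's annotated videos.
ground_truth = {
    "weather": ["clear", "rain", "clear", "snow"],
    "main_event": ["collision", "near_miss", "none", "collision"],
}
predictions = {
    "weather": ["clear", "rain", "clear", "rain"],
    "main_event": ["collision", "none", "none", "collision"],
}

for category, y_true in ground_truth.items():
    # Macro averaging weights every class equally, which matters
    # for rare classes such as collisions.
    score = f1_score(y_true, predictions[category], average="macro")
    print(f"{category}: F1 = {score:.2f}")
```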

Gemini Models F1 Score Over Time 

This chart shows only the Gemini Pro series to illustrate its rate of improvement. We are seeing consistent and significant improvement in the Gemini models' total F1 score within just a few months, as well as in their ability to answer domain-specific questions from the AV domain. The future looks very promising.

Note: While this graph focuses solely on Gemini models, our full leaderboard includes a wide range of leading Vision-Language Models (VLMs)—such as GPT-4o, Claude Sonnet, Qwen2, and LLaMA. You can view and compare them on the leaderboard, along with category-level breakdowns, as well as inference time and cost metrics.

Why Nexar’s Dataset Is the Key Differentiator

We believe Gemini 2.5 Pro’s success is not just about the model itself; it's also about the data it's tested on. Our dataset includes rare events that don't appear in standard training corpora, rich sensor data (video + GPS + IMU + audio), and human-annotated labels at frame-level granularity.

This makes it uniquely suited for stress-testing multimodal AI models in real-world driving scenarios. Our benchmark doesn’t just reward models for getting easy answers right—it challenges them to understand the road.
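
To make that concrete, a single sample in such a dataset might be structured roughly like the sketch below. The class and field names are our own illustration, not Nexar's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class DashcamSample:
    """One hypothetical sensor-enriched dashcam clip with frame-level labels."""
    video_path: str                                  # dashcam video clip
    audio_path: str                                  # synchronized audio track
    gps_track: list[tuple[float, float]]             # (lat, lon) per frame
    imu_readings: list[tuple[float, float, float]]   # accelerometer x/y/z per frame
    frame_labels: list[str] = field(default_factory=list)  # human annotation per frame
```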

Visit our leaderboard here: Nexar Driving Leaderboard

And if you're building the next great model, we're ready to test it.