Did I say "Think"? I must've meant "Reason". A little over a week ago Google announced their new Gemini 2.0 multimodal models for general use (and free of charge!). These are Google's first multimodal models available to the public. Previously you would have needed multiple separate models, each handling a different task, with their outputs combined for a final result. Now we can work with a single model that can process text, images, and speech as both inputs and outputs.
They've also quietly released Gemini 2.0 Flash Thinking as an experimental model. Not only is this model super fast, it also provides its "chain of thought" reasoning by default. The demo notebooks on their GitHub page were impressive but short. I wanted to put it to the test, so I decided to throw some brain teasers at it as well as evaluate it against the MATH Vision dataset found on HuggingFace. This dataset contains over 3000 visual math problems of varying complexity and from various categories, but I'll only be using a subset of about 300. I'll also be using the LLM-as-a-Judge technique to evaluate the new Gemini model at scale. This way I won't have to manually grade the responses... I gave up grading papers years ago! I'll try to keep this at an introductory level so that it's accessible to all.
You can find more information on the new Gemini 2.0 Flash Thinking model here:
https://ai.google.dev/gemini-api/docs/thinking-mode
More on the MATH Vision dataset here:
https://huggingface.co/datasets/MathLLMs/MathVision
Google's Demo Notebook here:
https://github.com/google-gemini/cookbook/blob/main/gemini-2/thinking.ipynb
And of course my GitHub Repo & Notebook here:
https://github.com/sam-tritto/evaluating-multimodal-gemini/blob/main/Multimodal_Thinking.ipynb
Python Environment
Google API Key
Loading Gemini in Jupyter
Ten Brain Teasers for Kids
The MATH Vision Dataset from HuggingFace
LLM as a Judge
Final Grade
Analyzing the Results
This is really the only section where I'm going to skip out on the details. If you are new to Python programming, you are going to need someplace to write code (an IDE). I like to work in VS Code and use Jupyter notebooks, but Google Colab might be a more beginner-friendly option. You'll also need package and environment management; I like to use uv, pip, and sometimes Anaconda. If this is all new to you then I'd suggest using Google Colab and Googling something like "How to pip install packages in Google Colab". That's all you really need to be able to do for this tutorial.
Google Colab: https://colab.research.google.com/
Before you can run any code you'll first need to run some inline terminal commands, as seen below, to install all the necessary packages for this tutorial. After running this cell all of the packages should be installed, and a good practice is to restart your kernel.
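If you're following along in your own notebook, the install cell looks something like this (the exact package list is my best guess at what's needed here; adjust it to whatever your notebook actually imports):

```python
# Install the packages used in this tutorial (package names are my assumption)
%pip install -q -U google-genai datasets pandas pillow
```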
The next thing you'll need is a Google Cloud Platform project ID and API key. If you don't have an account yet, go to their website and sign up for free. They might ask you to set up billing when you proceed to request an API key, but don't worry: this model is free to use and there will be no charges for the requests made (just don't switch to a non-free model).
Google Cloud Platform: https://console.cloud.google.com/
How to set up a new project and billing: https://developers.google.com/workspace/guides/create-project
Once you've set that up, navigate to APIs & Services, then Credentials, using the hamburger menu in the top left corner.
Click + Create Credentials
Then choose API Key
Choose the Vertex AI API
Then a pop up should appear with your API Key.
Copy this down for later.
You can either paste this directly into your notebook here:
Or store it in Google Colab's Secrets following this tutorial:
Now that that's all taken care of, you can insert your API key and choose your model. I've also left some other and older models commented out, but a reminder that only 2.0 is free right now.
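Here's a minimal sketch of that cell, assuming the google-genai SDK that Google's cookbook uses (the model IDs are the experimental names at the time of writing and may have changed since):

```python
from google import genai

GOOGLE_API_KEY = "YOUR_API_KEY"   # or pull it from Colab Secrets / an environment variable

# One client handles every request to the Gemini API
client = genai.Client(api_key=GOOGLE_API_KEY)

# The experimental Thinking model (our "Student" later on)
MODEL_ID = "gemini-2.0-flash-thinking-exp"

# Some other and older models, for reference (not all of these are on the free tier)
# MODEL_ID = "gemini-2.0-flash-exp"
# MODEL_ID = "gemini-1.5-pro"
# MODEL_ID = "gemini-1.5-flash"
```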
Google's demo notebook was super impressive. It showcases the Thinking model's understanding of Mathematical notation, Geometry problems, and a Brain Teaser. I wanted to poke at these categories a bit more starting with the Brain Teasers. The example in their notebook involves a solution where the user needs to understand that they must flip the 9 ball over to be a 6, which the Thinking model got correct. Very impressive!
I decided to test out 10 brain teasers for kids found on the following website, which came with answers so that I could manually evaluate the responses. Since these brain teasers have both visual and textual components, only a multimodal model would be appropriate. And since these are "tricky", it was a great opportunity to test out the chain of thought reasoning in the model's output.
Ten Brain Teasers for Kids:
https://www.teachstarter.com/us/blog/10-visual-brainteasers-kids-will-love/
Here are 3 examples from this set:
You can find the image files in my GitHub repo, so please go and download those. To feed this content into the LLM, a simple loop is all we need since there are only 10. I won't go into any depth here, but you can see that the generate_content() function accepts both the image and the text prompt ("contents") as separate items.
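As a rough sketch of that loop, reusing the client and MODEL_ID from the setup cell (the folder name, file pattern, and prompt wording here are placeholders, not the exact ones from my notebook):

```python
from pathlib import Path
from PIL import Image

prompt = "Solve this visual brain teaser. Explain your reasoning step by step, then give a final answer."

# Loop over the downloaded brain teaser images (folder and file names are placeholders)
for img_path in sorted(Path("brain_teasers").glob("*.png")):
    img = Image.open(img_path)

    response = client.models.generate_content(
        model=MODEL_ID,              # the Thinking model chosen above
        contents=[img, prompt],      # the image and the text prompt passed as separate items
    )

    print(f"--- {img_path.name} ---")
    print(response.text)             # includes the chain of thought reasoning by default
```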
The length of the thought process was directly proportional to the difficulty of the problem. Unfortunately, the Thinking model didn't do so great here. Maybe about 20-30% accuracy... but these are hard! Even for me. I found myself coming to some of the same answers as the Gemini model, even if we were both wrong.
Next I wanted to test out the math problem-solving capabilities. There is a great dataset with over 3000 visual math problems hosted on HuggingFace. The current leaderboard has Claude 3.5 Sonnet as the highest-ranking model with an accuracy of 37.99%, while human accuracy sits at 68.82%. These are the scores to beat.
MATH Vision dataset on HuggingFace:
https://huggingface.co/datasets/MathLLMs/MathVision
To load the dataset you can follow the Quick Start in their documentation. There are actually 2 splits: test (3040 questions) and testmini (304 questions). Since this is just a tutorial, I'm only going to evaluate the model against the testmini split.
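Their Quick Start boils down to a single load_dataset() call (the split and column names below are from the dataset card, so double-check them against the current version):

```python
from datasets import load_dataset

# Pull both splits from the HuggingFace Hub
math_vision = load_dataset("MathLLMs/MathVision")

testmini = math_vision["testmini"]   # the 304-question split used in this tutorial
print(testmini)                      # summary of columns and row count
print(testmini[0])                   # one record: question, options, answer, solution, level, subject, image
```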
You can see some sample records below. The images are contained in a zip file and will need to be downloaded; I've included them in my GitHub repo for you. You can see there are questions, answers, sometimes solutions, and sometimes answer options (like multiple choice), as well as various subjects and question levels.
Inspecting the images, you can get a sense of the types of questions that might be in the dataset. Perfect for evaluating a multimodal model.
Like I mentioned earlier, I'm done grading papers! It would be very difficult to evaluate all 3040 questions and answers, or even the smaller set of 304. In other words, we need a way to evaluate the model that will "scale". One such method is LLM as a Judge, where we use a second LLM to grade the responses of the first LLM. Since these are math problems, I'm going to call one the Student and one the Teacher.
To start this off we'll need to specify the file path of the images, set up our models, and instantiate some variables for rate limiting. I'm using the non-Thinking model as the evaluator, or Teacher, since I'm not interested in the reasoning from the Teacher. I will, however, feed the reasoning of the Student into the Teacher's prompt.
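A setup cell along these lines works (the image directory is a placeholder and the model IDs are the experimental names at the time of writing):

```python
import time
from pathlib import Path
from PIL import Image

IMAGE_DIR = Path("MathVision/images")            # wherever you unzipped the images (placeholder)

STUDENT_MODEL = "gemini-2.0-flash-thinking-exp"  # answers the questions and shows its reasoning
TEACHER_MODEL = "gemini-2.0-flash-exp"           # grades the answers; no reasoning needed

SLEEP_SECONDS = 70   # a little extra buffer over the 10 RPM free-tier limit
results = []         # one dict per question: response, grade, level, subject, ...
```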
Starting the loop, we can add the answer options (only if there are any), load the image, and feed the image and question to the Student LLM. I haven't tuned any of the parameters such as temperature, top-p, top-k, etc.; this is out of the box, but it works well untuned. Since these are factual questions, a low temperature would have made sense here.
Once the Student generates a response, we can grab its thoughts and answer separately for storage, but combine them for the Teacher. Similarly, the dataset sometimes provides a solution as well as an answer. We'd like to append the solution to the answer, but only if it exists, and then feed this to the Teacher.
Next is the Teacher prompt, where I feed it the actual answer as well as the Student's solution and answer. Notice I'm only feeding this model a text prompt, no image. I explicitly ask it to answer only as 'Correct' or 'Incorrect'. Again, I haven't tuned any of the parameters, but a low temperature would have made sense here. Once we grab the Teacher's grade we're done and can move on to the next question.
The last bit is for handling the rate limit. It's not super elegant, but it does the job. The Gemini 2.0 model has a free-tier rate limit of 10 RPM (requests per minute); you can find more info in the model card below. I've set the wait to 70 seconds for an extra buffer, and since there are 2 models, each question uses 2 requests, so we count them in twos.
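Putting those pieces together, the loop looks roughly like this. The prompt wording, column names, and rate-limit bookkeeping are simplified assumptions on my part; see my notebook for the exact version. In particular, I'm keeping the Student's thoughts and answer together as one string here for brevity rather than splitting them out for storage.

```python
request_count = 0   # the free tier allows 10 requests per minute

for record in testmini:
    # Build the question, appending the multiple choice options only when present
    question = record["question"]
    if record.get("options"):
        question += "\nOptions: " + ", ".join(record["options"])

    image = Image.open(IMAGE_DIR / record["image"])

    # Student: the Thinking model sees both the image and the question
    student = client.models.generate_content(
        model=STUDENT_MODEL,
        contents=[image, question],
    )
    student_response = student.text   # reasoning plus final answer

    # Ground truth: append the solution to the answer, but only if one exists
    truth = record["answer"]
    if record.get("solution"):
        truth += "\nSolution: " + record["solution"]

    # Teacher: a text-only prompt, graded strictly as 'Correct' or 'Incorrect'
    teacher_prompt = (
        "You are a math teacher grading a student's work.\n"
        f"Correct answer: {truth}\n"
        f"Student's response: {student_response}\n"
        "Reply with exactly one word: 'Correct' or 'Incorrect'."
    )
    teacher = client.models.generate_content(
        model=TEACHER_MODEL,
        contents=teacher_prompt,
    )

    results.append({
        "question": question,
        "student_response": student_response,
        "answer": record["answer"],
        "grade": teacher.text.strip(),
        "level": record["level"],
        "subject": record["subject"],
    })

    # Crude rate limiting: each question burns 2 requests, so pause every 10
    request_count += 2
    if request_count >= 10:
        time.sleep(SLEEP_SECONDS)
        request_count = 0
```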
The whole 304 records took about an hour, but if I were doing the full dataset this approach would be impractical. That's where batch processing comes in handy. The code for the batch processing approach isn't beginner-friendly and doesn't lend itself well to a tutorial... but it is quick! It involves Google Cloud Storage and JSONs, so if those are familiar to you, check that approach out.
Gemini 2.0 Flash Model Card:
https://ai.google.dev/gemini-api/docs/models/gemini#gemini-2.0-flash
Whoa!! Oh my gosh, we have potentially a new leader on the board with an estimated accuracy of... 51.32%. That's an incredible leap forward for Gemini 2.0, an increase of 13.33 percentage points (or roughly a 35% relative increase) over Claude 3.5 Sonnet. Of course we'll still need an official evaluation on the full dataset, but it's only 17.50 percentage points away from human accuracy. Very cool.
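For reference, that estimate is just the share of 'Correct' grades from the Teacher, computed over the results list built in the loop above:

```python
import pandas as pd

results_df = pd.DataFrame(results)

# Share of questions the Teacher graded as 'Correct'
accuracy = (results_df["grade"].str.strip().str.lower() == "correct").mean()
print(f"Estimated accuracy on testmini: {accuracy:.2%}")
```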
Here's a quick cell to look at all of the components from any question.
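Something along these lines, using the results_df built above (the index is arbitrary):

```python
i = 42   # pick any question index to inspect

row = results_df.iloc[i]
print("Question:        ", row["question"])
print("Student response:", row["student_response"])
print("Actual answer:   ", row["answer"])
print("Teacher's grade: ", row["grade"])

# And the image itself
Image.open(IMAGE_DIR / testmini[i]["image"])
```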
And maybe we can investigate which levels and subjects the model did well on, and which it struggled with.
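A quick groupby on the graded results does the trick:

```python
# Per-subject and per-level accuracy, sorted from best to worst
results_df["correct"] = results_df["grade"].str.strip().str.lower() == "correct"

print(results_df.groupby("subject")["correct"].mean().sort_values(ascending=False))
print(results_df.groupby("level")["correct"].mean().sort_values(ascending=False))
```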
Great at arithmetic and very poor at combinatorial geometry.
Interestingly it seems that the model did much better on the level 5's than the level 1's. I'm assuming that 5's are harder problems, but I'll have to take a deeper look at their documentation and research paper to know for sure.
Well that's it really. If you want to take this further you could try experimenting with the prompt, the parameters (temperature), the models, using the full dataset, or maybe even batch processing. You can also download your own images or brain teasers and try to trick the robot... have fun!