Evaluating Multimodal LLMs That Can... Think? with Gemini 2.0 Flash Thinking, the MATH Vision Dataset, and LLM as a Judge