Models Grade Report

This document presents the evaluation report for various large language models (LLMs) graded by Spice AI. Models are assessed on their basic capabilities, the quality of their tool calls, and the accuracy of their output when integrated with Spice.

For more details on how model grades are evaluated in Spice, refer to the model grading criteria.

| Model | Spice Grade | Model Provider | Context Window | Max Output Tokens | Chat Completion | Response Format | Tools | Recursive Tool Call | Reasoning | Streaming Response | Evaluation Date | Spice Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| o3-mini-2025-01-31 (Reasoning effort: high) | A | openai | 200k tokens | 100k tokens |  |  |  |  |  |  | 2025-01-31 | v1.0.2 |
| o3-mini-2025-01-31 (Reasoning effort: medium) | B | openai | 200k tokens | 100k tokens |  |  |  |  |  |  | 2025-01-31 | v1.0.2 |
| o3-mini-2025-01-31 (Reasoning effort: low) | C | openai | 200k tokens | 100k tokens |  |  |  |  |  |  | 2025-01-31 | v1.0.2 |
| o1-2024-12-17 (Reasoning effort: high) | A | openai | 200k tokens | 100k tokens |  |  |  |  |  |  | 2024-12-17 | v1.0.2 |
| o1-2024-12-17 (Reasoning effort: medium) | A | openai | 200k tokens | 100k tokens |  |  |  |  |  |  | 2024-12-17 | v1.0.2 |
| o1-2024-12-17 (Reasoning effort: low) | C | openai | 200k tokens | 100k tokens |  |  |  |  |  |  | 2024-12-17 | v1.0.2 |
| gpt-4o-2024-08-06 | B | openai | 128k tokens | 16384 tokens |  |  |  |  |  |  | 2024-08-06 | v1.0.2 |
| claude-3-5-sonnet-20241022 | C | anthropic | 200k tokens | 8192 tokens |  |  |  |  |  |  | 2024-10-22 | v1.0.2 |
| grok-2-1212 | Ungraded | xai | 128k tokens | Not Available |  |  |  |  |  |  | Not Available | v1.0.2 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | Ungraded | huggingface | 128k tokens | Not Available |  |  |  |  |  |  | Not Available | v1.0.2 |
| meta-llama/Llama-3.2-3B-Instruct | Ungraded | huggingface | 128k tokens | Not Available |  |  |  |  |  |  | Not Available | v1.0.2 |
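
As a usage illustration (not part of the grading methodology), a graded model can be used from Spice by declaring it in a spicepod. The sketch below assumes the standard `spicepod.yaml` `models` syntax, an OpenAI-hosted model, and an API key stored in Spice secrets; the spicepod name and secret name are placeholders to adapt to your deployment:

```yaml
# spicepod.yaml — a minimal sketch, assuming the standard Spice model syntax
version: v1beta1
kind: Spicepod
name: llm-demo  # placeholder spicepod name

models:
  # Reference a graded model by its provider-prefixed ID from the table above.
  - name: gpt-4o
    from: openai:gpt-4o-2024-08-06
    params:
      # Assumes the API key is stored as a Spice secret under this name.
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
```

With the spicepod running (`spice run`), the model can then be exercised interactively, for example with the `spice chat` CLI command.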