# Models Grade Report
This document presents the evaluation report for large language models (LLMs) graded by Spice AI. Models are assessed on their basic capabilities, the quality of their tool calls, and the accuracy of their output when integrated with Spice.
For more details on how model grades are evaluated in Spice, refer to the model grading criteria.
| Model | Spice Grade | Model Provider | Context Window | Max Output Tokens | Chat Completion | Response Format | Tools | Recursive Tool Call | Reasoning | Streaming Response | Evaluation Date | Spice Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| o3-mini-2025-01-31 (Reasoning effort: high) | A | openai | 200k tokens | 100k tokens | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 2025-01-31 | v1.0.2 |
| o3-mini-2025-01-31 (Reasoning effort: medium) | B | openai | 200k tokens | 100k tokens | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 2025-01-31 | v1.0.2 |
| o3-mini-2025-01-31 (Reasoning effort: low) | C | openai | 200k tokens | 100k tokens | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 2025-01-31 | v1.0.2 |
| o1-2024-12-17 (Reasoning effort: high) | A | openai | 200k tokens | 100k tokens | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 2024-12-17 | v1.0.2 |
| o1-2024-12-17 (Reasoning effort: medium) | A | openai | 200k tokens | 100k tokens | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 2024-12-17 | v1.0.2 |
| o1-2024-12-17 (Reasoning effort: low) | C | openai | 200k tokens | 100k tokens | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 2024-12-17 | v1.0.2 |
| gpt-4o-2024-08-06 | B | openai | 128k tokens | 16,384 tokens | ✅ | ✅ | ✅ | | | | | |
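
As context for how these grades are exercised, Spice serves registered models through an OpenAI-compatible chat completions API, so any model in the table can be queried with a standard client. The sketch below is illustrative only: it assumes a locally running Spice runtime on its default HTTP port (8090) and a model registered under the name `gpt4o` in the Spicepod; neither value is taken from this report.

```python
# Minimal sketch: querying a model served by Spice through its
# OpenAI-compatible chat completions endpoint.
# Assumptions (not from this report): the Spice runtime is running locally,
# its HTTP endpoint is http://localhost:8090, and a model named "gpt4o"
# has been registered in the Spicepod.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8090/v1",  # Spice's OpenAI-compatible API (assumed port)
    api_key="unused",  # placeholder; auth depends on your runtime configuration
)

response = client.chat.completions.create(
    model="gpt4o",  # the `name` the model was given in the Spicepod (assumed)
    messages=[{"role": "user", "content": "Summarize the taxi_trips dataset."}],
)
print(response.choices[0].message.content)
```

Because the endpoint mirrors OpenAI's API shape, the same client code works regardless of which graded model is configured; only the registered model name changes.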