  1. OpenAI Evals - GitHub

    If you think you have an interesting eval, please open a pull request with your contribution. OpenAI staff actively review these evals when considering improvements to upcoming models.
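
    For context, a minimal, hypothetical sketch of the kind of sample data such a contribution usually ships with, assuming the repository's JSONL convention of one record per line with "input" chat messages and an "ideal" reference answer (the file name and prompts below are made up for illustration):

        import json

        # Hypothetical samples for a simple exact-match eval: each record pairs
        # the prompt messages with the answer the model is expected to return.
        samples = [
            {
                "input": [
                    {"role": "system", "content": "Answer with a single word."},
                    {"role": "user", "content": "What is the capital of France?"},
                ],
                "ideal": "Paris",
            },
        ]

        with open("samples.jsonl", "w") as f:
            for sample in samples:
                f.write(json.dumps(sample) + "\n")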

  2. GitHub - microsoft/eval-recipes

    Sep 5, 2025 · Eval Recipes provides a benchmarking harness for evaluating AI agents on real-world tasks in isolated Docker containers. We have a few sample tasks ranging from creating …

  3. GitHub - openai/simple-evals

    Adding new rows to the table below with eval results, given new models and new system prompts. This repository is NOT intended as a replacement for https://github.com/openai/evals, which is …

  4. GitHub - lucemia/evals-llama: Evals is a framework for evaluating ...

    With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. To get started, we recommend that you follow these steps in order:

  5. EvalPlus - GitHub

    EvalPlus is a rigorous evaluation framework for LLM4Code, with: HumanEval+: 80x more tests than the original HumanEval! MBPP+: 35x more tests than the original MBPP! EvalPerf: …

  6. GitHub - microsoft/EvalsforAgentsInterop: Redefine agentic …

    Redefine agentic evaluation in enterprise AI. By simulating realistic scenarios, these evals enable rigorous, multi-dimensional assessment of LLM-powered productivity agents. - …

  7. GitHub - instructlab/eval: Python library for Evaluation

    Python library for Evaluation. Contribute to instructlab/eval development by creating an account on GitHub.

  8. GitHub - openeval/eval

    Eval is an open source platform designed to revolutionize the way companies assess technical candidates. By leveraging real-world open source issues, the platform provides a …

  9. VoQA/eval at main · AJN-AI/VoQA · GitHub

    VoQA Benchmark is a comprehensive benchmark for Visual-only Question Answering (VoQA) that provides a unified evaluation framework for both open-source and closed-source models. …

  10. chirag127/LLM-Code-Evaluation-Framework - GitHub

    Framework for evaluating Large Language Models (LLMs) trained on code, based on the HumanEval benchmark. Supports automated testing and performance analysis. - …
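
    Several of the results above (EvalPlus in result 5, the HumanEval-based framework in result 10) build on the HumanEval benchmark, which scores models with the pass@k metric. As a reference point only, here is a short sketch of the standard unbiased pass@k estimator from Chen et al. (2021); the function and variable names are illustrative and not taken from any of the listed repositories:

        from math import comb

        def pass_at_k(n: int, c: int, k: int) -> float:
            """Unbiased pass@k estimator (Chen et al., 2021).

            n: total completions sampled for a problem
            c: completions that pass the unit tests
            k: sample budget being scored
            """
            if n - c < k:
                # Fewer failing samples than k: every size-k draw contains a pass.
                return 1.0
            return 1.0 - comb(n - c, k) / comb(n, k)

        # Example: 200 samples per problem, 37 passing, scored at k = 10.
        print(pass_at_k(200, 37, 10))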