Benchmarks

A Benchmark runs an experiment on a BrainModel and compares the resulting measurements against primate data. A Metric performs this comparison and outputs a score of how well model and data match. The benchmark then normalizes this score by the data ceiling and returns the ceiled score.

class brainscore_core.benchmarks.Benchmark

Standard interface defining the methods that benchmarks expose.

__call__(candidate)

Evaluate a candidate BrainModel and return a Score denoting the alignment of the model to natural intelligence measurements under this benchmark. Typically this involves reproducing the experiment on the model and then using a Metric to compare the model measurements (e.g. neural/behavioral) against experimental recordings from biological subjects (e.g. primates). The output of this method is a normalized score between 0 and 1, where 0 means the model does not match the measurements at all and 1 means the model matches the measurements at ceiling level. For example, if the model obtains a raw score of 0.8 and the data ceiling is also 0.8, this method should output 1.

Parameters: candidate – a candidate model implementing the BrainModel interface. Benchmarks are agnostic of the exact implementation and only interact with models through the methods defined in the interface.
Returns: a Score of how aligned to natural intelligence the candidate model is under this benchmark. The score is normalized by this benchmark’s ceiling such that 1 means the model matches the data to ceiling level.
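The ceiling normalization described above can be illustrated with plain floats. This is a minimal sketch that assumes normalization is a simple division of the raw score by the ceiling; `ceil_raw_score` is a hypothetical helper, not the library's `ceil_score`, which operates on Score objects and may handle errors and metadata differently.

```python
def ceil_raw_score(raw_score: float, ceiling: float) -> float:
    """Normalize a raw metric score by the data ceiling.

    Hypothetical helper for illustration; assumes plain division.
    """
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    return raw_score / ceiling


# A model scoring 0.8 against a ceiling of 0.8 matches the data at ceiling level.
print(ceil_raw_score(0.8, 0.8))  # 1.0
# A model scoring 0.4 against the same ceiling reaches half of the ceiling.
print(ceil_raw_score(0.4, 0.8))  # 0.5
```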

property bibtex: str: bibtex string to build the reference. Should include an url to build a proper link.

property ceiling: Score

The ceiling of this benchmark. Scores need to be normalized by this value. Typically this represents the signal in the data and how well we expect the best possible model to score.

Returns: a Score object, denoting the ceiling of this benchmark. Typically has two values indexed by an aggregation coordinate: center for the averaged ceiling value, and error for the uncertainty.

property identifier: str

Unique identifier for this benchmark. Standard format is <data identifier>-<metric identifier>, e.g. dicarlo.Rajalingham2018-i2n.

Returns: a unique identifier for this benchmark
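As a small illustration of the `<data identifier>-<metric identifier>` format, the sketch below composes and splits such an identifier. Both helpers are hypothetical and not part of brainscore_core; the split assumes that the metric identifier itself contains no hyphen.

```python
from typing import Tuple


def make_identifier(data_identifier: str, metric_identifier: str) -> str:
    """Join the two parts in the standard format (hypothetical helper)."""
    return f"{data_identifier}-{metric_identifier}"


def split_identifier(identifier: str) -> Tuple[str, str]:
    """Split on the last hyphen; assumes the metric identifier has no hyphen."""
    data_identifier, _, metric_identifier = identifier.rpartition("-")
    return data_identifier, metric_identifier


print(make_identifier("dicarlo.Rajalingham2018", "i2n"))
# dicarlo.Rajalingham2018-i2n
print(split_identifier("dicarlo.Rajalingham2018-i2n"))
# ('dicarlo.Rajalingham2018', 'i2n')
```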

property parent: str

The identifier for the parent of this benchmark. Typically this is one of behavioral, neural, or engineering.

For benchmarks composed of sub-benchmarks, the sub-benchmark’s parent can also be an aggregate benchmark identifier; for instance the sub-benchmarks ‘Geirhos2021colour-error_consistency’ and ‘Geirhos2021contrast-error_consistency’ might have ‘Geirhos2021-error_consistency’ as parent.
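The parent relationship forms a small hierarchy that can be walked from a sub-benchmark up to its top-level category. The sketch below uses a plain dictionary; the `lineage` helper is hypothetical, and the assignment of the aggregate benchmark to `behavioral` is an assumption made only for illustration.

```python
from typing import Dict, List

# Maps each benchmark identifier to its parent identifier.
# The last entry (aggregate -> "behavioral") is an illustrative assumption.
PARENTS: Dict[str, str] = {
    "Geirhos2021colour-error_consistency": "Geirhos2021-error_consistency",
    "Geirhos2021contrast-error_consistency": "Geirhos2021-error_consistency",
    "Geirhos2021-error_consistency": "behavioral",
}


def lineage(identifier: str, parents: Dict[str, str]) -> List[str]:
    """Return the chain from a benchmark up to its top-level parent."""
    chain = [identifier]
    # Follow parent links until reaching an identifier that has no parent entry.
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


print(lineage("Geirhos2021colour-error_consistency", PARENTS))
# ['Geirhos2021colour-error_consistency', 'Geirhos2021-error_consistency', 'behavioral']
```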

preallocate_memory(candidate) → None

Optional pre-flight memory check run before __call__().

Domain-specific subclasses (e.g. NeuralBenchmark in brainscore_vision) can override this to probe the candidate model with a single stimulus and raise MemoryError early if the full benchmark run is estimated to exceed available RAM, rather than discovering the out-of-memory (OOM) failure 6+ hours into the run.

The default implementation is a no-op so that existing benchmarks that do not override this method are unaffected.

Parameters: candidate – the candidate model that will be passed to __call__().
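An override of this hook might look roughly like the sketch below. Everything here is a stand-in written for illustration: `MemoryCheckedBenchmark`, `FakeModel`, and `probe_bytes_per_stimulus` are hypothetical names, and the linear extrapolation from one stimulus to the full run is an assumption rather than the estimate used by any actual subclass.

```python
class FakeModel:
    """Hypothetical candidate that reports memory used for one stimulus."""

    def probe_bytes_per_stimulus(self) -> int:
        return 4_000_000  # pretend one stimulus needs about 4 MB


class MemoryCheckedBenchmark:
    """Hypothetical stand-in for a Benchmark subclass; not part of brainscore_core."""

    def __init__(self, num_stimuli: int, available_bytes: int):
        self.num_stimuli = num_stimuli
        self.available_bytes = available_bytes

    def preallocate_memory(self, candidate) -> None:
        # Probe the candidate with a single stimulus, then extrapolate linearly
        # to the full stimulus set (an assumption of this sketch).
        estimated = candidate.probe_bytes_per_stimulus() * self.num_stimuli
        if estimated > self.available_bytes:
            raise MemoryError(
                f"estimated {estimated} bytes exceeds "
                f"available {self.available_bytes} bytes"
            )


small = MemoryCheckedBenchmark(num_stimuli=10, available_bytes=10**9)
small.preallocate_memory(FakeModel())  # passes silently

large = MemoryCheckedBenchmark(num_stimuli=1000, available_bytes=10**9)
try:
    large.preallocate_memory(FakeModel())
except MemoryError as error:
    print(f"aborted early: {error}")
```

Failing in the hook keeps the expensive work in `__call__()` from starting at all when the estimate is already over budget.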

property version

Returns: a version number that is increased every time the model scores for this benchmark change (but not for code changes that do not change scores).

class brainscore_core.benchmarks.BenchmarkBase(identifier: str, ceiling: Score, version, parent: str, bibtex: str = None)

Helper class for implementing standard functions of the Benchmark interface.

__init__(identifier: str, ceiling: Score, version, parent: str, bibtex: str = None)

property bibtex: bibtex string to build the reference. Should include an url to build a proper link.

property ceiling

The ceiling of this benchmark. Scores need to be normalized by this value. Typically this represents the signal in the data and how well we expect the best possible model to score.

Returns: a Score object, denoting the ceiling of this benchmark. Typically has two values indexed by an aggregation coordinate: center for the averaged ceiling value, and error for the uncertainty.

property identifier

Unique identifier for this benchmark. Standard format is <data identifier>-<metric identifier>, e.g. dicarlo.Rajalingham2018-i2n.

Returns: a unique identifier for this benchmark

property parent

The identifier for the parent of this benchmark. Typically this is one of behavioral, neural, or engineering.

For benchmarks composed of sub-benchmarks, the sub-benchmark’s parent can also be an aggregate benchmark identifier; for instance the sub-benchmarks ‘Geirhos2021colour-error_consistency’ and ‘Geirhos2021contrast-error_consistency’ might have ‘Geirhos2021-error_consistency’ as parent.

property version

Returns: a version number that is increased every time the model scores for this benchmark change (but not for code changes that do not change scores).
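The pattern this helper class describes, storing the constructor arguments and exposing them as read-only properties so that subclasses only need to implement `__call__`, can be sketched as follows. `BenchmarkBaseSketch` and `ConstantBenchmark` are hypothetical stand-ins written for this example, not the library classes, and the ceiling is a plain float here rather than a Score. The identifier and the zero-argument callable used as a candidate are likewise made up.

```python
class BenchmarkBaseSketch:
    """Hypothetical stand-in mirroring the documented constructor signature."""

    def __init__(self, identifier, ceiling, version, parent, bibtex=None):
        self._identifier = identifier
        self._ceiling = ceiling
        self._version = version
        self._parent = parent
        self._bibtex = bibtex

    @property
    def identifier(self):
        return self._identifier

    @property
    def ceiling(self):
        return self._ceiling

    @property
    def version(self):
        return self._version

    @property
    def parent(self):
        return self._parent

    @property
    def bibtex(self):
        return self._bibtex


class ConstantBenchmark(BenchmarkBaseSketch):
    """Toy subclass: the candidate is a callable returning a raw score."""

    def __call__(self, candidate):
        raw_score = candidate()
        # Normalize by the ceiling; plain division is an assumption of this sketch.
        return raw_score / self.ceiling


benchmark = ConstantBenchmark(
    identifier="example.Data2024-metric",  # made-up identifier
    ceiling=0.8,
    version=1,
    parent="behavioral",
)
print(benchmark.identifier, benchmark(lambda: 0.4))
# example.Data2024-metric 0.5
```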

brainscore_core.benchmarks.ceil_score(score, ceiling)

brainscore_core.benchmarks.score_benchmark(benchmark: Benchmark, candidate) → Score

Score candidate on benchmark, running a pre-flight memory check first.

This is the recommended entry point; prefer it over calling benchmark(candidate) directly. It calls preallocate_memory() before the actual scoring so that domain-specific subclasses can raise MemoryError early when the run is estimated to exceed available RAM.

Parameters:

benchmark – a Benchmark instance to evaluate the candidate on.
candidate – the candidate model implementing the domain’s BrainModel interface.

Returns:

the Score returned by the benchmark.
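The call order described for this function can be sketched in a few lines. `score_benchmark_sketch` and `RecordingBenchmark` are hypothetical stand-ins that only illustrate the documented sequence (memory check first, then scoring); the actual implementation may do more.

```python
def score_benchmark_sketch(benchmark, candidate):
    """Run the pre-flight memory check, then score (illustrative only)."""
    benchmark.preallocate_memory(candidate)  # may raise MemoryError early
    return benchmark(candidate)


class RecordingBenchmark:
    """Hypothetical benchmark that records which methods were called."""

    def __init__(self):
        self.calls = []

    def preallocate_memory(self, candidate) -> None:
        self.calls.append("preallocate_memory")

    def __call__(self, candidate):
        self.calls.append("__call__")
        return 1.0  # placeholder score


recorder = RecordingBenchmark()
result = score_benchmark_sketch(recorder, candidate=object())
print(recorder.calls, result)
# ['preallocate_memory', '__call__'] 1.0
```

If `preallocate_memory()` raises, `__call__()` is never reached, which is the point of routing calls through a single function.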