capreolus.evaluator

Module Contents

Functions

judged(qrels, runs, n)
eval_runs(runs, qrels, metrics, relevance_level=1) Evaluate runs produced by a ranker (or loaded with Searcher.load_trec_run)
eval_runfile(runfile, qrels, metrics, relevance_level) Evaluate a single runfile produced by a ranker or reranker
search_best_run(runfile_dirs, benchmark, primary_metric, metrics=None, folds=None) Select the best runfile with respect to the specified metric
interpolate_runs(run1, run2, qids, alpha)
interpolated_eval(run1, run2, benchmark, primary_metric, metrics=None)
capreolus.evaluator.logger[source]
capreolus.evaluator.DEFAULT_METRICS = ['P_1', 'P_5', 'P_10', 'P_20', 'judged_10', 'judged_20', 'judged_200', 'map', 'ndcg_cut_5', 'ndcg_cut_10', 'ndcg_cut_20', 'recall_100', 'recall_1000', 'recip_rank'][source]
capreolus.evaluator.judged(qrels, runs, n)[source]
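
A hedged usage sketch: judged is undocumented here, and the example below assumes it reports the fraction of the top-n retrieved documents per query that have a judgement in qrels (averaged over queries); the dicts are toy values.

>>> from capreolus import evaluator
>>> qrels = {"q1": {"d1": 1, "d2": 0}}
>>> runs = {"q1": {"d1": 3.0, "d2": 2.0, "d3": 1.0}}
>>> evaluator.judged(qrels, runs, 2)   # assumed: average fraction of judged docs in the top 2
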
capreolus.evaluator.eval_runs(runs, qrels, metrics, relevance_level=1)[source]

Evaluate runs produced by a ranker (or loaded with Searcher.load_trec_run)

Parameters:
  • runs – dict in the format {qid: {docid: score}}
  • qrels – dict containing relevance judgements (e.g., benchmark.qrels)
  • metrics (str or list) – metrics to calculate (e.g., evaluator.DEFAULT_METRICS)
  • relevance_level (int) – relevance label threshold to use with non-graded metrics (equivalent to trec_eval’s --level_for_rel)
Returns:

a dict in the format {metric: score} containing the average score for each metric

Return type:

dict
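
For reference, a minimal invocation sketch; the runs and qrels values below are toy examples, not real data:

>>> from capreolus import evaluator
>>> runs = {"301": {"doc1": 12.3, "doc2": 10.1}}   # {qid: {docid: score}}
>>> qrels = {"301": {"doc1": 1, "doc2": 0}}        # {qid: {docid: relevance label}}
>>> scores = evaluator.eval_runs(runs, qrels, evaluator.DEFAULT_METRICS, relevance_level=1)
>>> scores["map"]                                  # average score for the "map" metric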

capreolus.evaluator.eval_runfile(runfile, qrels, metrics, relevance_level)[source]

Evaluate a single runfile produced by a ranker or reranker

Parameters:
  • runfile – str, path to the runfile
  • qrels – dict containing the judgements provided by the benchmark
  • metrics – str or list, metrics to calculate, e.g. ndcg_cut_20
Returns:

a dict in the format {metric: score}, containing the evaluation score for each specified metric

Return type:

dict
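
A short call sketch; the runfile path is a placeholder, and benchmark is assumed to be a Benchmark instance exposing qrels:

>>> from capreolus import evaluator
>>> scores = evaluator.eval_runfile("path/to/searcher.run", benchmark.qrels, ["map", "ndcg_cut_20"], 1)
>>> scores["map"]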

capreolus.evaluator.search_best_run(runfile_dirs, benchmark, primary_metric, metrics=None, folds=None)[source]

Select the best runfile with respect to the specified metric

Parameters:
  • runfile_dirs – the directories containing the runfiles to select from
  • benchmark – Benchmark class
  • primary_metric – str, metric used to select the best runfile, e.g. ndcg_cut_20
  • metrics – str or list, metrics to calculate on the best run
  • folds – str, the name of the fold to select from
Returns:

a dict storing specified metric score and path to the corresponding runfile
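
A hedged sketch of a possible call; the directory paths and the fold name "s1" are placeholders, and benchmark is assumed to be a Benchmark instance:

>>> from capreolus import evaluator
>>> best = evaluator.search_best_run(
...     ["results/run_dir_a", "results/run_dir_b"],   # placeholder runfile directories
...     benchmark, "ndcg_cut_20",
...     metrics=["map", "ndcg_cut_20"], folds="s1")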

capreolus.evaluator.interpolate_runs(run1, run2, qids, alpha)[source]
capreolus.evaluator.interpolated_eval(run1, run2, benchmark, primary_metric, metrics=None)[source]
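
Neither interpolation function is documented here. The sketch below shows only the call shapes and assumes interpolate_runs blends the two runs' document scores per query with weight alpha (any score normalization is an implementation detail), while interpolated_eval evaluates the interpolated runs against the benchmark:

>>> from capreolus import evaluator
>>> run1 = {"q1": {"d1": 10.0, "d2": 5.0}}
>>> run2 = {"q1": {"d1": 0.2, "d2": 0.9}}
>>> blended = evaluator.interpolate_runs(run1, run2, ["q1"], 0.5)         # assumed: {qid: {docid: interpolated score}}
>>> results = evaluator.interpolated_eval(run1, run2, benchmark, "map")   # benchmark: a Benchmark instance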