Getting Started

  • Requirements: Python 3.6+, a Python environment you can install packages in (e.g., virtualenv), and Java 11. See the detailed installation instructions for help with these.
  • Install: pip install capreolus


Results and cached objects are stored in ~/.capreolus/results/ and ~/.capreolus/cache/ by default. Set the CAPREOLUS_RESULTS and CAPREOLUS_CACHE environment variables to change these locations. For example: export CAPREOLUS_CACHE=/data/capreolus/cache
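
As a rough illustration of how an override like this usually behaves, the sketch below resolves a directory from an environment variable and falls back to a default. The helper name resolve_dir is hypothetical and this is not Capreolus's own code; only the variable name and the default path come from the text above.

```python
import os
from pathlib import Path


def resolve_dir(env_var, default, environ=None):
    """Return the directory named by env_var, or the default if it is unset.

    Hypothetical helper for illustration; not part of Capreolus.
    """
    environ = os.environ if environ is None else environ
    return Path(os.path.expanduser(environ.get(env_var, default)))


# Simulated environments, so the example does not depend on the real shell.
cache_default = resolve_dir("CAPREOLUS_CACHE", "~/.capreolus/cache", environ={})
cache_custom = resolve_dir(
    "CAPREOLUS_CACHE",
    "~/.capreolus/cache",
    environ={"CAPREOLUS_CACHE": "/data/capreolus/cache"},
)
print(cache_default)
print(cache_custom)
```

Passing the environment as an argument keeps the lookup easy to test without touching the real process environment.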

Command Line Interface

Use the RankTask pipeline to rank documents using a Searcher on an Anserini Index built on NFCorpus, which contains biomedical documents and queries. NFCorpus was published by Boteva et al. in ECIR 2016. This dataset is publicly available and will be automatically downloaded by Capreolus.

$ capreolus rank.searcheval with benchmark.name=nf searcher.name=BM25 \
    searcher.index.stemmer=porter searcher.b=0.8

The searcheval command instructs RankTask to query NFCorpus and evaluate the Searcher’s performance on NFCorpus’ test queries. The command will output results like this:

INFO - capreolus.task.rank.evaluate - rank: fold=s1 best run: ...searcher-BM25_b-0.8_fields-title_hits-1000_k1-0.9/task-rank_filter-False/searcher
INFO - capreolus.task.rank.evaluate - rank: cross-validated results when optimizing for 'map':
INFO - capreolus.task.rank.evaluate -             map: 0.1520
INFO - capreolus.task.rank.evaluate -     ndcg_cut_10: 0.3247

These results are comparable with the all titles results in the NFCorpus paper, which reports a MAP of 0.1251 for BM25 (Table 2). The Benchmark’s fields config option can be used to issue other types of queries as well (e.g., benchmark.fields=all_fields).


Capreolus Benchmarks define folds to use; each fold specifies training, dev (validation), and test queries. Tasks respect these folds when calculating metrics. NFCorpus defines a fixed test set, which corresponds to having a single fold in Capreolus. When running a benchmark that uses multiple folds with cross-validation, like robust04, the results reported are averaged over the benchmark’s test sets.
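
To make the fold structure concrete, here is a minimal sketch under assumed data. The fold names, query ids, and per-query scores are invented, and averaging the per-fold test means is only one plausible reading of "averaged over the benchmark's test sets"; it is not taken from the Capreolus source.

```python
from statistics import mean

# Hypothetical folds: each one names its training, dev (validation), and test queries.
folds = {
    "s1": {"train": ["q1", "q2"], "dev": ["q3"], "test": ["q4"]},
    "s2": {"train": ["q3", "q4"], "dev": ["q1"], "test": ["q2"]},
}

# Hypothetical per-query metric values (e.g., average precision) for a single run.
per_query_score = {"q1": 0.20, "q2": 0.10, "q3": 0.40, "q4": 0.30}


def averaged_test_score(folds, per_query_score):
    """Compute the mean score on each fold's test queries, then average over folds."""
    fold_scores = [
        mean(per_query_score[qid] for qid in fold["test"]) for fold in folds.values()
    ]
    return mean(fold_scores)


score = averaged_test_score(folds, per_query_score)
print(f"{score:.4f}")
```

With a single fold, as in NFCorpus, the outer average reduces to the score on that fold's test queries.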

Python API

Let’s run the same pipeline using the Python API:

from capreolus.task import RankTask

task = RankTask({'searcher': {'name': 'BM25', 'index': {'stemmer': 'porter'}, 'b': '0.8'},
                 'benchmark': {'name': 'nf'}})
task.searcheval()  # the same command the CLI invoked as rank.searcheval


The capreolus.parse_config_string convenience method can transform a config string, such as the one passed on the command line, into a config dict like the one shown above.
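
The mapping between dotted config strings and nested config dicts can be sketched in a few lines. The function parse_dotted_config below is a hypothetical stand-in written for this illustration; it is not the implementation of capreolus.parse_config_string, and it assumes whitespace-separated key=value pairs whose values stay strings.

```python
def parse_dotted_config(config_string):
    """Turn 'a.b=1 c=2' into {'a': {'b': '1'}, 'c': '2'} (values stay strings)."""
    config = {}
    for pair in config_string.split():
        key, _, value = pair.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Descend into the nested dict, creating levels as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


config = parse_dotted_config(
    "searcher.name=BM25 searcher.index.stemmer=porter searcher.b=0.8 benchmark.name=nf"
)
print(config)
```

The resulting dict has the same shape as the one passed to RankTask above, with each dot in a key introducing one level of nesting.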

Capreolus pipelines are composed of self-contained modules corresponding to "IR primitives", which can also be used individually. Each module declares any module dependencies it needs to perform its function. The pipeline itself, which can be viewed as a dependency graph, is represented by a Task module.

RankTask declares dependencies on a Searcher module and a Benchmark module, which it uses to query a document collection and to obtain experimental data (i.e., topics, relevance judgments, and folds), respectively. The Searcher depends on an Index. Both the Index and Benchmark depend on a Collection. In this example, RankTask requires that the same Collection be provided to both.
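
The dependency graph described above can be written down as plain data. The sketch below is an illustration only: the dict and the creation_order helper are hypothetical, and Capreolus may resolve dependencies differently. It shows an order in which the modules could be created so that each dependency exists before the module that needs it.

```python
# Each module maps to the modules it depends on, as described in the text.
DEPENDENCIES = {
    "rank": ["searcher", "benchmark"],
    "searcher": ["index"],
    "index": ["collection"],
    "benchmark": ["collection"],
    "collection": [],
}


def creation_order(graph, root):
    """Return modules in an order where every dependency precedes its dependents."""
    order = []
    visited = set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for dependency in graph[node]:
            visit(dependency)
        order.append(node)

    visit(root)
    return order


order = creation_order(DEPENDENCIES, "rank")
print(order)
```

The collection appears only once in the order, which mirrors the requirement that the Index and the Benchmark share the same Collection.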

from capreolus import Benchmark, Collection, Index, Searcher

Let’s construct this graph one module at a time.

# Previously, the Benchmark specified a dependency on the 'nf' collection specifically.
# Now we create this Collection directly.
>>> collection = Collection.create("nf")
>>> collection.get_path_and_types()
    ("/path/to/collection-nf/documents", "TrecCollection", "DefaultLuceneDocumentGenerator")
# Next, create a Benchmark and pass it the collection object directly.
# This is an alternative to automatically creating the collection as a dependency.
>>> benchmark = Benchmark.create("nf", provide={'collection': collection})
>>> benchmark.topics["title"]
    {'56': 'foods for glaucoma', '68': 'what is actually in chicken nuggets', ... }

Next, we can build an Index and a Searcher. These module types do more than just point to data.

>>> index = Index.create("anserini", {"stemmer": "porter"}, provide={"collection": collection})
>>> index.create_index()  # returns immediately if the index already exists
>>> index.get_df("foods")
>>> index.get_df("food")
# Next, a Searcher to query the index
>>> searcher = Searcher.create("BM25", {"hits": 3}, provide={"index": index})
>>> searcher.query("foods")
OrderedDict([('MED-1761', 1.213), 
             ('MED-2742', 1.212),
             ('MED-1046', 1.2058)])
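
The b and k1 values that appear in the run name earlier are BM25 parameters. For intuition about what they control, here is a sketch of one common textbook form of the BM25 term score. It is not necessarily the exact formula Anserini uses, and all statistics below are made up.

```python
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=0.9, b=0.8):
    """Score one query term against one document with a common BM25 variant."""
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly long documents are penalized (b=0 disables it).
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Made-up statistics for illustration only.
short_doc = bm25_term_score(tf=2, df=50, doc_len=100, avg_doc_len=200, num_docs=3000)
long_doc = bm25_term_score(tf=2, df=50, doc_len=400, avg_doc_len=200, num_docs=3000)
no_norm = bm25_term_score(tf=2, df=50, doc_len=400, avg_doc_len=200, num_docs=3000, b=0.0)
print(short_doc > long_doc)
```

With the same term frequency, the shorter document scores higher, and setting b to 0 removes the length penalty altogether.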

Finally, we can emulate the search that RankTask performed earlier by querying the Searcher with each topic:

>>> results = {}
>>> for qid, topic in benchmark.topics['title'].items():
...     results[qid] = searcher.query(topic)

To get metrics, we could then pass results to capreolus.evaluator.eval_runs():

capreolus.evaluator.eval_runs(runs, qrels, metrics, relevance_level=1)

Evaluate runs produced by a ranker (or loaded with Searcher.load_trec_run)

Parameters:

  • runs – dict in the format {qid: {docid: score}}
  • qrels – dict containing relevance judgements (e.g., benchmark.qrels)
  • metrics (str or list) – metrics to calculate (e.g., evaluator.DEFAULT_METRICS)
  • relevance_level (int) – relevance label threshold to use with non-graded metrics (equivalent to trec_eval’s --level_for_rel)

Returns: a dict in the format {metric: score} containing the average score for each metric

Return type: dict
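
To show how the runs and qrels formats fit together, the following sketch computes mean average precision from scratch. It is a simplified stand-in rather than the eval_runs implementation: tie-breaking and the handling of queries without relevant documents may differ from trec_eval, and the data is invented.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at each rank where a relevant doc appears."""
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranking, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs, qrels, relevance_level=1):
    """runs: {qid: {docid: score}}; qrels: {qid: {docid: label}}."""
    scores = []
    for qid, doc_scores in runs.items():
        # A document counts as relevant if its label reaches the threshold.
        relevant = {d for d, label in qrels.get(qid, {}).items() if label >= relevance_level}
        ranking = sorted(doc_scores, key=doc_scores.get, reverse=True)
        scores.append(average_precision(ranking, relevant))
    return sum(scores) / len(scores) if scores else 0.0


runs = {"q1": {"d1": 3.0, "d2": 2.0, "d3": 1.0}, "q2": {"d4": 1.5, "d5": 0.5}}
qrels = {"q1": {"d1": 1, "d3": 2}, "q2": {"d5": 1}}
result = {"map": mean_average_precision(runs, qrels)}
strict = {"map": mean_average_precision(runs, qrels, relevance_level=2)}
print(result, strict)
```

Raising relevance_level shrinks the set of documents treated as relevant, which is why the stricter score is lower here.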


Creating New Modules

Capreolus modules implement the Capreolus module API plus an API specific to the module type. The module API consists of four attributes:

  • module_type: a string indicating the module’s type, like “index” or “benchmark”
  • module_name: a string indicating the module’s name, like “anserini” or “nf”
  • config_spec: a list of ConfigOption objects. For example, [ConfigOption("stemmer", default_value="none", description="stemmer to use")]
  • dependencies: a list of Dependency objects. For example, [Dependency(key="collection", module="collection", name="nf")]

When the module is created, any dependencies that are not explicitly passed with provide={key: object} are automatically created. The module’s config options in config_spec and those of its dependencies are exposed as Capreolus configuration options.
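
The interplay between declared dependencies and provide can be mimicked with a toy model. Everything below (ToyDependency, ToyModule, and the registry dict) is hypothetical and much simpler than the real module API; it only illustrates that a provided object is used as-is, while a missing one is created from the declaration.

```python
class ToyDependency:
    """Declares that a module needs another module, stored under `key`."""

    def __init__(self, key, module, name):
        self.key, self.module, self.name = key, module, name


class ToyModule:
    module_type = None
    module_name = None
    dependencies = []
    registry = {}  # maps (module_type, module_name) to a class

    def __init__(self, provide=None):
        provide = provide or {}
        for dep in self.dependencies:
            if dep.key in provide:
                obj = provide[dep.key]  # use the object the caller passed in
            else:
                obj = self.registry[(dep.module, dep.name)]()  # create it automatically
            setattr(self, dep.key, obj)


class ToyCollection(ToyModule):
    module_type, module_name = "collection", "nf"


class ToyIndex(ToyModule):
    module_type, module_name = "index", "anserini"
    dependencies = [ToyDependency(key="collection", module="collection", name="nf")]


ToyModule.registry[("collection", "nf")] = ToyCollection

auto_index = ToyIndex()
shared = ToyCollection()
provided_index = ToyIndex(provide={"collection": shared})
print(provided_index.collection is shared)
```

Passing the same object to several modules through provide is how two modules can end up sharing one dependency instance.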

Task API

The Task module API specifies two additional class attributes: commands and default_command. These specify the functions that should serve as the Task’s entrypoints and the default entrypoint, respectively.
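
One way to picture how commands and default_command might be used is a dispatcher that looks up the entrypoint by name. The ToyTask class and its execute method are assumptions made for this sketch; the source does not describe how Capreolus dispatches commands internally.

```python
class ToyTask:
    commands = ["run", "describe"]
    default_command = "run"

    def run(self):
        return "ran the pipeline"

    def describe(self):
        return "a toy task"

    def execute(self, command=None):
        """Hypothetical dispatcher: look up the entrypoint by name and call it."""
        command = command or self.default_command
        if command not in self.commands:
            raise ValueError(f"unknown command: {command}")
        return getattr(self, command)()


task = ToyTask()
print(task.execute())
print(task.execute("describe"))
```

Listing the allowed names in commands keeps helper methods on the class from being exposed as entrypoints by accident.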

Let’s create a new task that mirrors the graph we constructed manually, except with two separate Searcher objects. We’ll save the results from both searchers and measure their effectiveness on the validation queries to decide which searcher to report test set results on.

from capreolus import evaluator, get_logger, Dependency, ConfigOption
from capreolus.task import Task

logger = get_logger(__name__)  # pylint: disable=invalid-name

@Task.register
class TutorialTask(Task):
    module_name = "tutorial"
    config_spec = [ConfigOption("optimize", "map", "metric to maximize on the validation set")]
    dependencies = [
        Dependency(
            key="benchmark", module="benchmark", name="nf", provide_this=True, provide_children=["collection"]
        ),
        Dependency(key="searcher1", module="searcher", name="BM25RM3"),
        Dependency(key="searcher2", module="searcher", name="SDM"),
    ]

    commands = ["run"] + Task.help_commands
    default_command = "run"

    def run(self):
        output_dir = self.get_results_path()

        # read the title queries from the chosen benchmark's topic file
        results1 = self.searcher1.query_from_file(self.benchmark.topic_file, output_dir / "searcher1")
        results2 = self.searcher2.query_from_file(self.benchmark.topic_file, output_dir / "searcher2")
        searcher_results = [results1, results2]

        # using the benchmark's folds, which each contain train/validation/test queries,
        # choose the best run in `output_dir` for the fold based on the validation queries
        # and return metrics calculated on the test queries
        best_results = evaluator.search_best_run(
            searcher_results, self.benchmark, primary_metric=self.config["optimize"], metrics=evaluator.DEFAULT_METRICS
        )

        # log the tail of each fold's best run path
        for fold, path in best_results["path"].items():
            shortpath = "..." + path[-20:]
            logger.info("fold=%s best run: %s", fold, shortpath)

        logger.info("cross-validated results when optimizing for '%s':", self.config["optimize"])
        for metric, score in sorted(best_results["score"].items()):
            logger.info("%15s: %0.4f", metric, score)

        return best_results
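
The selection step in the comments above (choose on validation queries, report on test queries) can be illustrated without Capreolus. The pick_best_run function and its data are hypothetical and far simpler than evaluator.search_best_run; they only show why the test queries play no part in choosing the run.

```python
def pick_best_run(run_scores, fold):
    """Pick the run with the best mean dev score and report its mean test score.

    run_scores: {run_name: {qid: score}}; fold: {"dev": [...], "test": [...]}
    """

    def mean_on(run_name, qids):
        return sum(run_scores[run_name][qid] for qid in qids) / len(qids)

    best = max(run_scores, key=lambda name: mean_on(name, fold["dev"]))
    return best, mean_on(best, fold["test"])


# Invented per-query scores for two runs.
run_scores = {
    "searcher1": {"q1": 0.2, "q2": 0.5, "q3": 0.1},
    "searcher2": {"q1": 0.4, "q2": 0.3, "q3": 0.6},
}
fold = {"dev": ["q1"], "test": ["q2", "q3"]}

best_run, test_score = pick_best_run(run_scores, fold)
print(best_run, round(test_score, 4))
```

Because the choice depends only on the dev queries, the reported test score is not biased by the selection.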


The module needs to be registered in order for Capreolus to find it. Registration happens when the @Task.register decorator is applied, so no additional steps are needed to use the new Task via the Python API. When using the Task via the CLI, the file containing it needs to be imported in order for the Task to be registered. This can be accomplished by placing the file inside the capreolus.task package (see capreolus.task.__path__). However, in this case, the above Task is already provided with Capreolus as tasks/
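
Decorator-based registration generally works as in the sketch below. The REGISTRY dict, the register function, and the lookup helper are hypothetical names chosen for this illustration and are not the Capreolus internals; the point is that the decorator runs when the class definition is executed, which is why the file has to be imported.

```python
REGISTRY = {}


def register(cls):
    """Class decorator that records the class under (module_type, module_name)."""
    REGISTRY[(cls.module_type, cls.module_name)] = cls
    return cls


@register
class ToyTutorialTask:
    module_type = "task"
    module_name = "tutorial"


def lookup(module_type, module_name):
    """Find a registered class by its type and name."""
    return REGISTRY[(module_type, module_name)]


found = lookup("task", "tutorial")
print(found.__name__)
```

A class defined in a file that is never imported would never reach the registry, so a lookup by name would fail.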

Let’s try running the Task we just declared via the Python API.

>>> task = TutorialTask()
>>> results = task.run()
>>> results['score']['map']
# looks like we got an improvement! which run was better?
>>> results['path']

Module APIs

Each module type’s base class describes the module API that should be implemented to create new modules of that type. Check out the API documentation to learn more: Benchmark, Collection, Extractor, Index, Reranker, Searcher, Task, Tokenizer, and Trainer.

Next Steps