

Package Contents



Collection: Base class for Collection modules. The purpose of a Collection is to describe a document collection's location and its format.

IRDCollection: Base class for collections supported by ir_datasets.

class capreolus.collection.Collection(config=None, provide=None, share_dependency_objects=False, build=True)

Bases: capreolus.ModuleBase

Base class for Collection modules. The purpose of a Collection is to describe a document collection’s location and its format.

Determining the document collection’s location on disk:
  • The path config option will be used if it contains a valid location.

  • If not, the _path attribute is used if it is valid. This is primarily used with DummyCollection.

  • If not, the class’s download_if_missing method will be called.

Modules should provide:
  • the collection_type and generator_type class attributes, corresponding to Anserini types

  • a download_if_missing method, if the collection is publicly available

  • a _validate_document_path method. See validate_document_path().
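As a rough illustration of these three requirements, the sketch below defines a toy module. Capreolus is deliberately not imported so the snippet runs on its own: `CollectionStandIn` and `ToyCollection` are hypothetical names, the `collection_type` value is an assumed example of an Anserini type, and the download step merely creates an empty directory.

```python
import os
import tempfile


class CollectionStandIn:
    """Hypothetical stand-in for capreolus.collection.Collection (not the real class)."""

    module_type = "collection"


class ToyCollection(CollectionStandIn):
    # Anserini type names; the collection_type value is an illustrative assumption.
    collection_type = "TrecCollection"
    generator_type = "DefaultLuceneDocumentGenerator"

    def download_if_missing(self):
        # A real module would fetch a publicly available corpus here.
        # This sketch only creates an empty directory and returns its path.
        return tempfile.mkdtemp(prefix="toy_collection_")

    def _validate_document_path(self, path):
        # Custom check beyond bare existence: the path must be a directory.
        return os.path.isdir(path)


toy = ToyCollection()
toy_path = toy.download_if_missing()
toy_is_valid = toy._validate_document_path(toy_path)
```

A real module would subclass `capreolus.collection.Collection` instead of the stand-in, and would typically verify in `_validate_document_path` that the expected corpus files are actually present.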

module_type = 'collection'
is_large_collection = False

get_path_and_types()

Returns a (path, collection_type, generator_type) tuple.


validate_document_path(path)

Attempt to validate the document collection at path.

By default, this will only check whether path exists. Subclasses should override _validate_document_path(path) with their own logic to perform more detailed checks.


Returns: True if the path is valid following the logic described above, or False if it is not.
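The split between the public method and the private hook might look roughly like the sketch below. This is an assumption about the general pattern, not a copy of the Capreolus implementation: `PathValidator` and `TxtOnlyValidator` are hypothetical classes, and the ".txt" rule is invented purely for illustration.

```python
import os
import tempfile


class PathValidator:
    """Hypothetical stand-in showing the public/private validation split."""

    def validate_document_path(self, path):
        # Public entry point: reject missing paths, then defer to the hook.
        if not (path and os.path.exists(path)):
            return False
        return self._validate_document_path(path)

    def _validate_document_path(self, path):
        # Default hook: no checks beyond the existence test above.
        return True


class TxtOnlyValidator(PathValidator):
    def _validate_document_path(self, path):
        # Stricter, illustrative check: a directory with at least one .txt file.
        return os.path.isdir(path) and any(
            name.endswith(".txt") for name in os.listdir(path)
        )


with tempfile.TemporaryDirectory() as empty_dir:
    default_ok = PathValidator().validate_document_path(empty_dir)
    strict_ok = TxtOnlyValidator().validate_document_path(empty_dir)
    missing_ok = PathValidator().validate_document_path(
        os.path.join(empty_dir, "missing")
    )
```

Keeping the existence test in the public method means a subclass only has to express what is specific to its own format.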


find_document_path()

Find the location of this collection’s documents (i.e., the raw document collection).

We first check the collection’s config for a path key. If found, self.validate_document_path checks whether the path is valid. Subclasses should override the private method self._validate_document_path with custom logic for checks that go beyond the existence of the directory.

If a valid path was not found, call download_if_missing. Subclasses should override this method if downloading the needed documents is possible.

If a valid document path cannot be found, an exception is raised.


Returns: the path to this collection’s raw documents.
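The lookup order can be sketched as a standalone function. Everything here is a hypothetical illustration rather than the library's code: `resolve_document_path`, its parameters, and `NoValidPathError` are invented names, and the source does not say which exception type is actually raised.

```python
import os
import tempfile


class NoValidPathError(Exception):
    """Hypothetical exception type; the source does not name the one raised."""


def resolve_document_path(config, preset_path, validate, download):
    # 1. Prefer the "path" config option when it validates.
    configured = config.get("path")
    if configured and validate(configured):
        return configured
    # 2. Otherwise fall back to a preset attribute (the _path case).
    if preset_path and validate(preset_path):
        return preset_path
    # 3. Otherwise try downloading, and validate the result as well.
    downloaded = download()
    if downloaded and validate(downloaded):
        return downloaded
    # No candidate validated, so fail loudly.
    raise NoValidPathError("no valid document path found")


with tempfile.TemporaryDirectory() as preset_dir:
    bad_config = {"path": os.path.join(preset_dir, "missing")}
    resolved = resolve_document_path(
        bad_config, preset_dir, os.path.exists, lambda: None
    )
    resolved_from_preset = resolved == preset_dir
```

In this run the configured path does not exist, so the function falls through to the preset directory. With no valid candidate at all, it raises instead of returning an unusable path.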


download_if_missing()

Download the collection and return its path. Subclasses should override this.

class capreolus.collection.IRDCollection(config=None, provide=None, share_dependency_objects=False, build=True)

Bases: Collection

Base class for collections supported by ir_datasets

property dataset
generator_type = 'DefaultLuceneDocumentGenerator'

download_if_missing()

Download the collection and return its path. Subclasses should override this.
