capreolus.collection

Package Contents

Classes

Collection: Base class for Collection modules. The purpose of a Collection is to describe a document collection’s location and its format.
class capreolus.collection.Collection(config=None, provide=None, share_dependency_objects=False, build=True)[source]

Bases: capreolus.ModuleBase

Base class for Collection modules. The purpose of a Collection is to describe a document collection’s location and its format.

Determining the document collection’s location on disk:
  • The path config option will be used if it contains a valid location.
  • If not, the _path attribute is used if it is valid. This is primarily used with DummyCollection.
  • If not, the class’ download_if_missing method will be called.
Modules should provide:
  • the collection_type and generator_type class attributes, corresponding to Anserini types
  • a download_if_missing method, if the collection is publicly available
  • a _validate_document_path method. See validate_document_path().
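A minimal sketch of what a module might provide, under stated assumptions. It does not import capreolus: CollectionSketch is a hypothetical stand-in for the base class, TinyTrecCollection is an invented name, the two type strings are example Anserini names, and the "at least one file" check is illustrative only.

```python
import os


class CollectionSketch:
    """Hypothetical stand-in for capreolus.collection.Collection (not the real class)."""

    module_type = "collection"
    collection_type = None
    generator_type = None

    def __init__(self, config=None):
        self.config = config or {}

    def download_if_missing(self):
        # Assumption: collections that are not publicly available leave this unimplemented.
        raise NotImplementedError("this collection cannot be downloaded")

    def _validate_document_path(self, path):
        # Default hook: no checks beyond what the caller already did.
        return True


class TinyTrecCollection(CollectionSketch):
    # Example Anserini type names; substitute the ones that match your data.
    collection_type = "TrecCollection"
    generator_type = "DefaultLuceneDocumentGenerator"

    def _validate_document_path(self, path):
        # Illustrative check: the path is a directory holding at least one regular file.
        return os.path.isdir(path) and any(
            os.path.isfile(os.path.join(path, name)) for name in os.listdir(path)
        )
```

This sketch omits download_if_missing in the subclass, which matches a collection that is not publicly available.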
module_type = collection[source]
is_large_collection = False[source]
get_path_and_types(self)[source]

Returns a (path, collection_type, generator_type) tuple.

validate_document_path(self, path)[source]

Attempt to validate the document collection at path.

By default, this will only check whether path exists. Subclasses should override _validate_document_path(path) with their own logic to perform more detailed checks.

Returns: True if the path is valid following the logic described above, or False if it is not
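The split between the public method and the private hook could look roughly like the sketch below. This is an illustration of the documented behaviour, not the library's source: ValidationSketch and NonEmptyDirSketch are hypothetical names, and the non-empty-directory rule is an invented example of a "more detailed check".

```python
import os


class ValidationSketch:
    """Hypothetical stand-in, not the real Collection class."""

    def validate_document_path(self, path):
        # Existence comes first; a missing or empty path is never valid.
        if not (path and os.path.exists(path)):
            return False
        # Delegate any further checks to the overridable hook.
        return self._validate_document_path(path)

    def _validate_document_path(self, path):
        # Default hook: accept any existing path.
        return True


class NonEmptyDirSketch(ValidationSketch):
    def _validate_document_path(self, path):
        # Invented stricter rule: the path must be a directory with at least one entry.
        return os.path.isdir(path) and len(os.listdir(path)) > 0
```

Keeping the existence check in the public method means a subclass hook only ever sees paths that exist.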
find_document_path(self)[source]

Find the location of this collection’s documents (i.e., the raw document collection).

We first check the collection’s config for a path key. If found, self.validate_document_path checks whether the path is valid. Subclasses should override the private method self._validate_document_path with custom logic for any checks beyond the existence of the directory.

If a valid path was not found, call download_if_missing. Subclasses should override this method if downloading the needed documents is possible.

If a valid document path cannot be found, an exception is raised.

Returns: path to this collection’s raw documents
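The lookup order (config path, then the _path attribute, then download_if_missing) can be sketched as follows. FindPathSketch is a hypothetical stand-in that does not import capreolus. The IOError type, the simplified existence-only validation, and returning the download result without re-validating it are assumptions made for this illustration.

```python
import os


class FindPathSketch:
    """Hypothetical stand-in, not the real Collection class."""

    _path = None  # optional built-in location, as used by dummy collections

    def __init__(self, config=None):
        self.config = config or {}

    def validate_document_path(self, path):
        # Simplified: only checks that the path is set and exists.
        return bool(path) and os.path.exists(path)

    def download_if_missing(self):
        # Assumed default: nothing to download, so signal failure.
        raise IOError("no valid document path found and no download is available")

    def find_document_path(self):
        # 1. Prefer the path given in the config.
        config_path = self.config.get("path")
        if self.validate_document_path(config_path):
            return config_path
        # 2. Fall back to the class-level _path attribute.
        if self.validate_document_path(self._path):
            return self._path
        # 3. Last resort: try to download; the default raises an exception.
        return self.download_if_missing()
```

A subclass for a publicly available collection would override download_if_missing to fetch the documents and return their location.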
download_if_missing(self)[source]

Download the collection and return its path. Subclasses should override this.