capreolus.collection

Package Contents

Classes

Collection

Base class for Collection modules. The purpose of a Collection is to describe a document collection’s location and its format.

IRDCollection

Base class for collections supported by ir_datasets.

class capreolus.collection.Collection(config=None, provide=None, share_dependency_objects=False, build=True)[source]

Bases: capreolus.ModuleBase

Base class for Collection modules. The purpose of a Collection is to describe a document collection’s location and its format.

Determining the document collection’s location on disk:
  • The path config option will be used if it contains a valid location.

  • If not, the _path attribute is used if it is valid. This is primarily used with DummyCollection.

  • If not, the class’ download_if_missing method will be called.
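The lookup order above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not Capreolus code: `resolve_document_path` and its parameters are hypothetical names, and a plain existence check stands in for validation.

```python
import os
import tempfile


def resolve_document_path(config_path, attr_path, download_if_missing):
    """Return the first usable path, following the three-step order above.

    Hypothetical helper: os.path.exists stands in for real validation.
    """
    # 1. Prefer the `path` config option when it points at a valid location.
    if config_path and os.path.exists(config_path):
        return config_path
    # 2. Otherwise fall back to the `_path` attribute.
    if attr_path and os.path.exists(attr_path):
        return attr_path
    # 3. Otherwise ask the collection to download itself.
    downloaded = download_if_missing()
    if downloaded and os.path.exists(downloaded):
        return downloaded
    raise IOError("no valid document path found")


with tempfile.TemporaryDirectory() as fallback_dir:
    # The configured path does not exist, so the `_path` stand-in is chosen.
    missing = os.path.join(fallback_dir, "missing")
    chosen = resolve_document_path(missing, fallback_dir, lambda: None)
    used_fallback = chosen == fallback_dir
```

The download step runs last because it is the most expensive option; a user-supplied path always takes precedence.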

Modules should provide:
  • the collection_type and generator_type class attributes, corresponding to Anserini types

  • a download_if_missing method, if the collection is publicly available

  • a _validate_document_path method. See validate_document_path().
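As a rough sketch of the first two items, the example below uses a hypothetical stand-in base class rather than importing Capreolus, so it runs on its own. `SketchCollection`, `ToyCollection`, the `cache_dir` argument, and the base-class behaviour of raising `IOError` are all assumptions made for illustration; only the attribute and method names come from the list above.

```python
import os
import tempfile


class SketchCollection:
    """Hypothetical stand-in for the base class; not Capreolus code."""

    collection_type = None
    generator_type = None

    def download_if_missing(self):
        # Assumed default: collections that cannot be downloaded fail here.
        raise IOError("this collection cannot be downloaded automatically")

    def get_path_and_types(self):
        # Simplified: the real lookup also consults config and `_path`.
        return self.download_if_missing(), self.collection_type, self.generator_type


class ToyCollection(SketchCollection):
    # Anserini type names, chosen for illustration.
    collection_type = "TrecCollection"
    generator_type = "DefaultLuceneDocumentGenerator"

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir

    def download_if_missing(self):
        # "Download" by writing a toy document once; later calls reuse it.
        target = os.path.join(self.cache_dir, "toy_docs")
        if not os.path.exists(target):
            os.makedirs(target)
            with open(os.path.join(target, "doc1.txt"), "w") as f:
                f.write("a toy document")
        return target


with tempfile.TemporaryDirectory() as cache_dir:
    toy = ToyCollection(cache_dir)
    path, collection_type, generator_type = toy.get_path_and_types()
    downloaded_ok = os.path.isdir(path)
    second_call_same = toy.download_if_missing() == path
```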

module_type = collection[source]
is_large_collection = False[source]
get_path_and_types(self)[source]

Returns a (path, collection_type, generator_type) tuple.

validate_document_path(self, path)[source]

Attempt to validate the document collection at path.

By default, this will only check whether path exists. Subclasses should override _validate_document_path(path) with their own logic to perform more detailed checks.

Returns

True if the path is valid following the logic described above, or False if it is not
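The split between the public method and the private hook might look like the following. This is a self-contained sketch with hypothetical classes (`SketchCollection`, `NonEmptyDirCollection`), not the library's implementation; the non-empty-directory rule is an invented example of a "more detailed check".

```python
import os
import tempfile


class SketchCollection:
    """Hypothetical stand-in for the base class; not Capreolus code."""

    def validate_document_path(self, path):
        # Public entry point: reject empty or missing paths, then delegate.
        if not path or not os.path.exists(path):
            return False
        return self._validate_document_path(path)

    def _validate_document_path(self, path):
        # Default hook: existence (already checked above) is sufficient.
        return True


class NonEmptyDirCollection(SketchCollection):
    def _validate_document_path(self, path):
        # Invented stricter rule: a directory containing at least one entry.
        return os.path.isdir(path) and len(os.listdir(path)) > 0


with tempfile.TemporaryDirectory() as tmp:
    strict = NonEmptyDirCollection()
    empty_dir_valid = strict.validate_document_path(tmp)  # exists but is empty
    missing_valid = strict.validate_document_path(os.path.join(tmp, "missing"))
    default_valid = SketchCollection().validate_document_path(tmp)
```

Keeping the existence check in the public method means subclasses only have to express what is specific to their format.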

find_document_path(self)[source]

Find the location of this collection’s documents (i.e., the raw document collection).

We first check the collection’s config for a path key. If found, self.validate_document_path checks whether the path is valid. Subclasses should override the private method self._validate_document_path with custom logic for checks that go beyond whether the path exists.

If a valid path was not found, call download_if_missing. Subclasses should override this method if downloading the needed documents is possible.

If a valid document path cannot be found, an exception is raised.

Returns

path to this collection’s raw documents

download_if_missing(self)[source]

Download the collection and return its path. Subclasses should override this.

class capreolus.collection.IRDCollection(config=None, provide=None, share_dependency_objects=False, build=True)[source]

Bases: capreolus.collection.Collection

Base class for collections supported by ir_datasets.

ird_dataset_name[source]
generator_type = DefaultLuceneDocumentGenerator[source]
property dataset(self)[source]
download_if_missing(self)[source]

Download the collection and return its path. Subclasses should override this.

doc_as_json(self, doc)[source]