capreolus.collection.msmarco

Module Contents

Classes

MSMarcoMixin

MSMarcoPsg

Base class for Collection modules. The purpose of a Collection is to describe a document collection's location and its format.

Attributes

logger

capreolus.collection.msmarco.logger[source]
class capreolus.collection.msmarco.MSMarcoMixin[source]
static download_and_extract(url, tmp_dir, expected_fns=None)[source]
class capreolus.collection.msmarco.MSMarcoPsg(config=None, provide=None, share_dependency_objects=False, build=True)[source]

Bases: capreolus.collection.Collection, MSMarcoMixin

Base class for Collection modules. The purpose of a Collection is to describe a document collection’s location and its format.

Determining the document collection’s location on disk:
  • The path config option will be used if it contains a valid location.

  • If not, the _path attribute is used if it is valid. This is primarily used with DummyCollection.

  • If not, the class’s download_if_missing method will be called.
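
The lookup order above can be sketched roughly as follows. This is a minimal illustration, not the capreolus implementation: SketchCollection, get_path, and _is_valid are hypothetical names, and "valid" is assumed to mean nothing more than "an existing directory".

```python
import os
import tempfile


class SketchCollection:
    """Hypothetical stand-in; not the capreolus Collection class."""

    _path = None  # optional class-level fallback location

    def __init__(self, config=None):
        self.config = config or {}

    @staticmethod
    def _is_valid(path):
        # Assumption: "valid" simply means an existing directory.
        return bool(path) and os.path.isdir(path)

    def download_if_missing(self):
        # Placeholder: a real module would fetch the data and return its path.
        return os.path.join(tempfile.gettempdir(), "sketch_collection_download")

    def get_path(self):
        # 1. Prefer the `path` config option when it points to a valid location.
        configured = self.config.get("path")
        if self._is_valid(configured):
            return configured
        # 2. Otherwise fall back to the `_path` attribute when it is valid.
        if self._is_valid(self._path):
            return self._path
        # 3. Otherwise let the class download the collection.
        return self.download_if_missing()
```

In this sketch a missing or invalid path option falls through silently to the next step; whether the real class warns or raises in that case is not stated here.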

Modules should provide:
  • the collection_type and generator_type class attributes, corresponding to Anserini types

  • a download_if_missing method, if the collection is publicly available

  • a _validate_document_path method. See validate_document_path().
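
As a rough illustration of these three requirements, the sketch below defines a toy module. ToyCollection, its cache_dir argument, and the docs.txt file name are all hypothetical; a real module would subclass capreolus.collection.Collection rather than stand alone, and the check performed by _validate_document_path here (a single expected file exists) is only an assumption about what such a method might verify.

```python
import os
import tempfile


class ToyCollection:
    """Hypothetical example; a real module would subclass Collection."""

    module_name = "toy"
    # Class attributes naming the Anserini collection and generator types
    # (values copied from MSMarcoPsg on this page).
    collection_type = "TrecCollection"
    generator_type = "DefaultLuceneDocumentGenerator"

    def __init__(self, cache_dir=None):
        # Assumption: documents are stored under a local cache directory.
        self.cache_dir = cache_dir or os.path.join(
            tempfile.gettempdir(), "toy_collection"
        )

    def download_if_missing(self):
        """Create the collection if absent and return its path."""
        doc_file = os.path.join(self.cache_dir, "docs.txt")
        if not os.path.exists(doc_file):
            os.makedirs(self.cache_dir, exist_ok=True)
            # Stand-in for a real download: write one TREC-style document.
            with open(doc_file, "w", encoding="utf-8") as f:
                f.write("<DOC>\n<DOCNO>d1</DOCNO>\n<TEXT>hello</TEXT>\n</DOC>\n")
        return self.cache_dir

    def _validate_document_path(self, path):
        # Assumed check: the expected document file is present under `path`.
        return os.path.isfile(os.path.join(path, "docs.txt"))
```

Calling download_if_missing() twice is harmless in this sketch, because the file is only written when it does not already exist.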

module_name = 'msmarcopsg'[source]
collection_type = 'TrecCollection'[source]
generator_type = 'DefaultLuceneDocumentGenerator'[source]
download_raw()[source]
download_if_missing()[source]

Download the collection and return its path. Subclasses should override this.