Collection(config=None, provide=None, share_dependency_objects=False, build=True)
Base class for Collection modules. The purpose of a Collection is to describe a document collection’s location and its format.
- Determining the document collection’s location on disk:
  - The path config option will be used if it contains a valid location.
  - If not, the _path attribute is used if it is valid. This is primarily used with
  - If not, the class’s download_if_missing method will be called.
- Modules should provide:
  - collection_type and generator_type class attributes, corresponding to Anserini types
  - a download_if_missing method, if the collection is publicly available
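As an illustration of the list above, here is a minimal sketch of what a module might declare. CollectionSketch is a hypothetical stand-in for the base class so that the example runs on its own. The class name MyTrecCollection, the attribute values, and the default behaviour of raising IOError are assumptions made for illustration, not the library's code.

```python
class CollectionSketch:
    """Hypothetical stand-in for the Collection base class."""

    collection_type = None
    generator_type = None

    def download_if_missing(self):
        # Assumed default: a collection that is not publicly available
        # cannot be fetched automatically.
        raise IOError("this collection cannot be downloaded automatically")


class MyTrecCollection(CollectionSketch):
    # Example values naming Anserini classes; choose the ones that match
    # the format of your own documents.
    collection_type = "TrecCollection"
    generator_type = "DefaultLuceneDocumentGenerator"
```

A publicly available collection would additionally override download_if_missing so that it fetches the documents and returns their path.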
Returns a (path, collection_type, generator_type) tuple.
Attempt to validate the document collection at path.
By default, this will only check whether path exists. Subclasses should override _validate_document_path(path) with their own logic to perform more detailed checks.
Returns: True if the path is valid following the logic described above, or False if it is not
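The override pattern described above could look roughly like the following sketch. BaseSketch is a hypothetical stand-in for the base class, and the split between a public wrapper that checks existence and a private hook that adds stricter checks is an assumption based on the description, not a copy of the library's implementation.

```python
import os


class BaseSketch:
    """Hypothetical stand-in for the base class."""

    def validate_document_path(self, path):
        # Public entry point: reject missing paths, then defer to the hook.
        if not (path and os.path.exists(path)):
            return False
        return self._validate_document_path(path)

    def _validate_document_path(self, path):
        # Default behaviour: existence alone is sufficient.
        return True


class NonEmptyDirCollection(BaseSketch):
    def _validate_document_path(self, path):
        # Stricter check: the path must be a directory with at least one entry.
        return os.path.isdir(path) and len(os.listdir(path)) > 0
```

A real subclass might instead look for specific file names or verify a checksum inside the hook.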
Find the location of this collection’s documents (i.e., the raw document collection).
We first check the collection’s config for a path key. If found, self.validate_document_path checks whether the path is valid. Subclasses should override the private method self._validate_document_path with custom logic for checks that go beyond whether the directory exists.
If a valid path was not found, call download_if_missing. Subclasses should override this method if downloading the needed documents is possible.
If a valid document path cannot be found, an exception is thrown.
Returns: path to this collection’s raw documents
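The lookup order described above can be sketched as a standalone function. The name resolve_document_path, its parameters, and the choice of ValueError are hypothetical; the text only states that an exception is raised, without naming its type.

```python
def resolve_document_path(config, validate, download_if_missing):
    """Return a valid document path, trying the config before downloading.

    config: dict that may contain a "path" key
    validate: callable(path) -> bool
    download_if_missing: callable() -> path or None
    """
    # Step 1: prefer the path from the config, if it validates.
    path = config.get("path")
    if path and validate(path):
        return path

    # Step 2: fall back to downloading the collection.
    path = download_if_missing()
    if path and validate(path):
        return path

    # Step 3: nothing usable was found (exception type is an assumption).
    raise ValueError("no valid document path found")
```

Passing the validator and downloader in as callables keeps the sketch independent of any particular class. In the library these would be methods on the collection itself.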
Download the collection and return its path. Subclasses should override this.
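One possible shape for such an override is sketched below, assuming the documents are cached in a local directory. CachedCollectionSketch, its cache_dir and fetch parameters, and the "documents" subdirectory are all hypothetical. The actual retrieval is left to a caller-supplied function so that the example needs no network access.

```python
import os


class CachedCollectionSketch:
    """Hypothetical collection whose documents can be fetched on demand."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch  # callable(target_dir) that writes the documents

    def download_if_missing(self):
        target = os.path.join(self.cache_dir, "documents")
        # Skip the download when a previous call already populated the cache.
        if os.path.isdir(target) and os.listdir(target):
            return target
        os.makedirs(target, exist_ok=True)
        self.fetch(target)
        return target
```

Checking the cache first makes repeated calls cheap, which matters if the method runs every time the config path is missing or invalid.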