capreolus.collection
Submodules
Package Contents
Classes
Collection | Base class for Collection modules. The purpose of a Collection is to describe a document collection’s location and its format.
class capreolus.collection.Collection(config=None, provide=None, share_dependency_objects=False, build=True)
Bases: capreolus.ModuleBase
Base class for Collection modules. The purpose of a Collection is to describe a document collection’s location and its format.
- Determining the document collection’s location on disk:
  - The path config option will be used if it contains a valid location.
  - If not, the _path attribute is used if it is valid. This is primarily used with DummyCollection.
  - If not, the class’ download_if_missing method will be called.
- Modules should provide:
  - the collection_type and generator_type class attributes, corresponding to Anserini types
  - a download_if_missing method, if the collection is publicly available
  - a _validate_document_path method. See validate_document_path().
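The requirements above can be illustrated with a small, self-contained sketch. It does not import Capreolus: CollectionStandIn is a hypothetical stand-in for the real base class, ToyCollection is an invented module, and the two type strings are assumed example values. The delegation from validate_document_path to _validate_document_path is inferred from the descriptions on this page rather than taken from the library's source.

```python
import os


class CollectionStandIn:
    """Hypothetical stand-in for capreolus.collection.Collection (not the real class)."""

    collection_type = None
    generator_type = None

    def validate_document_path(self, path):
        # Public entry point: delegate to the overridable private hook.
        return self._validate_document_path(path)

    def _validate_document_path(self, path):
        # Default check described in the docs: the path only has to exist.
        return os.path.exists(path)


class ToyCollection(CollectionStandIn):
    # Assumed example values; pick the Anserini types that match your data.
    collection_type = "TrecCollection"
    generator_type = "DefaultLuceneDocumentGenerator"

    def download_if_missing(self):
        # Hypothetical: a publicly available collection would fetch its
        # documents here. This toy collection has nothing to download.
        raise NotImplementedError("ToyCollection is not publicly available")

    def _validate_document_path(self, path):
        # Stricter than the default: require a non-empty directory.
        return os.path.isdir(path) and len(os.listdir(path)) > 0
```

With this shape, callers only ever use validate_document_path, and each module decides how strict the check is by overriding the private hook.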
validate_document_path(self, path)
Attempt to validate the document collection at path.
By default, this will only check whether path exists. Subclasses should override _validate_document_path(path) with their own logic to perform more detailed checks.
Returns: True if the path is valid following the logic described above, or False if it is not
find_document_path(self)
Find the location of this collection’s documents (i.e., the raw document collection).
We first check the collection’s config for a path key. If found, self.validate_document_path checks whether the path is valid. Subclasses should override the private method self._validate_document_path with custom logic for checks beyond the existence of the directory.
If a valid path was not found, download_if_missing is called. Subclasses should override this method if downloading the needed documents is possible.
If a valid document path cannot be found, an exception is raised.
Returns: path to this collection’s raw documents
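The lookup order can be sketched as a standalone function. This is an illustration of the documented behavior, not the library's implementation: find_document_path_sketch and its parameters are hypothetical names, fallback_path plays the role the class description gives to the _path attribute, download_if_missing is assumed to return the downloaded location, and FileNotFoundError is this sketch's choice because the page only says that an exception is raised.

```python
import os


def find_document_path_sketch(config, validate, fallback_path=None, download_if_missing=None):
    """Illustrative only: mirrors the documented lookup order."""
    # 1. Prefer the "path" config option when it validates.
    path = config.get("path")
    if path and validate(path):
        return path

    # 2. Otherwise try the preset fallback (the role the docs give to _path).
    if fallback_path and validate(fallback_path):
        return fallback_path

    # 3. Otherwise try downloading; assumed here to return the new path.
    if download_if_missing is not None:
        path = download_if_missing()
        if path and validate(path):
            return path

    # The docs do not name the exception type; this one is an assumption.
    raise FileNotFoundError("no valid document path found")
```

For example, find_document_path_sketch({"path": "/data/docs"}, os.path.exists) returns "/data/docs" only if that directory exists on the machine running it; otherwise it falls through to the later steps.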