capreolus.collection
Submodules
capreolus.collection.antique
capreolus.collection.cds
capreolus.collection.codesearchnet
capreolus.collection.covid
capreolus.collection.covidabstract
capreolus.collection.dummy
capreolus.collection.gov2
capreolus.collection.highwire
capreolus.collection.msmarco
capreolus.collection.nf
capreolus.collection.nyt
capreolus.collection.robust04
capreolus.collection.wapo
Package Contents
Classes
Collection | Base class for Collection modules. The purpose of a Collection is to describe a document collection's location and its format.
IRDCollection | Base class for collections supported by ir_datasets.
- class capreolus.collection.Collection(config=None, provide=None, share_dependency_objects=False, build=True)
Bases: capreolus.ModuleBase
Base class for Collection modules. The purpose of a Collection is to describe a document collection’s location and its format.
- Determining the document collection’s location on disk:
  - The path config option is used if it contains a valid location.
  - If not, the _path attribute is used if it is valid. This is primarily used with DummyCollection.
  - If not, the class’s download_if_missing method is called.
- Modules should provide:
  - the collection_type and generator_type class attributes, corresponding to Anserini types
  - a download_if_missing method, if the collection is publicly available
  - a _validate_document_path method. See validate_document_path().
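The list of things a module should provide can be sketched as follows. This is a minimal illustration, not capreolus code: CollectionSketch is a hypothetical stand-in for the base class so that the example runs without capreolus installed, ToyTrecCollection and its cache directory are invented, and the two attribute values are only examples of Anserini type names. Returning the document path from download_if_missing is also an assumption.

```python
import os


class CollectionSketch:
    """Hypothetical stand-in for the documented base class (not capreolus code)."""

    collection_type = None  # name of an Anserini collection type
    generator_type = None   # name of an Anserini document generator type
    _path = None            # optional fallback location on disk

    def download_if_missing(self):
        # Assumed default: a collection that is not public cannot be downloaded.
        raise IOError(f"{type(self).__name__} cannot be downloaded automatically")

    def _validate_document_path(self, path):
        # Assumed default: no checks beyond the existence test done by the caller.
        return True


class ToyTrecCollection(CollectionSketch):
    """Hypothetical module describing a TREC-formatted collection."""

    collection_type = "TrecCollection"
    generator_type = "DefaultLuceneDocumentGenerator"

    def download_if_missing(self):
        # A real module would fetch and unpack the documents here; this sketch
        # only returns the (invented) directory they would be placed in.
        return os.path.join("cache", "toy_trec", "documents")

    def _validate_document_path(self, path):
        # Example check: require at least one file somewhere under the directory.
        return any(files for _, _, files in os.walk(path))
```

A collection that is not publicly available would simply leave download_if_missing unimplemented and rely on the path config option instead.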
- validate_document_path(path)
Attempt to validate the document collection at path.
By default, this will only check whether path exists. Subclasses should override _validate_document_path(path) with their own logic to perform more detailed checks.
- Returns
True if the path is valid following the logic described above, or False if it is not
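The split between the public method and the private hook might look like the sketch below. It assumes the public method performs the existence check and then delegates to _validate_document_path; the class names are hypothetical stand-ins and the actual capreolus implementation may differ.

```python
import os
import tempfile


class ValidatingSketch:
    """Hypothetical stand-in showing the public/private validation split."""

    def validate_document_path(self, path):
        # Default behaviour described above: the path must exist.
        if not (path and os.path.exists(path)):
            return False
        # Subclass-specific checks run only for paths that exist.
        return self._validate_document_path(path)

    def _validate_document_path(self, path):
        # Hook for subclasses; the base version adds no further checks.
        return True


class NonEmptyDirCollection(ValidatingSketch):
    """Hypothetical subclass with a stricter check."""

    def _validate_document_path(self, path):
        # Require a directory that contains at least one entry.
        return os.path.isdir(path) and len(os.listdir(path)) > 0


with tempfile.TemporaryDirectory() as tmp:
    base_ok = ValidatingSketch().validate_document_path(tmp)         # existing directory
    strict_ok = NonEmptyDirCollection().validate_document_path(tmp)  # existing but empty
# The temporary directory has been removed, so nothing under it exists.
missing_ok = ValidatingSketch().validate_document_path(os.path.join(tmp, "gone"))
```

Overriding only the private hook keeps the existence check in one place, so a subclass cannot accidentally accept a path that does not exist.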
- find_document_path()
Find the location of this collection’s documents (i.e., the raw document collection).
We first check the collection’s config for a path key. If found, self.validate_document_path checks whether the path is valid. Subclasses should override the private method self._validate_document_path with custom logic for performing checks beyond the existence of the directory.
If a valid path was not found, download_if_missing is called. Subclasses should override this method if downloading the needed documents is possible.
If a valid document path cannot be found, an exception is raised.
- Returns
path to this collection’s raw documents
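The lookup order described for Collection (config path, then the _path attribute, then download_if_missing) can be sketched roughly as below. LookupSketch is a hypothetical stand-in, not the capreolus implementation. The use of IOError is an assumption, since the documentation does not name the exception type, and so is the idea that download_if_missing returns the downloaded path.

```python
import os
import tempfile


class LookupSketch:
    """Hypothetical stand-in illustrating the documented lookup order."""

    _path = None  # optional fallback location

    def __init__(self, config=None):
        self.config = config or {}

    def validate_document_path(self, path):
        # Existence check only; see validate_document_path for the hook.
        return bool(path) and os.path.exists(path)

    def download_if_missing(self):
        # Assumed default: no download is available.
        raise IOError("no download available for this collection")

    def find_document_path(self):
        # 1. the path config option, 2. the _path attribute
        for candidate in (self.config.get("path"), self._path):
            if self.validate_document_path(candidate):
                return candidate
        # 3. fall back to downloading; this raises if no download exists
        path = self.download_if_missing()
        if not self.validate_document_path(path):
            raise IOError(f"no valid document path found (tried {path!r})")
        return path


with tempfile.TemporaryDirectory() as docs:
    # A valid config path wins over every other source.
    found = LookupSketch({"path": docs}).find_document_path()
    found_matches_config = found == docs
```

With an empty config and no _path, the sketch reaches download_if_missing and the default implementation raises.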
- class capreolus.collection.IRDCollection(config=None, provide=None, share_dependency_objects=False, build=True)
Bases: Collection
Base class for collections supported by ir_datasets.