capreolus.collection

Package Contents

Classes

Collection()
Robust04()
DummyCollection()
ANTIQUE()
MSMarco()
CodeSearchNet()
COVID()
capreolus.collection.logger
capreolus.collection.PACKAGE_PATH
class capreolus.collection.Collection

Bases: profane.ModuleBase

module_type = collection
is_large_collection = False
get_path_and_types(self)
validate_document_path(self, path)

Attempt to validate the document collection at path.

By default, this will only check whether path exists. Subclasses should override _validate_document_path(path) with their own logic to perform more detailed checks.

Returns: True if the path is valid according to the checks described above, False otherwise.
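
The override pattern described above can be sketched as follows. This is a minimal illustration, not the library's implementation: SketchCollection is a hypothetical stand-in for Collection so the example runs without capreolus installed, and the way validate_document_path delegates to _validate_document_path, as well as the required_subdirs names, are assumptions made for the sketch.

```python
import os
import tempfile


class SketchCollection:
    """Hypothetical stand-in for capreolus.collection.Collection."""

    def validate_document_path(self, path):
        # Documented default: the path must exist. Anything stricter is
        # left to the overridable private hook (delegation is assumed).
        if not (path and os.path.exists(path)):
            return False
        return self._validate_document_path(path)

    def _validate_document_path(self, path):
        # Default hook: no checks beyond existence.
        return True


class RequiresSubdirs(SketchCollection):
    """Hypothetical subclass that also requires certain subdirectories."""

    required_subdirs = ("docs_a", "docs_b")  # illustrative names only

    def _validate_document_path(self, path):
        return all(
            os.path.isdir(os.path.join(path, name))
            for name in self.required_subdirs
        )


def demo_validation():
    """Return (base_ok, strict_before, strict_after) for a temp directory."""
    with tempfile.TemporaryDirectory() as root:
        base_ok = SketchCollection().validate_document_path(root)
        strict_before = RequiresSubdirs().validate_document_path(root)
        for name in RequiresSubdirs.required_subdirs:
            os.mkdir(os.path.join(root, name))
        strict_after = RequiresSubdirs().validate_document_path(root)
    return base_ok, strict_before, strict_after
```

Keeping the existence check in the public method and the collection-specific logic in the private hook means a subclass only has to describe what makes its own directory layout valid.
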
find_document_path(self)

Find the location of this collection’s documents (i.e., the raw document collection).

The collection’s config is checked first for a path key. If one is found, self.validate_document_path checks whether the path is valid. Subclasses should override the private method self._validate_document_path to perform checks beyond the existence of the directory; see Robust04 for an example.

If no valid path is found, download_if_missing is called. Subclasses should override this method if downloading the needed documents is possible.

If no valid document path can be found, an exception is raised.

Returns: path to this collection’s raw documents
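
The lookup order described above (config path, then download, then failure) might look roughly like the sketch below. LookupSketch is a hypothetical stand-in rather than the real Collection class; the plain-dict config, the no-op download_if_missing, and the choice of IOError as the exception type are all assumptions made for illustration.

```python
import os
import tempfile


class LookupSketch:
    """Hypothetical stand-in illustrating the documented lookup order."""

    def __init__(self, config):
        self.config = config  # a plain dict here; an assumption

    def validate_document_path(self, path):
        # Default check only: the path must be set and must exist.
        return bool(path) and os.path.exists(path)

    def download_if_missing(self):
        # Assumed base behavior: nothing can be downloaded.
        return None

    def find_document_path(self):
        # 1. Prefer a path given in the config, if it validates.
        path = self.config.get("path")
        if self.validate_document_path(path):
            return path
        # 2. Otherwise try downloading, if a subclass supports it.
        path = self.download_if_missing()
        if self.validate_document_path(path):
            return path
        # 3. Nothing worked: fail loudly (exception type is assumed).
        raise IOError("no valid document path found")


def demo_lookup():
    """Return (found_config_path, raised_on_missing_path)."""
    with tempfile.TemporaryDirectory() as root:
        found = LookupSketch({"path": root}).find_document_path() == root
        missing = os.path.join(root, "missing")
        try:
            LookupSketch({"path": missing}).find_document_path()
            raised = False
        except IOError:
            raised = True
    return found, raised
```

A subclass that can fetch its documents would override download_if_missing to return the directory it downloaded into, which the second step then validates.
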
download_if_missing(self)
class capreolus.collection.Robust04

Bases: capreolus.collection.Collection

module_name = robust04
collection_type = TrecCollection
generator_type = DefaultLuceneDocumentGenerator
config_keys_not_in_path = ['path']
config_spec
download_if_missing(self)
download_index(self, cachedir, url, sha256, index_directory_inside, index_cache_path_string, index_expected_document_count)
class capreolus.collection.DummyCollection

Bases: capreolus.collection.Collection

module_name = dummy
collection_type = TrecCollection
generator_type = DefaultLuceneDocumentGenerator
class capreolus.collection.ANTIQUE

Bases: capreolus.collection.Collection

module_name = antique
collection_type = TrecCollection
generator_type = DefaultLuceneDocumentGenerator
download_if_missing(self)
class capreolus.collection.MSMarco

Bases: capreolus.collection.Collection

module_name = msmarco
config_keys_not_in_path = ['path']
collection_type = TrecCollection
generator_type = DefaultLuceneDocumentGenerator
config_spec
class capreolus.collection.CodeSearchNet

Bases: capreolus.collection.Collection

module_name = codesearchnet
url = https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2
collection_type = TrecCollection
generator_type = DefaultLuceneDocumentGenerator
config_spec
download_if_missing(self)
class capreolus.collection.COVID

Bases: capreolus.collection.Collection

module_name = covid
url = https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_%s.tar.gz
generator_type = Cord19Generator
config_spec
build(self)
download_if_missing(self)
transform_metadata(self, root_path)

This transformation is necessary for dataset rounds 1 and 2, according to https://discourse.cord-19.semanticscholar.org/t/faqs-about-cord-19-dataset/94

The assumed directory layout under root_path:

./root_path
    ./metadata.csv
    ./comm_use_subset
    ./noncomm_use_subset
    ./custom_license
    ./biorxiv_medrxiv
    ./archive

In a nutshell:

1. renaming:
    Microsoft Academic Paper ID -> mag_id
    WHO #Covidence -> who_covidence_id
2. update:
    has_pdf_parse -> pdf_json_files  # e.g. document_parses/pmc_json/PMC125340.xml.json
    has_pmc_xml_parse -> pmc_json_files
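
The renaming step listed above can be illustrated with a small standard-library sketch. This is not the method's actual implementation: rename_metadata_columns is a hypothetical helper, and it works on CSV text rather than on metadata.csv under root_path. The update step is left out on purpose, because the docstring does not spell out how the has_pdf_parse and has_pmc_xml_parse flags are turned into file paths.

```python
import csv
import io

# Column renames listed in the docstring.
RENAMES = {
    "Microsoft Academic Paper ID": "mag_id",
    "WHO #Covidence": "who_covidence_id",
}


def rename_metadata_columns(csv_text):
    """Return csv_text with old-style header names replaced (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    # Only the header row changes; data rows are written back untouched.
    rows[0] = [RENAMES.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Illustrative sample row, not taken from the real dataset.
sample = "cord_uid,Microsoft Academic Paper ID,WHO #Covidence\nabc,123,#45\n"
renamed = rename_metadata_columns(sample)
```

Columns that are not in RENAMES pass through unchanged, so the same helper could be applied to metadata files that already use the newer header names.
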