capreolus.collection

Package Contents

Classes

Collection(config=None, provide=None, share_dependency_objects=False, build=True) – Base class for profane modules.
Robust04(config=None, provide=None, share_dependency_objects=False, build=True) – Base class for profane modules.
DummyCollection(config=None, provide=None, share_dependency_objects=False, build=True) – Base class for profane modules.
ANTIQUE(config=None, provide=None, share_dependency_objects=False, build=True) – Base class for profane modules.
MSMarco(config=None, provide=None, share_dependency_objects=False, build=True) – Base class for profane modules.
CodeSearchNet(config=None, provide=None, share_dependency_objects=False, build=True) – Base class for profane modules.
COVID(config=None, provide=None, share_dependency_objects=False, build=True) – Base class for profane modules.
capreolus.collection.logger
capreolus.collection.PACKAGE_PATH
class capreolus.collection.Collection(config=None, provide=None, share_dependency_objects=False, build=True)

Bases: profane.ModuleBase

Base class for profane modules. Module construction proceeds as follows:

  1. Any config options not present in config are filled in with their default values. Config options and their defaults are specified in the config_spec class attribute.
  2. Any dependencies declared in the dependencies class attribute are recursively instantiated. If the dependency object is present in provide, this object will be used instead of instantiating a new object for the dependency.
  3. The module object’s config variable is updated to reflect the configs of its dependencies and then frozen.

After construction is complete, the module’s dependencies are available as instance variables: self.`dependency key`.

Parameters:
  • config – dictionary containing a config to apply to this module and its dependencies
  • provide – dictionary mapping dependency keys to module objects
  • share_dependency_objects – if true, dependencies will be cached in the registry based on their configs and reused. See the share_objects argument of ModuleBase.create.
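
The construction steps described above can be illustrated with a minimal, self-contained sketch. It does not use profane itself: SketchModule, SketchCollection, and SketchIndex are hypothetical stand-ins, and details such as passing provide down to nested dependencies or freezing via MappingProxyType are assumptions made for illustration only.

```python
from types import MappingProxyType


class SketchModule:
    """Hypothetical stand-in for a profane-style module; not the real ModuleBase."""

    config_spec = {}   # option name -> default value
    dependencies = {}  # dependency key -> module class

    def __init__(self, config=None, provide=None):
        config = dict(config or {})
        provide = provide or {}

        # Step 1: fill in defaults for options missing from config.
        own = {k: config.get(k, default) for k, default in self.config_spec.items()}

        # Step 2: instantiate dependencies recursively, preferring provided objects.
        for key, cls in self.dependencies.items():
            dep = provide[key] if key in provide else cls(config.get(key), provide)
            setattr(self, key, dep)  # available afterwards as self.<dependency key>
            # Step 3 (first half): reflect the dependency's config in our own.
            own[key] = dep.config

        # Step 3 (second half): freeze the config by exposing a read-only view.
        self.config = MappingProxyType(own)


class SketchCollection(SketchModule):
    config_spec = {"path": None}


class SketchIndex(SketchModule):
    config_spec = {"stemmer": "porter"}
    dependencies = {"collection": SketchCollection}


# Dependency built from the nested config.
index = SketchIndex({"collection": {"path": "/data/docs"}})

# Dependency supplied through provide instead of being instantiated.
shared = SketchCollection({"path": "/other"})
index_with_provided = SketchIndex(provide={"collection": shared})
```

In this sketch, index.collection is a freshly built SketchCollection, while index_with_provided.collection is the very object passed in through provide. Assigning to index.config afterwards fails because the mapping is read-only.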
module_type = collection
is_large_collection = False
get_path_and_types(self)
validate_document_path(self, path)

Attempt to validate the document collection at path.

By default, this will only check whether path exists. Subclasses should override _validate_document_path(path) with their own logic to perform more detailed checks.

Returns: True if the path is valid following the logic described above, or False if it is not
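
As a rough illustration of the override pattern described above, the sketch below pairs a public validation method with a private hook. The classes are hypothetical stand-ins rather than the real capreolus code; how the public method delegates to the hook, and the expected_dirs attribute with its directory names, are assumptions.

```python
import os
import tempfile


class SketchCollection:
    """Hypothetical stand-in, not the real capreolus Collection."""

    def validate_document_path(self, path):
        # Public entry point: reject missing paths, then defer to the hook.
        if not (path and os.path.exists(path)):
            return False
        return self._validate_document_path(path)

    def _validate_document_path(self, path):
        # Default hook: existence is the only check, so nothing more to do.
        return True


class SketchStrictCollection(SketchCollection):
    # Hypothetical subclass that also requires some subdirectories to exist.
    expected_dirs = ("part_a", "part_b")

    def _validate_document_path(self, path):
        return all(os.path.isdir(os.path.join(path, d)) for d in self.expected_dirs)


with tempfile.TemporaryDirectory() as tmp:
    base_ok = SketchCollection().validate_document_path(tmp)
    strict_before = SketchStrictCollection().validate_document_path(tmp)
    for d in SketchStrictCollection.expected_dirs:
        os.mkdir(os.path.join(tmp, d))
    strict_after = SketchStrictCollection().validate_document_path(tmp)

# The temporary directory has been removed, so this path no longer exists.
missing = SketchCollection().validate_document_path(os.path.join(tmp, "gone"))
```

The base class accepts any existing path, whereas the stricter subclass rejects the directory until the expected subdirectories are present.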
find_document_path(self)

Find the location of this collection’s documents (i.e., the raw document collection).

We first check the collection’s config for a path key. If one is found, self.validate_document_path checks whether the path is valid. Subclasses should override the private method self._validate_document_path with custom logic for checks that go beyond the existence of the directory. See Robust04.

If no valid path is found, download_if_missing is called. Subclasses should override this method if downloading the needed documents is possible.

If a valid document path still cannot be found, an exception is raised.

Returns: path to this collection’s raw documents
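
The lookup order described above (config path first, then download, then failure) might look roughly like the following sketch. The classes are hypothetical stand-ins; in particular, the assumption that download_if_missing returns a path, the cache_dir argument, and the choice of IOError are illustrative and not confirmed by this page.

```python
import os
import tempfile


class SketchCollection:
    """Hypothetical stand-in, not the real capreolus Collection."""

    def __init__(self, config=None):
        self.config = dict(config or {})

    def validate_document_path(self, path):
        return bool(path) and os.path.exists(path)

    def download_if_missing(self):
        # Base behaviour in this sketch: nothing can be downloaded.
        return None

    def find_document_path(self):
        # 1) Prefer a valid path taken from the config.
        path = self.config.get("path")
        if self.validate_document_path(path):
            return path
        # 2) Otherwise fall back to downloading, if the subclass supports it.
        path = self.download_if_missing()
        if self.validate_document_path(path):
            return path
        # 3) Nothing worked, so report the failure.
        raise IOError(f"no valid document path found for {type(self).__name__}")


class SketchDownloadable(SketchCollection):
    def __init__(self, config=None, cache_dir=None):
        super().__init__(config)
        self.cache_dir = cache_dir

    def download_if_missing(self):
        # Stand-in for a real download: just create a local directory.
        target = os.path.join(self.cache_dir, "documents")
        os.makedirs(target, exist_ok=True)
        return target


with tempfile.TemporaryDirectory() as tmp:
    from_config = SketchCollection({"path": tmp}).find_document_path() == tmp
    downloaded = SketchDownloadable(cache_dir=tmp).find_document_path()
    downloaded_ok = downloaded == os.path.join(tmp, "documents")

try:
    SketchCollection().find_document_path()
    failed = False
except IOError:
    failed = True
```

A collection with no configured path and no download support ends up in the final branch and raises.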
download_if_missing(self)
class capreolus.collection.Robust04(config=None, provide=None, share_dependency_objects=False, build=True)

Bases: capreolus.collection.Collection

module_name = robust04
collection_type = TrecCollection
generator_type = DefaultLuceneDocumentGenerator
config_keys_not_in_path = ['path']
config_spec
download_if_missing(self)
download_index(self, cachedir, url, sha256, index_directory_inside, index_cache_path_string, index_expected_document_count)
class capreolus.collection.DummyCollection(config=None, provide=None, share_dependency_objects=False, build=True)

Bases: capreolus.collection.Collection

module_name = dummy
collection_type = TrecCollection
generator_type = DefaultLuceneDocumentGenerator
class capreolus.collection.ANTIQUE(config=None, provide=None, share_dependency_objects=False, build=True)

Bases: capreolus.collection.Collection

module_name = antique
collection_type = TrecCollection
generator_type = DefaultLuceneDocumentGenerator
download_if_missing(self)
class capreolus.collection.MSMarco(config=None, provide=None, share_dependency_objects=False, build=True)

Bases: capreolus.collection.Collection

module_name = msmarco
config_keys_not_in_path = ['path']
collection_type = TrecCollection
generator_type = DefaultLuceneDocumentGenerator
config_spec
class capreolus.collection.CodeSearchNet(config=None, provide=None, share_dependency_objects=False, build=True)

Bases: capreolus.collection.Collection

module_name = codesearchnet
url = https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2
collection_type = TrecCollection
generator_type = DefaultLuceneDocumentGenerator
config_spec
download_if_missing(self)
class capreolus.collection.COVID(config=None, provide=None, share_dependency_objects=False, build=True)

Bases: capreolus.collection.Collection

module_name = covid
url = https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_%s.tar.gz
generator_type = Cord19Generator
config_spec
build(self)
download_if_missing(self)
transform_metadata(self, root_path)

This transformation is necessary for dataset rounds 1 and 2, according to https://discourse.cord-19.semanticscholar.org/t/faqs-about-cord-19-dataset/94

The assumed directory layout under root_path:

./root_path
  ./metadata.csv
  ./comm_use_subset
  ./noncomm_use_subset
  ./custom_license
  ./biorxiv_medrxiv
  ./archive

In a nutshell:

  1. renaming:
    Microsoft Academic Paper ID -> mag_id
    WHO #Covidence -> who_covidence_id
  2. update:
    has_pdf_parse -> pdf_json_files # e.g. document_parses/pmc_json/PMC125340.xml.json
    has_pmc_xml_parse -> pmc_json_files
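
The renaming step can be sketched with the standard csv module. This is an illustration only and not the actual transform_metadata implementation: rename_metadata_columns is a hypothetical helper, the sample row (including the sha column) is made up, and the update step is left out because the docstring does not say how the new file paths are derived.

```python
import csv
import io

# Column renames listed in the docstring above.
RENAMES = {
    "Microsoft Academic Paper ID": "mag_id",
    "WHO #Covidence": "who_covidence_id",
}


def rename_metadata_columns(src, dst):
    """Copy CSV rows from src to dst, renaming the header fields in RENAMES."""
    reader = csv.DictReader(src)
    fieldnames = [RENAMES.get(name, name) for name in reader.fieldnames]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Re-key each row so it matches the renamed header.
        writer.writerow({RENAMES.get(k, k): v for k, v in row.items()})


# Made-up sample standing in for an old-format metadata.csv.
old = io.StringIO(
    "sha,Microsoft Academic Paper ID,WHO #Covidence\n"
    "abc123,42,#7\n"
)
new = io.StringIO()
rename_metadata_columns(old, new)
```

Columns not listed in RENAMES pass through unchanged, so the same helper could be applied to files that already use the new names.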