:mod:`capreolus.collection`
===========================

.. py:module:: capreolus.collection


Package Contents
----------------

Classes
~~~~~~~

.. autoapisummary::

   capreolus.collection.Collection
   capreolus.collection.Robust04
   capreolus.collection.DummyCollection
   capreolus.collection.ANTIQUE
   capreolus.collection.MSMarco
   capreolus.collection.CodeSearchNet
   capreolus.collection.COVID


.. data:: logger


.. data:: PACKAGE_PATH


.. py:class:: Collection

   Bases: :class:`profane.ModuleBase`

   .. attribute:: module_type
      :annotation: = collection

   .. attribute:: is_large_collection
      :annotation: = False

   .. method:: get_path_and_types(self)

   .. method:: validate_document_path(self, path)

      Attempt to validate the document collection at `path`.

      By default, this only checks whether `path` exists. Subclasses should override
      `_validate_document_path(path)` with their own logic to perform more detailed checks.

      :returns: True if the path is valid following the logic described above, or False if it is not

   .. method:: find_document_path(self)

      Find the location of this collection's documents (i.e., the raw document collection).

      We first check the collection's config for a path key. If found, `self.validate_document_path`
      checks whether the path is valid. Subclasses should override the private method
      `self._validate_document_path` with custom logic for checks that go beyond the existence of
      the directory. See `Robust04`.

      If a valid path was not found, call `download_if_missing`.
      Subclasses should override this method if downloading the needed documents is possible.

      If a valid document path cannot be found, an exception is thrown.

      :returns: path to this collection's raw documents

   .. method:: download_if_missing(self)


.. py:class:: Robust04

   Bases: :class:`capreolus.collection.Collection`

   .. attribute:: module_name
      :annotation: = robust04

   .. attribute:: collection_type
      :annotation: = TrecCollection

   .. attribute:: generator_type
      :annotation: = DefaultLuceneDocumentGenerator

   .. attribute:: config_keys_not_in_path
      :annotation: = ['path']

   .. attribute:: config_spec

   .. method:: download_if_missing(self)

   .. method:: download_index(self, cachedir, url, sha256, index_directory_inside, index_cache_path_string, index_expected_document_count)


.. py:class:: DummyCollection

   Bases: :class:`capreolus.collection.Collection`

   .. attribute:: module_name
      :annotation: = dummy

   .. attribute:: collection_type
      :annotation: = TrecCollection

   .. attribute:: generator_type
      :annotation: = DefaultLuceneDocumentGenerator


.. py:class:: ANTIQUE

   Bases: :class:`capreolus.collection.Collection`

   .. attribute:: module_name
      :annotation: = antique

   .. attribute:: collection_type
      :annotation: = TrecCollection

   .. attribute:: generator_type
      :annotation: = DefaultLuceneDocumentGenerator

   .. method:: download_if_missing(self)


.. py:class:: MSMarco

   Bases: :class:`capreolus.collection.Collection`

   .. attribute:: module_name
      :annotation: = msmarco

   .. attribute:: config_keys_not_in_path
      :annotation: = ['path']

   .. attribute:: collection_type
      :annotation: = TrecCollection

   .. attribute:: generator_type
      :annotation: = DefaultLuceneDocumentGenerator

   .. attribute:: config_spec


.. py:class:: CodeSearchNet

   Bases: :class:`capreolus.collection.Collection`

   .. attribute:: module_name
      :annotation: = codesearchnet

   .. attribute:: url
      :annotation: = https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2

   .. attribute:: collection_type
      :annotation: = TrecCollection

   .. attribute:: generator_type
      :annotation: = DefaultLuceneDocumentGenerator

   .. attribute:: config_spec

   .. method:: download_if_missing(self)


.. py:class:: COVID

   Bases: :class:`capreolus.collection.Collection`

   .. attribute:: module_name
      :annotation: = covid

   .. attribute:: url
      :annotation: = https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_%s.tar.gz

   .. attribute:: generator_type
      :annotation: = Cord19Generator

   .. attribute:: config_spec

   .. method:: build(self)

   .. method:: download_if_missing(self)

   .. method:: transform_metadata(self, root_path)

      The transformation is necessary for dataset rounds 1 and 2, according to
      https://discourse.cord-19.semanticscholar.org/t/faqs-about-cord-19-dataset/94

      The assumed directory layout under `root_path`::

         ./root_path
             ./metadata.csv
             ./comm_use_subset
             ./noncomm_use_subset
             ./custom_license
             ./biorxiv_medrxiv
             ./archive

      In a nutshell:

      1. renaming:

         - Microsoft Academic Paper ID -> mag_id
         - WHO #Covidence -> who_covidence_id

      2. update:

         - has_pdf_parse -> pdf_json_files (e.g. document_parses/pmc_json/PMC125340.xml.json)
         - has_pmc_xml_parse -> pmc_json_files
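
The column changes listed for ``transform_metadata`` can be illustrated with a minimal sketch using only the standard library. This is not the Capreolus implementation: the helper names ``transform_rows`` and ``example_path_for``, the ``sha`` and ``pmcid`` columns, the ``"True"`` flag value, and the ``pdf_json`` path layout are all assumptions made for illustration. Only the four column mappings come from the docstring.

```python
import csv
import io

# Column renames listed in the docstring.
RENAMES = {
    "Microsoft Academic Paper ID": "mag_id",
    "WHO #Covidence": "who_covidence_id",
}

# "has parse" flag columns that are replaced by path columns.
FLAG_TO_PATH_COLUMN = {
    "has_pdf_parse": "pdf_json_files",
    "has_pmc_xml_parse": "pmc_json_files",
}


def transform_rows(rows, path_for):
    """Yield rows that use the newer column names.

    `path_for(column, row)` is a hypothetical callback returning the JSON
    path for a row; how paths are really derived is an assumption here.
    """
    for row in rows:
        out = {}
        for key, value in row.items():
            if key in FLAG_TO_PATH_COLUMN:
                new_key = FLAG_TO_PATH_COLUMN[key]
                # Assumes flags are stored as the string "True";
                # leave the path column empty otherwise.
                out[new_key] = path_for(new_key, row) if value == "True" else ""
            else:
                out[RENAMES.get(key, key)] = value
        yield out


def example_path_for(column, row):
    # Hypothetical layout modelled on the docstring's example path.
    if column == "pmc_json_files":
        return f"document_parses/pmc_json/{row['pmcid']}.xml.json"
    return f"document_parses/pdf_json/{row['sha']}.json"


# A tiny in-memory stand-in for an old-format metadata.csv.
sample = io.StringIO(
    "sha,pmcid,Microsoft Academic Paper ID,WHO #Covidence,"
    "has_pdf_parse,has_pmc_xml_parse\n"
    "abc123,PMC125340,42,#7,True,True\n"
)
transformed = list(transform_rows(csv.DictReader(sample), example_path_for))
```

Passing the path rule in as a callback keeps the renaming logic separate from the directory layout, which differs between dataset releases.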