:mod:`capreolus.collection`
===========================

.. py:module:: capreolus.collection


Package Contents
----------------

Classes
~~~~~~~

.. autoapisummary::

   capreolus.collection.Collection
   capreolus.collection.Robust04
   capreolus.collection.DummyCollection
   capreolus.collection.ANTIQUE
   capreolus.collection.MSMarco
   capreolus.collection.CodeSearchNet
   capreolus.collection.COVID


.. data:: logger
   

.. data:: PACKAGE_PATH
   

.. py:class:: Collection

   Bases: :class:`profane.ModuleBase`

   .. attribute:: module_type
      :annotation: = collection

      
   .. attribute:: is_large_collection
      :annotation: = False

      
   .. method:: get_path_and_types(self)


   .. method:: validate_document_path(self, path)

      Attempt to validate the document collection at `path`.

      By default, this will only check whether `path` exists. Subclasses should override
      `_validate_document_path(path)` with their own logic to perform more detailed checks.
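
      The split between the public method and the overridable hook can be sketched as follows.
      This is a minimal illustration, not the capreolus implementation: `SketchCollection` and
      `SketchTrecCollection` are hypothetical stand-ins, and the "directory must not be empty"
      rule is an assumed example of a stricter check.

      .. code-block:: python

         import os
         import tempfile


         class SketchCollection:
             """Hypothetical stand-in for Collection (illustration only)."""

             def validate_document_path(self, path):
                 # Public entry point: reject missing paths, then defer to the hook.
                 if not (path and os.path.exists(path)):
                     return False
                 return self._validate_document_path(path)

             def _validate_document_path(self, path):
                 # Default hook: existence alone is enough.
                 return True


         class SketchTrecCollection(SketchCollection):
             def _validate_document_path(self, path):
                 # Assumed stricter check: the directory must contain at least one entry.
                 return os.path.isdir(path) and len(os.listdir(path)) > 0


         with tempfile.TemporaryDirectory() as tmp:
             default_result = SketchCollection().validate_document_path(tmp)
             strict_result = SketchTrecCollection().validate_document_path(tmp)
             missing_result = SketchCollection().validate_document_path(os.path.join(tmp, "missing"))

      Here the empty temporary directory passes the default check but fails the stricter
      subclass check, and a nonexistent path fails before the hook is ever called.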

      :returns: True if `path` passes the checks described above, False otherwise


   .. method:: find_document_path(self)

      Find the location of this collection's documents (i.e., the raw document collection).

      We first check the collection's config for a `path` key. If one is found, `self.validate_document_path`
      checks whether the path is valid. Subclasses should override the private method
      `self._validate_document_path` with custom logic for checks that go beyond the existence of the
      directory. See `Robust04`.

      If no valid path is found, `download_if_missing` is called.
      Subclasses should override `download_if_missing` if downloading the needed documents is possible.

      If a valid document path still cannot be found, an exception is raised.
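
      The lookup order described above can be sketched roughly as below. This is an assumption-laden
      illustration rather than the actual implementation: `SketchCollection` is a hypothetical
      stand-in, the config is modeled as a plain dict, `download_if_missing` is assumed to return a
      path (or `None`), and `IOError` is an assumed choice of exception type.

      .. code-block:: python

         import os
         import tempfile


         class SketchCollection:
             """Hypothetical stand-in; only the lookup order follows the description."""

             def __init__(self, config):
                 self.config = config

             def validate_document_path(self, path):
                 return bool(path) and os.path.exists(path)

             def download_if_missing(self):
                 # Default: nothing can be downloaded, so no path is returned.
                 return None

             def find_document_path(self):
                 # 1. Prefer a path given in the config, if it validates.
                 path = self.config.get("path")
                 if self.validate_document_path(path):
                     return path
                 # 2. Otherwise fall back to downloading.
                 path = self.download_if_missing()
                 if self.validate_document_path(path):
                     return path
                 # 3. No valid path from either source.
                 raise IOError("no valid document path found")


         with tempfile.TemporaryDirectory() as tmp:
             found = SketchCollection({"path": tmp}).find_document_path() == tmp

         try:
             SketchCollection({}).find_document_path()
             raised = False
         except IOError:
             raised = True

      A subclass that can fetch its documents would override only `download_if_missing`,
      leaving the lookup order untouched.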

      :returns: path to this collection's raw documents


   .. method:: download_if_missing(self)


.. py:class:: Robust04

   Bases: :class:`capreolus.collection.Collection`

   .. attribute:: module_name
      :annotation: = robust04

      
   .. attribute:: collection_type
      :annotation: = TrecCollection

      
   .. attribute:: generator_type
      :annotation: = DefaultLuceneDocumentGenerator

      
   .. attribute:: config_keys_not_in_path
      :annotation: = ['path']

      
   .. attribute:: config_spec
      

   .. method:: download_if_missing(self)


   .. method:: download_index(self, cachedir, url, sha256, index_directory_inside, index_cache_path_string, index_expected_document_count)


.. py:class:: DummyCollection

   Bases: :class:`capreolus.collection.Collection`

   .. attribute:: module_name
      :annotation: = dummy

      
   .. attribute:: collection_type
      :annotation: = TrecCollection

      
   .. attribute:: generator_type
      :annotation: = DefaultLuceneDocumentGenerator

      
.. py:class:: ANTIQUE

   Bases: :class:`capreolus.collection.Collection`

   .. attribute:: module_name
      :annotation: = antique

      
   .. attribute:: collection_type
      :annotation: = TrecCollection

      
   .. attribute:: generator_type
      :annotation: = DefaultLuceneDocumentGenerator

      
   .. method:: download_if_missing(self)


.. py:class:: MSMarco

   Bases: :class:`capreolus.collection.Collection`

   .. attribute:: module_name
      :annotation: = msmarco

      
   .. attribute:: config_keys_not_in_path
      :annotation: = ['path']

      
   .. attribute:: collection_type
      :annotation: = TrecCollection

      
   .. attribute:: generator_type
      :annotation: = DefaultLuceneDocumentGenerator

      
   .. attribute:: config_spec
      

.. py:class:: CodeSearchNet

   Bases: :class:`capreolus.collection.Collection`

   .. attribute:: module_name
      :annotation: = codesearchnet

      
   .. attribute:: url
      :annotation: = https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2

      
   .. attribute:: collection_type
      :annotation: = TrecCollection

      
   .. attribute:: generator_type
      :annotation: = DefaultLuceneDocumentGenerator

      
   .. attribute:: config_spec
      

   .. method:: download_if_missing(self)


.. py:class:: COVID

   Bases: :class:`capreolus.collection.Collection`

   .. attribute:: module_name
      :annotation: = covid

      
   .. attribute:: url
      :annotation: = https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_%s.tar.gz
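
      The `%s` in this URL is a placeholder to be filled in before downloading. A minimal sketch,
      assuming it takes a release date string (the value below is an assumed example, not taken
      from this module):

      .. code-block:: python

         url = (
             "https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"
             "historical_releases/cord-19_%s.tar.gz"
         )
         release_date = "2020-04-10"  # assumed example value
         release_url = url % release_date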

      
   .. attribute:: generator_type
      :annotation: = Cord19Generator

      
   .. attribute:: config_spec
      

   .. method:: build(self)


   .. method:: download_if_missing(self)


   .. method:: transform_metadata(self, root_path)

      This transformation is necessary for dataset rounds 1 and 2, according to
      https://discourse.cord-19.semanticscholar.org/t/faqs-about-cord-19-dataset/94

      The assumed directory layout under `root_path`::

         ./root_path
             ./metadata.csv
             ./comm_use_subset
             ./noncomm_use_subset
             ./custom_license
             ./biorxiv_medrxiv
             ./archive

      In a nutshell:

      1. Renaming:

         - Microsoft Academic Paper ID -> mag_id
         - WHO #Covidence -> who_covidence_id

      2. Update:

         - has_pdf_parse -> pdf_json_files (e.g. document_parses/pmc_json/PMC125340.xml.json)
         - has_pmc_xml_parse -> pmc_json_files
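
      The two steps can be sketched on a single metadata row as below. This is a hedged
      illustration, not the method's actual code: `transform_row` is a hypothetical helper, and
      the `sha`, `pmcid`, and `full_text_file` columns, the "True"/"False" string flags, and the
      resulting path layout are all assumptions made for the example.

      .. code-block:: python

         RENAMED_COLUMNS = {
             "Microsoft Academic Paper ID": "mag_id",
             "WHO #Covidence": "who_covidence_id",
         }


         def transform_row(row):
             """Hypothetical helper applying both steps to one metadata row (a dict)."""
             # Step 1: rename the old column names, keeping all other keys as they are.
             new_row = {RENAMED_COLUMNS.get(key, key): value for key, value in row.items()}

             # Step 2: replace the boolean flags with file path fields.
             # The path layout below is an assumption for illustration.
             subset = new_row.get("full_text_file", "")
             sha = new_row.get("sha", "")
             pmcid = new_row.get("pmcid", "")
             has_pdf = new_row.pop("has_pdf_parse", "False") == "True"
             has_pmc = new_row.pop("has_pmc_xml_parse", "False") == "True"
             new_row["pdf_json_files"] = f"{subset}/pdf_json/{sha}.json" if has_pdf else ""
             new_row["pmc_json_files"] = f"{subset}/pmc_json/{pmcid}.xml.json" if has_pmc else ""
             return new_row


         old_row = {
             "sha": "abc123",
             "pmcid": "PMC125340",
             "full_text_file": "custom_license",
             "Microsoft Academic Paper ID": "42",
             "WHO #Covidence": "#7",
             "has_pdf_parse": "True",
             "has_pmc_xml_parse": "False",
         }
         new_row = transform_row(old_row)

      After the call, `new_row` carries `mag_id` and `who_covidence_id` in place of the old
      column names, and the two flag columns are replaced by `pdf_json_files` and `pmc_json_files`.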