capreolus.collection.covid

Module Contents

Classes

COVID

The COVID-19 Open Research Dataset (https://www.semanticscholar.org/cord19)

Attributes

logger

PACKAGE_PATH

capreolus.collection.covid.logger
capreolus.collection.covid.PACKAGE_PATH
class capreolus.collection.covid.COVID(config=None, provide=None, share_dependency_objects=False, build=True)

Bases: capreolus.collection.Collection

The COVID-19 Open Research Dataset (https://www.semanticscholar.org/cord19)

module_name = covid
url = https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_%s.tar.gz
generator_type = Cord19Generator
config_spec
build(self)
download_if_missing(self)

Download the collection and return its path. Subclasses should override this.
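The `url` attribute above contains a `%s` placeholder. As a minimal sketch, assuming the placeholder is filled with a release date string (the helper name `release_url` and the date used are illustrative assumptions, not part of the documented API), the archive URL could be built like this:

```python
# URL template copied from the `url` attribute documented above.
URL_TEMPLATE = (
    "https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"
    "historical_releases/cord-19_%s.tar.gz"
)


def release_url(release_date: str) -> str:
    """Fill the template's %s slot (hypothetical helper, not a capreolus function)."""
    return URL_TEMPLATE % release_date


# The date below is an illustrative value only.
print(release_url("2020-07-16"))
```

How the class actually selects the value for the placeholder is not stated here; it is presumably driven by `config_spec`.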

transform_metadata(self, root_path)

The transformation is necessary for dataset rounds 1 and 2, according to https://discourse.cord-19.semanticscholar.org/t/faqs-about-cord-19-dataset/94

The assumed directory layout under root_path:

./root_path
    ./metadata.csv
    ./comm_use_subset
    ./noncomm_use_subset
    ./custom_license
    ./biorxiv_medrxiv
    ./archive

In a nutshell:

1. Renaming:

    Microsoft Academic Paper ID -> mag_id
    WHO #Covidence -> who_covidence_id

2. Update:

    has_pdf_parse -> pdf_json_files  # e.g. document_parses/pmc_json/PMC125340.xml.json
    has_pmc_xml_parse -> pmc_json_files
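The two steps can be sketched on a single metadata row as follows. This is an illustration under stated assumptions, not the method's actual code: `transform_row`, `pdf_path`, and `pmc_path` are hypothetical names, the flag columns are assumed to hold the strings "True" or "False", and the way parse-file paths are derived is left to caller-supplied functions because the docstring does not specify it.

```python
# Step 1: column renames listed in the docstring.
RENAMES = {
    "Microsoft Academic Paper ID": "mag_id",
    "WHO #Covidence": "who_covidence_id",
}


def transform_row(row, pdf_path, pmc_path):
    """Return a copy of one metadata row in the newer column layout.

    pdf_path and pmc_path are hypothetical callables that map the original
    row to the relative path of its parse file.
    """
    # Apply the renames; columns not in RENAMES keep their names.
    out = {RENAMES.get(key, key): value for key, value in row.items()}

    # Step 2: replace the boolean flag columns with path columns.
    # Assumption: flags are stored as the strings "True" / "False".
    has_pdf = out.pop("has_pdf_parse", "False") == "True"
    has_pmc = out.pop("has_pmc_xml_parse", "False") == "True"
    out["pdf_json_files"] = pdf_path(row) if has_pdf else ""
    out["pmc_json_files"] = pmc_path(row) if has_pmc else ""
    return out


# Example row with made-up values.
row = {
    "cord_uid": "abc123",
    "Microsoft Academic Paper ID": "42",
    "WHO #Covidence": "",
    "has_pdf_parse": "True",
    "has_pmc_xml_parse": "False",
}
new_row = transform_row(
    row,
    pdf_path=lambda r: "pdf_json/" + r["cord_uid"] + ".json",  # made-up layout
    pmc_path=lambda r: "pmc_json/" + r["cord_uid"] + ".json",  # made-up layout
)
print(sorted(new_row))
```

Passing the path builders in as arguments keeps the sketch independent of any particular release layout.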