capreolus.collection.covid
Module Contents
Classes
COVID | The COVID-19 Open Research Dataset (https://www.semanticscholar.org/cord19)
class capreolus.collection.covid.COVID(config=None, provide=None, share_dependency_objects=False, build=True)
Bases: capreolus.collection.Collection
The COVID-19 Open Research Dataset (https://www.semanticscholar.org/cord19)
url = https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_%s.tar.gz
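The url attribute is a printf-style template with a single %s placeholder. A minimal sketch of filling it in follows; it assumes the placeholder takes a release date string, and both the helper name release_url and the date used here are illustrative, not taken from the class.

```python
# Template copied from the `url` attribute documented above.
URL_TEMPLATE = (
    "https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"
    "historical_releases/cord-19_%s.tar.gz"
)


def release_url(release_date):
    """Fill the %s placeholder (hypothetical helper, not part of capreolus)."""
    return URL_TEMPLATE % release_date


# "2020-04-10" is an assumed example value for the placeholder.
print(release_url("2020-04-10"))
```

How the COVID class selects the value substituted into the template is not stated on this page.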
download_if_missing(self)
Download the collection and return its path. Subclasses should override this.
transform_metadata(self, root_path)
Transform the metadata file. This is necessary for dataset rounds 1 and 2, according to https://discourse.cord-19.semanticscholar.org/t/faqs-about-cord-19-dataset/94
The assumed directory layout under root_path:
./root_path
    ./metadata.csv
    ./comm_use_subset
    ./noncomm_use_subset
    ./custom_license
    ./biorxiv_medrxiv
    ./archive
In a nutshell:
1. renaming:
    Microsoft Academic Paper ID -> mag_id
    WHO #Covidence -> who_covidence_id
2. update:
    has_pdf_parse -> pdf_json_files  # e.g. document_parses/pmc_json/PMC125340.xml.json
    has_pmc_xml_parse -> pmc_json_files
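The renaming step can be sketched with the standard csv module. This is only an illustration under stated assumptions, not the method's actual implementation: the helper name rename_metadata_columns and the sample row are hypothetical, and the update step is left out because this page does not say how the new values are derived.

```python
import csv
import io

# Header renames listed in the docstring (old name -> new name).
RENAMES = {
    "Microsoft Academic Paper ID": "mag_id",
    "WHO #Covidence": "who_covidence_id",
}


def rename_metadata_columns(src, dst):
    """Copy CSV rows from src to dst, renaming header fields per RENAMES.

    Hypothetical helper for illustration; data rows are copied unchanged.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    writer.writerow([RENAMES.get(name, name) for name in header])
    writer.writerows(reader)


# Made-up sample data standing in for metadata.csv.
sample = io.StringIO(
    "cord_uid,Microsoft Academic Paper ID,WHO #Covidence\n"
    "abc123,42,#7\n"
)
out = io.StringIO()
rename_metadata_columns(sample, out)
print(out.getvalue())
```

Columns that are not in the mapping pass through under their original names, so the same function is harmless on a file whose headers have already been renamed.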