scibart

So far, this is a refactoring of a notebook from CurationCorp's amazing curation-corpus repository, adapted to run on GPU clusters for fine-tuning BART for abstractive summarization of scientific literature.

Part of the CoronaWhy project.
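
For orientation, the sketch below shows the kind of abstractive summarization this project fine-tunes BART for, using the Hugging Face transformers API. The facebook/bart-large-cnn checkpoint and the generation parameters are illustrative assumptions, not values taken from this repo.

```python
# Illustrative only: summarize a passage with a pretrained BART checkpoint.
# The checkpoint name and generation parameters below are assumptions,
# not settings used by this repository.
from transformers import BartForConditionalGeneration, BartTokenizer

checkpoint = "facebook/bart-large-cnn"  # assumed public checkpoint
tokenizer = BartTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

text = "Replace with the body of a scientific article..."
inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,        # beam search, a common default for summarization
    max_length=142,     # cap on summary length in tokens
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```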

How to create a dataset from scratch

The dataset is currently sourced as follows:

  1. Download the ArXiv and Semantic Scholar Corpus datasets from Google Drive (as described here) and unzip them into raw_data/ArxivStructuredAbstractSectionalSummaries and raw_data/SemanticScholarAbstractSectionSummaryDataSet.
  2. Download wikihowAll.csv (as described here) into raw_data/wikihow.
  3. Scrape the Curation Corpus dataset as explained in its repo, then move curation-corpus-base-with-articles.csv to raw_data/curation_corpus.
  4. Run python src/data/create_dataset.py. This creates a new folder called data containing ~40 compressed parquet files, which can be loaded as shown below.
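
A minimal sketch for loading the resulting dataset back into memory, assuming the data/ folder of parquet files produced in step 4 (requires pandas with a parquet engine such as pyarrow):

```python
# Load all parquet shards written by create_dataset.py into one dataframe.
# Assumes the default output location (data/) described in step 4.
from pathlib import Path

import pandas as pd

shards = sorted(Path("data").glob("*.parquet"))
df = pd.concat((pd.read_parquet(p) for p in shards), ignore_index=True)

print(f"{len(df)} examples across {len(shards)} shards")
print(df.columns.tolist())  # expected: ['text', 'summary', 'data_src']
```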

The current dataset is stored in a single pandas dataframe with the following schema:

| Column name | Type | Description |
| ----------- | ---- | ----------- |
| text | str | Original text on which the summary is based |
| summary | str | Summary of the original text |
| data_src | str | Directory name of the original dataset in raw_data |
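
Since data_src records the raw_data directory each example came from, the combined dataframe loaded above can be sliced per source; the directory name below is taken from step 1:

```python
# Count examples per source dataset using the data_src column.
per_source = df.groupby("data_src").size()
print(per_source)

# e.g. keep only the ArXiv-derived examples (directory name from step 1)
arxiv = df[df["data_src"] == "ArxivStructuredAbstractSectionalSummaries"]
```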
