scibart

So far, this is a refactoring of a notebook from CurationCorp's amazing curation-corpus repository, adapted to run on GPU clusters for fine-tuning BART for abstractive summarization of scientific literature.

Part of the CoronaWhy project.
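
For orientation, the sketch below shows the kind of abstractive summarization this project fine-tunes BART for, using the Hugging Face transformers API. The facebook/bart-large-cnn checkpoint and the generation parameters are illustrative assumptions, not values taken from this repo.

```python
# Illustrative only: summarize a passage with a pretrained BART checkpoint.
# The checkpoint name and generation parameters below are assumptions,
# not settings used by this repository.
from transformers import BartForConditionalGeneration, BartTokenizer

checkpoint = "facebook/bart-large-cnn"  # assumed public checkpoint
tokenizer = BartTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

text = "Replace with the body of a scientific article..."
inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,        # beam search, a common default for summarization
    max_length=142,     # cap on summary length in tokens
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```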

How to create a dataset from scratch

The dataset is currently sourced as follows:

  1. Download the ArXiv and Semantic Scholar Corpus datasets from Google Drive (as described here) and unzip them into raw_data/ArxivStructuredAbstractSectionalSummaries and raw_data/SemanticScholarAbstractSectionSummaryDataSet.
  2. Download wikihowAll.csv (as described here) into raw_data/wikihow.
  3. Scrape the Curation Corpus dataset as explained in its repo, then move curation-corpus-base-with-articles.csv to raw_data/curation_corpus.
  4. Run python src/data/create_dataset.py. This creates a new folder called data containing ~40 compressed parquet files, which can be loaded as shown below.
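
A minimal sketch for loading the resulting dataset back into memory, assuming the data/ folder of parquet files produced in step 4 (requires pandas with a parquet engine such as pyarrow):

```python
# Load all parquet shards written by create_dataset.py into one dataframe.
# Assumes the default output location (data/) described in step 4.
from pathlib import Path

import pandas as pd

shards = sorted(Path("data").glob("*.parquet"))
df = pd.concat((pd.read_parquet(p) for p in shards), ignore_index=True)

print(f"{len(df)} examples across {len(shards)} shards")
print(df.columns.tolist())  # expected: ['text', 'summary', 'data_src']
```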

The current dataset is stored in a single pandas dataframe with the following schema:

| Column name | Type | Description |
| ----------- | ---- | ----------- |
| text | str | Original text on which the summary is based |
| summary | str | Summary of the original text |
| data_src | str | Directory name of the original dataset in raw_data |
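
Since data_src records the raw_data directory each example came from, the combined dataframe loaded above can be sliced per source; the directory name below is taken from step 1:

```python
# Count examples per source dataset using the data_src column.
per_source = df.groupby("data_src").size()
print(per_source)

# e.g. keep only the ArXiv-derived examples (directory name from step 1)
arxiv = df[df["data_src"] == "ArxivStructuredAbstractSectionalSummaries"]
```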
