This repository contains all the assignments I completed for a series of online courses on performing data science and data engineering at scale with Spark. Every assignment requires Python coding fluency and a solid understanding of the Spark framework and scalable algorithms. In addition, familiarity with basic machine learning concepts and exposure to algorithms, probability, linear algebra, and calculus is needed to complete them.
- Introduction to Apache Spark: fundamentals and architecture of Apache Spark
lab one:
Part 1: Basic notebook usage and Python integration
Part 2: An introduction to using Apache Spark with the PySpark SQL API running in a notebook
Part 3: Using DataFrames and chaining together transformations and actions (see the sketch after this list)
Part 4: Python Lambda functions and User Defined Functions
Part 5: Additional DataFrame actions
Part 6: Additional DataFrame transformations
Part 7: Caching DataFrames and storage options
Part 8: Debugging Spark applications and lazy evaluation
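Below is a minimal PySpark sketch of the ideas in Parts 3, 4, 7, and 8: chaining transformations, wrapping a Python lambda as a User Defined Function, caching, and lazy evaluation. The DataFrame contents and column names are made-up placeholders, not the lab's actual data.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("lab1-sketch").getOrCreate()

# A small illustrative DataFrame (names and values are made up).
df = spark.createDataFrame(
    [("alice", 34), ("bob", 19), ("carol", 27)],
    ["name", "age"],
)

# Chaining transformations; nothing executes yet (lazy evaluation).
adults = df.filter(F.col("age") >= 21).select("name", "age")

# A Python lambda wrapped as a User Defined Function.
shout = F.udf(lambda s: s.upper(), StringType())
adults = adults.withColumn("name_upper", shout(F.col("name")))

# Caching keeps the DataFrame in memory across multiple actions.
adults.cache()

# Actions trigger execution of the whole lazily built plan.
print(adults.count())
adults.show()
```

Nothing is computed until `count()` or `show()` runs; until then Spark only accumulates the query plan, which is why debugging often means forcing an action early.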
lab two:
Part 1: Introduction and Imports
Part 2: Exploratory Data Analysis
Part 3: Analysis Walk-Through on the Web Server Log File
Part 4: Analyzing Web Server Log File
Part 5: Exploring 404 Response Codes (a minimal sketch follows this list)
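As a rough illustration of the log-analysis parts above, here is a hedged sketch. The regex assumes Apache Common Log Format and `access.log` is a placeholder path; both are assumptions, not the lab's actual setup.

```python
import re
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-analysis-sketch").getOrCreate()

# Apache Common Log Format pattern (a simplifying assumption).
LOG_PATTERN = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) ?\S*" (\d{3}) (\S+)'

def parse_line(line):
    """Return a tuple of log fields, or None if the line doesn't match."""
    m = re.match(LOG_PATTERN, line)
    if m is None:
        return None
    return (m.group(1), m.group(2), m.group(3), m.group(4),
            int(m.group(5)), m.group(6))

# 'access.log' is a placeholder path for the lab's log file.
logs = (spark.sparkContext.textFile("access.log")
        .map(parse_line)
        .filter(lambda x: x is not None)
        .toDF(["host", "timestamp", "method", "path", "status", "size"]))

# Exploring 404 response codes: which paths 404 most often?
(logs.filter(F.col("status") == 404)
     .groupBy("path")
     .count()
     .orderBy(F.desc("count"))
     .show(10))
```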
- Distributed Machine Learning with Apache Spark
lab one:
basic machine learning concepts
supervised learning pipelines
linear algebra
computational complexity/big O notation
RDD data structure (a short sketch follows this list)
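A minimal sketch of the RDD data structure listed above, contrasting lazy transformations with actions; the numbers are toy data.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# Distribute a local Python range as an RDD across 4 partitions.
nums = sc.parallelize(range(1, 11), numSlices=4)

# Transformations are lazy; these build a lineage, not a result.
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions force computation across the partitions.
print(evens.collect())                      # [4, 16, 36, 64, 100]
print(squares.reduce(lambda a, b: a + b))   # sum of squares = 385
```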
lab two:
Linear regression formulation and closed-form solution
Distributed machine learning principles (related to computation, storage, and communication)
Develop an end-to-end linear regression pipeline to predict the release year of a song given a set of audio features. Implement a gradient descent solver for linear regression, use Spark's machine learning library (MLlib) to train additional models, tune models via grid search, improve accuracy using quadratic features, and visualize various intermediate results to build intuition. Finally, write a concise version of this pipeline using Spark's pipeline API.
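As a rough sketch of the gradient descent solver described above (not the lab's exact code), the loop below sums per-point squared-error gradients across the cluster and updates the weights on the driver. The three (label, features) pairs are synthetic stand-ins for the song data, and `alpha` and the iteration count are arbitrary choices.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="linreg-sketch")

# Toy (label, features) pairs standing in for the song-year data.
data = sc.parallelize([
    (2001.0, np.array([0.2, 0.5])),
    (1995.0, np.array([0.1, 0.9])),
    (2010.0, np.array([0.7, 0.3])),
]).cache()

n = data.count()
w = np.zeros(2)    # weight vector
alpha = 0.1        # learning rate

def gradient(point, w):
    """Gradient of the squared error for one point: (w.x - y) * x."""
    y, x = point
    return (w.dot(x) - y) * x

for _ in range(50):
    # Sum per-point gradients across the cluster, then update locally.
    grad = data.map(lambda p: gradient(p, w)).reduce(lambda a, b: a + b)
    w -= (alpha / n) * grad

print(w)
```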
lab three:
Online advertising, linear classification, logistic regression, working with probabilistic predictions, categorical data and one-hot-encoding, feature hashing for dimensionality reduction.
Construct a logistic regression pipeline to predict click-through rate using data from a recent Kaggle competition. Extract numerical features from the raw categorical data using one-hot-encoding, reduce the dimensionality of these features via hashing, train logistic regression models using MLlib, tune hyperparameters via grid search, and interpret probabilistic predictions via an ROC plot.
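A minimal sketch of a hashed-features logistic regression pipeline in the spirit of this lab, using `pyspark.ml`'s `FeatureHasher` and `LogisticRegression`. The rows and column names are invented stand-ins for the Kaggle fields, and scoring on the training set is only for brevity.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import FeatureHasher
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ctr-sketch").getOrCreate()

# Toy click-through rows; the column names stand in for the Kaggle fields.
df = spark.createDataFrame(
    [(1.0, "mobile", "sports"), (0.0, "desktop", "news"),
     (1.0, "mobile", "news"),   (0.0, "tablet", "sports"),
     (1.0, "tablet", "news"),   (0.0, "desktop", "sports")],
    ["label", "device", "category"],
)

# Feature hashing maps the categorical columns into a fixed-size vector,
# playing the same role as hashing one-hot-encoded features.
hasher = FeatureHasher(inputCols=["device", "category"],
                       outputCol="features", numFeatures=256)

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

# A hand-rolled grid search over regularization strength. In practice you
# would hold out data or use CrossValidator rather than score on the
# training set as done here.
for reg in [0.01, 0.1, 1.0]:
    lr = LogisticRegression(regParam=reg)
    model = Pipeline(stages=[hasher, lr]).fit(df)
    print(reg, evaluator.evaluate(model.transform(df)))
```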
lab four:
Introduction to neuroscience and neuroimaging data, exploratory data analysis, principal component analysis (PCA) formulation and solution, distributed PCA.
Neuroimaging Analysis via PCA - Identify patterns of brain activity in larval zebrafish. Work with time-varying images (generated using a technique called light-sheet microscopy) that capture a zebrafish's neural activity as it is presented with a moving visual pattern. After implementing distributed PCA from scratch and gaining intuition by working with synthetic data, use PCA to identify distinct patterns across the zebrafish brain that are induced by different types of stimuli.
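Below is a minimal sketch of distributed PCA in the spirit of this lab: compute the mean and covariance with distributed reduces, eigendecompose the small covariance matrix on the driver, then project each row back out on the cluster. The data here is random placeholder noise, not the zebrafish imagery, and the dimensions are arbitrary.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="pca-sketch")

# Toy rows standing in for per-pixel time series of the zebrafish images.
rows = sc.parallelize([np.random.randn(5) for _ in range(100)]).cache()

n = rows.count()
mean = rows.reduce(lambda a, b: a + b) / n

# Distributed covariance: sum of outer products of the centered rows.
cov = rows.map(lambda x: np.outer(x - mean, x - mean)) \
          .reduce(lambda a, b: a + b) / n

# The d x d covariance is small, so eigendecompose it on the driver.
eigvals, eigvecs = np.linalg.eigh(cov)
top_k = eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # top-2 components

# Project each row onto the principal components (another distributed map).
scores = rows.map(lambda x: (x - mean).dot(top_k))
print(scores.take(3))
```

This works because the covariance matrix is only d x d even when the number of rows is huge, so the expensive part (the sum of outer products) is distributed while the eigendecomposition stays cheap and local.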