BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

Last update: Jan 06, 2022

Overview

Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems.

Introduction

BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.

Installation

Download the BigDL packages, or install BigDL with pip (for example, inside a conda environment).

How to run Program on Spark

Usage: spark-submit-with-bigdl.sh [options] file.py

Options:

--master MASTER_URL: The Spark master URL: spark, yarn, k8s, or local.
local[k]: Run Spark locally with k worker threads (ideally, set k to the number of logical cores on your machine).
file.py: The Python file containing the program to run.
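
As a rough illustration of the usage line above, the sketch below assembles the command in Python. The helper name `build_submit_command` and the script name `lstm_mnist.py` are hypothetical; `spark-submit-with-bigdl.sh` and `local[k]` come from the text above, and `--master` is the standard spark-submit option for the master URL.

```python
import os


def build_submit_command(script, master=None, launcher="spark-submit-with-bigdl.sh"):
    """Return the argument list for launching `script` through the BigDL wrapper.

    If no master URL is given, fall back to local mode with one worker
    thread per logical core reported by the OS.
    """
    if master is None:
        cores = os.cpu_count() or 1  # cpu_count() may return None
        master = f"local[{cores}]"
    return [launcher, "--master", master, script]


# Hypothetical script name, used only for illustration.
command = build_submit_command("lstm_mnist.py", master="local[4]")
print(" ".join(command))
```

The list form can be passed directly to `subprocess.run` if the launcher script is on the PATH, which avoids shell quoting problems with the brackets in `local[k]`.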

System configuration

The program was run on a system with the following configuration:

System/Host Processor: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
CPU(s): 48
Core(s) per socket: 12
Socket(s): 2
Memory: 183 G (free)

Data Description and Run Model

The dataset is MNIST: small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9. The data is split into two parts: 60,000 training images and 10,000 test images.

For this evaluation, we use an LSTM model for the MNIST digit classification problem.
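
One common way to feed MNIST to an LSTM is to treat each 28×28 image as a sequence of 28 time steps, one per pixel row, with 28 features per step. The source does not say how its model arranges the input, so this layout is an assumption. The function name `image_to_sequence` is hypothetical, and the sketch uses plain Python lists rather than BigDL tensors.

```python
ROWS, COLS = 28, 28  # MNIST image height and width in pixels


def image_to_sequence(flat_pixels, rows=ROWS, cols=COLS):
    """Split a flat list of rows*cols pixel values into `rows` time steps.

    Each time step is one image row holding `cols` features.
    """
    if len(flat_pixels) != rows * cols:
        raise ValueError(f"expected {rows * cols} pixels, got {len(flat_pixels)}")
    return [flat_pixels[r * cols:(r + 1) * cols] for r in range(rows)]


# A dummy all-zero image stands in for a real MNIST sample.
dummy_image = [0] * (ROWS * COLS)
sequence = image_to_sequence(dummy_image)
print(len(sequence), len(sequence[0]))  # time steps, features per step
```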

BigDL Performance Evaluation

Execution running time

Computation Evaluation (SPEED UP)
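
Speedup is conventionally defined as the running time with one worker divided by the running time with k workers, and parallel efficiency as speedup divided by k. The sketch below applies those definitions. The timings are made-up placeholders, not measurements from this evaluation, and the function names are hypothetical.

```python
def speedup(baseline_seconds, parallel_seconds):
    """Speedup S(k) = T(1) / T(k)."""
    if parallel_seconds <= 0:
        raise ValueError("running time must be positive")
    return baseline_seconds / parallel_seconds


def efficiency(baseline_seconds, parallel_seconds, workers):
    """Parallel efficiency E(k) = S(k) / k."""
    return speedup(baseline_seconds, parallel_seconds) / workers


# Placeholder timings in seconds, keyed by the k in local[k].
# These are NOT results from this evaluation.
placeholder_times = {1: 400.0, 2: 200.0, 4: 125.0}

baseline = placeholder_times[1]
for k, seconds in sorted(placeholder_times.items()):
    s = speedup(baseline, seconds)
    e = efficiency(baseline, seconds, k)
    print(f"local[{k}]: speedup={s:.2f}, efficiency={e:.2f}")
```

An efficiency close to 1 means adding threads pays off almost linearly; values well below 1 suggest overhead from scheduling or data movement.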

Owner

Vo Cong Thanh
