A Lightweight Experiment & Resource Monitoring Tool 📺

Last update: Dec 28, 2022

Lightweight Experiment & Resource Monitoring 📺

"Did I already run this experiment before? How many resources are currently available on my cluster?" If you regularly ask yourself these questions as a researcher, then `mle-monitor` is made for you. It provides a lightweight API for tracking your experiments in a pickle-based protocol database (e.g. for hyperparameter searches and/or multi-configuration/multi-seed runs). It also comes with built-in resource monitoring on Slurm/Grid Engine clusters and local machines/servers.

mle-monitor provides three core functionalities:

MLEProtocol: A composable protocol database API for ML experiments.
MLEResource: A tool for obtaining server/cluster usage statistics.
MLEDashboard: A dashboard visualizing resource usage & experiment protocol.

To get started I recommend checking out the colab notebook and an example workflow.

`MLEProtocol`: Keeping Track of Your Experiments 📝

from mle_monitor import MLEProtocol

# Load protocol database or create new one -> print summary
protocol_db = MLEProtocol("mle_protocol.db", verbose=False)
protocol_db.summary(tail=10, verbose=True)

# Draft data to store in protocol & add it to the protocol
meta_data = {
    "purpose": "Grid search",  # Purpose of experiment
    "project_name": "MNIST",  # Project name of experiment
    "experiment_type": "hyperparameter-search",  # Type of experiment
    "experiment_dir": "experiments/logs",  # Experiment directory
    "num_total_jobs": 10,  # Number of total jobs to run
    # ... (further optional keys, see the list of keys below)
}
new_experiment_id = protocol_db.add(meta_data)

# ... train your 10 (pseudo) networks/complete respective jobs
for i in range(10):
    protocol_db.update_progress_bar(new_experiment_id)

# Wrap up an experiment (store completion time, etc.)
protocol_db.complete(new_experiment_id)
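
To make the idea of a pickle-backed protocol concrete, here is a minimal toy sketch. It is not how `MLEProtocol` is implemented: the class `ToyProtocol`, its methods, and the stored field names (`status`, `start_time`, `stop_time`) are all hypothetical and only illustrate how a small pickle file can answer "did I already run this experiment?".

```python
import pickle
import tempfile
from datetime import datetime
from pathlib import Path


class ToyProtocol:
    """Hypothetical stand-in for a pickle-backed experiment log (not MLEProtocol)."""

    def __init__(self, path):
        self.path = Path(path)
        # Load existing records if the file exists, otherwise start empty
        if self.path.exists():
            with self.path.open("rb") as f:
                self.records = pickle.load(f)
        else:
            self.records = {}

    def _save(self):
        # Persist all records after every change
        with self.path.open("wb") as f:
            pickle.dump(self.records, f)

    def add(self, meta_data):
        # Assign the next integer id and store a copy of the metadata
        experiment_id = len(self.records) + 1
        self.records[experiment_id] = {
            **meta_data,
            "status": "running",
            "start_time": datetime.now().isoformat(),
        }
        self._save()
        return experiment_id

    def complete(self, experiment_id):
        # Mark the experiment as finished and store the completion time
        self.records[experiment_id]["status"] = "completed"
        self.records[experiment_id]["stop_time"] = datetime.now().isoformat()
        self._save()

    def already_run(self, meta_data):
        # True if any stored record matches all given key/value pairs
        return any(
            all(rec.get(k) == v for k, v in meta_data.items())
            for rec in self.records.values()
        )


db_path = Path(tempfile.mkdtemp()) / "toy_protocol.db"
db = ToyProtocol(db_path)
exp_id = db.add({"purpose": "Grid search", "project_name": "MNIST"})
db.complete(exp_id)

# Reloading from disk recovers the stored experiment
reloaded = ToyProtocol(db_path)
print(reloaded.already_run({"project_name": "MNIST"}))
```

Rewriting the whole file on every change keeps the sketch simple; it is only reasonable for small logs like this one.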

The meta data can contain the following keys:

| Key | Description | Default |
| --- | --- | --- |
| `purpose` | Purpose of experiment | `'None provided'` |
| `project_name` | Project name of experiment | `'default'` |
| `exec_resource` | Resource jobs are run on | `'local'` |
| `experiment_dir` | Experiment log storage directory | `'experiments'` |
| `experiment_type` | Type of experiment to run | `'single'` |
| `base_fname` | Main code script to execute | `'main.py'` |
| `config_fname` | Config file path of experiment | `'base_config.yaml'` |
| `num_seeds` | Number of evaluation seeds | 1 |
| `num_total_jobs` | Number of total jobs to run | 1 |
| `num_job_batches` | Number of sequential job batches | 1 |
| `num_jobs_per_batch` | Number of jobs in a single batch | 1 |
| `time_per_job` | Expected duration per job (days:hours:minutes) | `'00:01:00'` |
| `num_cpus` | Number of CPUs used in job | 1 |
| `num_gpus` | Number of GPUs used in job | 0 |
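
The defaults listed above can be thought of as a base dictionary that your own `meta_data` overrides. The following is a small sketch of that merge under stated assumptions: `DEFAULT_META` and `fill_defaults` are hypothetical names for illustration, and whether `mle-monitor` fills defaults in exactly this way is not confirmed here.

```python
# Defaults copied from the list of keys above (the dictionary name is an assumption)
DEFAULT_META = {
    "purpose": "None provided",
    "project_name": "default",
    "exec_resource": "local",
    "experiment_dir": "experiments",
    "experiment_type": "single",
    "base_fname": "main.py",
    "config_fname": "base_config.yaml",
    "num_seeds": 1,
    "num_total_jobs": 1,
    "num_job_batches": 1,
    "num_jobs_per_batch": 1,
    "time_per_job": "00:01:00",
    "num_cpus": 1,
    "num_gpus": 0,
}


def fill_defaults(meta_data):
    """Return a copy of meta_data with missing keys filled from DEFAULT_META."""
    return {**DEFAULT_META, **meta_data}


meta_data = {"purpose": "Grid search", "project_name": "MNIST", "num_total_jobs": 10}
full_meta = fill_defaults(meta_data)

# Keys the user did not provide and that therefore fall back to their defaults
defaulted = sorted(set(DEFAULT_META) - set(meta_data))
print(full_meta["experiment_type"])
print(defaulted)
```

Listing the defaulted keys before adding an experiment is a cheap way to spot a forgotten entry such as `experiment_dir`.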

Additionally, you can synchronize the protocol with a Google Cloud Storage (GCS) bucket by providing `cloud_settings`. In this case, the results stored in `experiment_dir` are also uploaded to the GCS bucket when you call `protocol_db.complete()`.

# Define GCS settings - requires 'GOOGLE_APPLICATION_CREDENTIALS' env var.
cloud_settings = {
    "project_name": "mle-toolbox",  # GCP project name
    "bucket_name": "mle-protocol",  # GCS bucket name
    "use_protocol_sync": True,  # Whether to sync the protocol to GCS
    "use_results_storage": True,  # Whether to sync experiment_dir to GCS
}
protocol_db = MLEProtocol("mle_protocol.db", cloud_settings, verbose=True)
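
Since the GCS settings depend on the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, it can help to check for it before building `cloud_settings`. The sketch below is one way to do that; `build_cloud_settings` is a hypothetical helper that is not part of `mle-monitor`, and only the dictionary keys are taken from the example above.

```python
import os


def build_cloud_settings(project_name, bucket_name, sync_protocol=True, store_results=True):
    """Hypothetical helper: fail early if GCS credentials are not configured."""
    credentials_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not credentials_path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(credentials_path):
        raise FileNotFoundError(f"Credentials file not found: {credentials_path}")
    return {
        "project_name": project_name,  # GCP project name
        "bucket_name": bucket_name,  # GCS bucket name
        "use_protocol_sync": sync_protocol,  # Whether to sync the protocol to GCS
        "use_results_storage": store_results,  # Whether to sync experiment_dir to GCS
    }


try:
    cloud_settings = build_cloud_settings("mle-toolbox", "mle-protocol")
except (RuntimeError, FileNotFoundError) as err:
    # Without credentials, omit cloud_settings and keep the protocol local
    cloud_settings = None
    print(f"GCS sync disabled: {err}")
```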

The `MLEResource`: Keeping Track of Your Resources 📉

On Your Local Machine

from mle_monitor import MLEResource

# Instantiate local resource and get usage data
resource = MLEResource(resource_name="local")
resource_data = resource.monitor()

On a Slurm Cluster

resource = MLEResource(
    resource_name="slurm-cluster",
    monitor_config={"partitions": ["<partition-1>", "<partition-2>"]},
)

On a Grid Engine Cluster

resource = MLEResource(
    resource_name="sge-cluster",
    monitor_config={"queues": ["<queue-1>", "<queue-2>"]}
)
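
The three setups above differ only in the resource name and in the key used inside `monitor_config`. If you switch between machines often, a small wrapper can build the arguments for you. This is a sketch only: `resource_kwargs` and `CONFIG_KEY` are hypothetical names, and the sketch assumes the three resource names shown above are the ones you need.

```python
# Hypothetical mapping from resource name to the monitor_config key used above
CONFIG_KEY = {"slurm-cluster": "partitions", "sge-cluster": "queues"}


def resource_kwargs(resource_name, names=()):
    """Build keyword arguments for MLEResource (helper name is an assumption)."""
    if resource_name == "local":
        # The local example above passes no monitor_config
        return {"resource_name": "local"}
    if resource_name not in CONFIG_KEY:
        raise ValueError(f"Unknown resource name: {resource_name!r}")
    if not names:
        raise ValueError(f"{resource_name} needs at least one entry in {CONFIG_KEY[resource_name]!r}")
    return {
        "resource_name": resource_name,
        "monitor_config": {CONFIG_KEY[resource_name]: list(names)},
    }


print(resource_kwargs("slurm-cluster", ["<partition-1>", "<partition-2>"]))
```

The result can then be unpacked into the constructor, e.g. `MLEResource(**resource_kwargs("sge-cluster", ["<queue-1>"]))`.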

The `MLEDashboard`: Dashboard Visualization 🎞️

from mle_monitor import MLEDashboard

# Instantiate dashboard with the protocol and resource objects created above
dashboard = MLEDashboard(protocol_db, resource)

# Print a static snapshot of the protocol & resource utilisation to the console
dashboard.snapshot()

# Run the live dashboard (monitoring in a while loop)
dashboard.live()

Installation ⏳

A PyPI installation is available via:

pip install mle-monitor

Alternatively, you can clone this repository and afterwards 'manually' install it:

git clone https://github.com/mle-infrastructure/mle-monitor.git
cd mle-monitor
pip install -e .

Development & Milestones for Next Release

You can run the test suite via `python -m pytest -vv tests/`. If you find a bug or are missing your favourite feature, feel free to contact me @RobertTLange or create an issue 🤗.

Comments

Is the dashboard polling squeue?

Hey, Thanks for publishing the library, the dashboard looks great!

However, I was a bit concerned to see you are using squeue since the official documentation says

"Executing squeue sends a remote procedure call to slurmctld. If enough calls from squeue or other Slurm client commands that send remote procedure calls to the slurmctld daemon come in at once, it can result in a degradation of performance of the slurmctld daemon, possibly resulting in a denial of service.

Do not run squeue or other Slurm client commands that send remote procedure calls to slurmctld from loops in shell scripts or other programs. Ensure that programs limit calls to squeue to the minimum necessary for the information you are trying to gather."

Do you poll squeue or is there some other, smarter management of it that I missed?

Thanks, Eliahu

opened by eliahuhorwitz
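
One common way to limit calls to `squeue`, as the documentation quoted above asks, is to cache the command output and refresh it at most once per interval. The sketch below illustrates that pattern only; it is not confirmed to be how `mle-monitor` queries Slurm. `CachedCommand`, its parameters, and the 60-second interval are assumptions made for illustration.

```python
import subprocess
import time


class CachedCommand:
    """Run a command at most once per `min_interval` seconds and reuse its output."""

    def __init__(self, args, min_interval=60.0, runner=None, clock=time.monotonic):
        self.args = list(args)
        self.min_interval = min_interval
        self._runner = runner or self._run
        self._clock = clock
        self._last_time = None
        self._last_output = None

    @staticmethod
    def _run(args):
        # Execute the command and return its standard output as text
        return subprocess.run(args, capture_output=True, text=True, check=True).stdout

    def output(self):
        now = self._clock()
        # Only hit the scheduler again once the cached result is older than min_interval
        if self._last_time is None or now - self._last_time >= self.min_interval:
            self._last_output = self._runner(self.args)
            self._last_time = now
        return self._last_output


# Nothing is executed until output() is called; afterwards, repeated calls
# within 60 seconds reuse the cached text instead of contacting slurmctld.
squeue = CachedCommand(["squeue", "--noheader"], min_interval=60.0)
```

A dashboard refresh loop could call `squeue.output()` on every redraw while the scheduler is queried only once per interval.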

Releases (v0.0.1)

v0.0.1 (Dec 9, 2021)

Basic API for `MLEProtocol`, `MLEResource` & `MLEDashboard`.
