A library to generate synthetic time series data by easy-to-use factors and generator

Last update: Dec 20, 2022

Overview

timeseries-generator

This repository consists of a Python package that generates synthetic time series datasets in a generic way (under /timeseries_generator) and demo notebooks that show how to generate synthetic time series data (under /examples). The goal is to have non-sensitive data available for demonstrating solutions and testing the effectiveness of those solutions and/or algorithms. To test an algorithm, you want time series that contain different kinds of trends. The package helps create such time series while remaining maintainable.

`timeseries_generator` package

For this package, it is assumed that a time series is composed of a base value multiplied by many factors.

ts = base_value * factor1 * factor2 * ... * factorN + Noiser

These factors can be anything from random noise and linear trends to seasonality. The factors can affect different features. For example, some features in your time series may have a seasonal component, while others do not.
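The multiplicative model can be sketched without the package at all. The snippet below is a minimal illustration using only NumPy and pandas; the variable names, the two example factors, and the noise scale are assumptions chosen for the example and are not taken from the package.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs for illustration; none of these names come from the package.
dates = pd.date_range(start="2020-01-01", end="2020-01-14", freq="D")
base_value = 100.0
t = np.arange(len(dates))

# Factor 1: a linear trend that grows by 2% of the base per day.
linear_trend = 1.0 + 0.02 * t

# Factor 2: a weekly sinusoidal pattern with an amplitude of 10%.
weekly = 1.0 + 0.1 * np.sin(2 * np.pi * t / 7)

# Additive noise, drawn with a fixed seed so the run is repeatable.
rng = np.random.default_rng(seed=0)
noise = rng.normal(loc=0.0, scale=1.0, size=len(dates))

# ts = base_value * factor1 * factor2 + noise
total_factor = linear_trend * weekly
ts = pd.Series(base_value * total_factor + noise, index=dates, name="value")
print(ts.head())
```

Because the factors multiply, each one scales whatever the others have produced, while the noise term is added once at the end.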

Different factors are represented by different classes, which inherit from the BaseFactor class. Factor classes are the input for the Generator class, which creates a dataframe containing the features, the base value, all the factors acting on the base value, and the final factor and value.

Core concept

Generator: a Python class that generates the time series. A generator contains a list of factors and a noiser. By overlaying the factors and the noiser, the generator can produce a customized time series.
Factor: a Python class that generates trend, seasonality, holiday factors, etc. Factors take effect by multiplying the base value of the generator.
Noiser: a Python class that generates time series noise. A noiser takes effect by being summed on top of the "factorized" time series. The formula ts = base_value * factor1 * factor2 * ... * factorN + Noiser describes these concepts.

Built-in Factors

LinearTrend: gives a linear trend based on the input slope and intercept.
CountryYearlyTrend: gives a yearly market-cap factor based on GDP per capita.
EUEcoTrendComponents: gives a monthly changing factor based on public EU industry product data.
HolidayTrendComponents: simulates the holiday sales peak. It adapts the holiday dates to each country.
BlackFridaySaleComponents: simulates the Black Friday sale event.
WeekendTrendComponents: more sales at weekends than on weekdays.
FeatureRandFactorComponents: sets up different sales amounts for different stores and products.
ProductSeasonTrendComponents: simulates season-sensitive product sales. In this example code, we have 3 different types of product:
- winter jacket: inverse-proportional to the temperature, more sales in winter
- basketball top: proportional to the temperature, more sales in summer
- Yoga Mat: temperature insensitive
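To make the season-sensitive product idea concrete, here is a rough sketch in plain pandas. It does not use the package's own factor classes. The temperature curve, the sensitivity values, and the linear relationship are all assumptions for illustration (the package may model the inverse relationship differently).

```python
import numpy as np
import pandas as pd

# Hypothetical "temperature" curve scaled to [0, 1]: coldest on January 1,
# warmest around early July. Real factors may use actual climate data.
dates = pd.date_range(start="2021-01-01", end="2021-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
temperature = 0.5 - 0.5 * np.cos(2 * np.pi * (day_of_year - 1) / 365)

# Per-product sensitivity to temperature (illustrative values only):
# negative = sells better when cold, positive = sells better when warm.
sensitivity = {"winter jacket": -0.8, "basketball top": 0.8, "yoga mat": 0.0}

frames = []
for product, s in sensitivity.items():
    # Centre the temperature around 0 so the factor averages roughly 1.
    factor = 1.0 + s * (temperature - 0.5)
    frames.append(
        pd.DataFrame({"date": dates, "product": product, "season_factor": factor})
    )

season = pd.concat(frames, ignore_index=True)

# Compare mid-January against mid-July for each product.
picked = season[season["date"].isin(pd.to_datetime(["2021-01-15", "2021-07-15"]))]
table = picked.pivot(index="product", columns="date", values="season_factor")
print(table.round(2))
```

The resulting factor column could then be multiplied onto a base value per product, in the same way as any other factor.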

Installation

pip install timeseries-generator

Usage

from timeseries_generator import LinearTrend, Generator, WhiteNoise, RandomFeatureFactor
import pandas as pd

# setting up a linear trend
lt = LinearTrend(coef=2.0, offset=1., col_name="my_linear_trend")
g = Generator(factors={lt}, features=None, date_range=pd.date_range(start="01-01-2020", end="01-20-2020"))
g.generate()
g.plot()

# update by adding some white noise to the generator
wn = WhiteNoise(stdev_factor=0.05)
g.update_factor(wn)
g.generate()
g.plot()

Example Notebooks

We currently have 2 example notebooks available:

generate_stationary_process: Good for introducing the basics of the timeseries_generator. Shows how to apply simple linear trends and how to introduce features and labels, as well as random noise.
use_external_factors: Goes more into detail and shows how to use the external_factors submodule. Shows how to create seasonal trends.

Web based prototyping UI

We also use Streamlit to build a web-based UI that demonstrates how to use this package to generate synthetic time series data interactively.

streamlit run examples/streamlit/app.py

License

This package is released under the Apache License, Version 2.0

Comments

Time series data augmentation

Is there a code example that shows how to increase the amount of series data by adding slightly modified copies of existing time series, or by creating new synthetic series from existing data?

opened by YAYAYru 0

KeyError: 'country'

From the following code,

import pandas as pd
from timeseries_generator import HolidayFactor, LinearTrend, Generator

lt = LinearTrend(coef=2.0, offset=1., col_name="my_linear_trend")

g: Generator = Generator(factors={lt}, features=None, date_range=pd.date_range(start="01-01-2020", end="01-01-2021"))

holiday_factor = HolidayFactor(
    country_feature_name="country",
)
g.add_factor(holiday_factor)
g.generate()

I get the error below. I am not sure whether this is expected behavior.

File /usr/local/Caskroom/miniconda/base/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py:10083, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
...
-> 1849     raise KeyError(key)
   1851 # Check for duplicates
   1852 if values.ndim > 1:

KeyError: 'country'

opened by twobitunicorn 0
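The traceback ends inside pandas' DataFrame.merge, which raises a KeyError when a merge key is missing from one of the frames. The sketch below reproduces that pandas behavior in isolation. It assumes, without confirmation from the library source, that the generator's frame has no "country" column when features=None; the frames and values are hypothetical.

```python
import pandas as pd

# Left frame: what a generator might build when no features are supplied
# (an assumption for illustration: only a date column, no "country").
left = pd.DataFrame({"date": pd.date_range("2020-01-01", periods=3, freq="D")})

# Right frame: a per-country holiday factor table (hypothetical values).
right = pd.DataFrame(
    {
        "date": pd.date_range("2020-01-01", periods=3, freq="D"),
        "country": ["Netherlands"] * 3,
        "holiday_factor": [2.0, 1.0, 1.0],
    }
)

error_key = None
try:
    # "country" is not a column of `left`, so pandas cannot resolve the key.
    left.merge(right, on=["date", "country"], how="left")
except KeyError as err:
    error_key = err.args[0]
    print(f"KeyError: {error_key!r}")

# Adding the missing feature column to the left frame lets the merge succeed.
left_with_country = left.assign(country="Netherlands")
merged = left_with_country.merge(right, on=["date", "country"], how="left")
print(merged)
```

If that assumption holds, supplying a "country" feature to the generator before adding the holiday factor would likely avoid the error.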

[Feature request] Customizable feature combinations

Hi team, Thanks for the useful library! I wonder if you'd be open to this idea:

I would like to be able to:

Set up categorizing features (let's say, for illustration, CATEGORY=[footwear, t-shirts, socks], SIZE=[S, M, L, US-Mens-8, US-Womens-6]) and define Factors on them

Generate time-series with more restricted feature combinations than the outer product (again for illustration, "t-shirt sizes for t-shirts, shoe sizes for footwear")

Today, it seems like Generator.generate() hard-codes the assumption that time-series should be generated for the product of all provided feature values.

It'd be helpful if, instead, we could have the option of customizing this join to limit down generated combinations?

Some options I can think of:

Leave the library as-is: Users generate full outer product and limit down what they want in post-processing

This seems possible already, but very RAM-intensive if your desired combinations are sparse?

Accept an optional dataframe of factor combinations as parameter to the generate() method

Gives full flexibility over which combinations are kept / ignored, without assuming any particular rigid hierarchies between features

...But might need to do a bit of validation to protect against user errors? May not be super easy to use without some documented examples / functions to generate the dataframe

Some more complex API for feature configuration that accommodates specifying valid/invalid feature combinations

Might be nicer for usability, but difficult to make general: E.g. a straightforward hierarchy could be represented as a nested dict, but in practice many applications have multiple intersecting views of product category information e.g. brand, type, target segment, etc.
opened by athewsey 1
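The difference between the full product of feature values and a restricted set of combinations can be sketched in plain pandas. This is not an existing option of generate(). The table of allowed combinations (including which sizes apply to socks) is hypothetical and only illustrates the "optional dataframe of combinations" idea from the request.

```python
from itertools import product

import pandas as pd

# Illustrative feature values taken from the request above.
features = {
    "category": ["footwear", "t-shirts", "socks"],
    "size": ["S", "M", "L", "US-Mens-8", "US-Womens-6"],
}

# Full outer product of all feature values: 3 * 5 = 15 rows.
full = pd.DataFrame(list(product(*features.values())), columns=list(features))

# Hypothetical table of the combinations that should actually be generated.
allowed = pd.DataFrame(
    [
        ("footwear", "US-Mens-8"),
        ("footwear", "US-Womens-6"),
        ("t-shirts", "S"),
        ("t-shirts", "M"),
        ("t-shirts", "L"),
        ("socks", "S"),
        ("socks", "M"),
        ("socks", "L"),
    ],
    columns=["category", "size"],
)

# An inner merge keeps only the rows present in both frames.
restricted = full.merge(allowed, on=list(features), how="inner")
print(len(full), len(restricted))
```

Filtering after generation works the same way, but as the request notes, the full product has to be materialized first, which becomes costly when the valid combinations are sparse.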
Generate hourly data

First of all, thank you for making this repository public! I enjoy its ease of use and the built-in factors.

Problem description

I'm currently trying to generate revenue data for a bar/restaurant on an hourly basis. As far as I can see, the timeseries-generator only supports generating one data point per day, not per hour.

I tried to generate hourly data like g = Generator(factors={lt}, features=None, date_range=pd.date_range(start='15/9/2021', end='30/9/2021', freq='h')) which didn't work.

Potential solution

Add the possibility to generate hourly data too. If this is a promising idea in your opinion, I'm willing to contribute to the implementation.

Thank you in advance!

opened by nileger 1
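One possible workaround, sketched here in plain pandas rather than with the package, is to generate daily values first and then spread each day over its hours with an intraday profile, which is just another multiplicative factor. The daily numbers and the opening-hours profile below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Daily values, standing in for the output of a daily generator
# (hypothetical numbers following a simple linear trend).
days = pd.date_range(start="2021-09-15", end="2021-09-30", freq="D")
daily = pd.Series(1000.0 + 10.0 * np.arange(len(days)), index=days)

# Hypothetical intraday profile for a bar: share of daily revenue per hour.
# Closed until 16:00, busiest late in the evening. Shares sum to 1.
weights = np.zeros(24)
weights[16:24] = [1, 2, 3, 4, 5, 6, 5, 4]
profile = weights / weights.sum()

# Hourly index covering every hour of every day in the range.
hours = pd.date_range(start=days[0], periods=len(days) * 24, freq="h")

# Repeat each daily value 24 times and scale it by the hour-of-day share.
hourly = pd.Series(
    np.repeat(daily.to_numpy(), 24) * np.tile(profile, len(days)),
    index=hours,
    name="revenue",
)

# Summing the hourly values per day recovers the daily series.
daily_check = hourly.resample("D").sum()
print(daily_check.head())
```

This keeps the daily factors untouched and only adds one extra factor for the hour of day.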

Releases(v0.1.0)

v0.1.0(Jul 20, 2021)
first release of time series generators, including:

base factor

linear trend factor

sinusoidal factor

white noise factor

random factor

holiday factor

weekday factor

country GDP factor

EU industry index factor

Examples

notebooks that include some simple examples

streamlit dashboard

Owner

Nike Inc.
