Formulae is a Python library that implements Wilkinson's formulas for mixed-effects models.

Last update: Dec 21, 2022

Overview

formulae

formulae is a Python library that implements Wilkinson's formulas for mixed-effects models. The main difference with other implementations like Patsy or formulaic is that formulae can work with formulas describing a model with both common and group specific effects (a.k.a. fixed and random effects, respectively).

This package has been written to make it easier to specify models with group effects in Bambi, a package that makes it easy to work with Bayesian GLMMs in Python, but it could be used independently as a backend for another library. The approach in this library is to extend classical statistical formulas in a similar way than in R package lme4.

Installation

formulae requires a working Python interpreter (3.7+) and the libraries numpy, scipy and pandas with versions specified in the requirements.txt file.

Assuming a standard Python environment is installed on your machine (including pip), the latest release of formulae can be installed in one line using pip:

pip install formulae

Alternatively, if you want the development version of the package you can install from GitHub:

pip install git+https://github.com/bambinos/formulae.git

Documentation

The official documentation can be found here

Notes

The data argument only accepts objects of class pandas.DataFrame.
y ~ . is not implemented and won't be implemented in a first version. However, it is planned to be included in the future.

Comments

Feature Request: Splines
Hi all. I would like to start with my appreciation for your work on Bambi and formulae! I think both of these are very valuable additions to the scientific python stack. I saw a thread on twitter that made me think formulae might become the new patsy, so I thought I would describe an affordance in patsy that would be really cool to add to formulae: splines: https://patsy.readthedocs.io/en/latest/spline-regression.html

I would love to be able to include bs(x) in one line, e.g.

model = bmb.Model("test_positive ~ vaccinated + bs(age,df=3) + (vaccinated+vaccinated:day|state)", df, family="bernoulli")

Since this is already implemented in patsy maybe it will not be too hard to add here, as well! Thanks again for all of your work!
opened by aflaxman 6

model definition

I would like to fit a fully specified model. In lme I would write this as:

f = "choice ~ 1 + stimulus*reward + (1+stimulus*reward|subject_id)"`

In formulae this gives the following error:

import formulae as fm
fm.model_description(f)

    891                 return NotImplemented
    892         else:
--> 893             raise ValueError("LHS of group specific term cannot have more than one term.")
    894 
    895     def __repr__(self):

The following does seem to work:

f = "choice ~ 1 + stimulus*reward + (1|subject_id) + (stimulus|subject_id) + (reward|subject_id) + (stimulus:reward|subject_id)"
fm.model_description(f)

Is this the syntax for a fully specified model in formulae? Output:

ModelTerms(
  ResponseTerm(
    Term(
      name= choice
      variable= choice
      kind= None
    )
  ),
  InterceptTerm(),
  Term(
    name= stimulus
    variable= stimulus
    kind= None
  ),
  Term(
    name= reward
    variable= reward
    kind= None
  ),
  InteractionTerm(
    name= stimulus:reward
    variables= {'stimulus', 'reward'}
  ),
  GroupSpecTerm(
    expr= InterceptTerm(),
    factor= Term(
      name= subject_id
      variable= subject_id
      kind= None
    )
  ),
  GroupSpecTerm(
    expr= Term(
      name= stimulus
      variable= stimulus
      kind= None
    ),
    factor= Term(
      name= subject_id
      variable= subject_id
      kind= None
    )
  ),
  GroupSpecTerm(
    expr= Term(
      name= reward
      variable= reward
      kind= None
    ),
    factor= Term(
      name= subject_id
      variable= subject_id
      kind= None
    )
  ),
  GroupSpecTerm(
    expr= InteractionTerm(
      name= stimulus:reward
      variables= {'stimulus', 'reward'}
    ),
    factor= Term(
      name= subject_id
      variable= subject_id
      kind= None
    )
  )
)

opened by jwdegee 6

Correlated Random Effects

Hi.

In Ch3 of the lme4 book manuscript, Bates showed two cases of random effects, one is correlated and the other is uncorrelated (independent):

The correlated one is from formula Reaction ~ 1 + Days + (1 + Days | Subject), the uncorrelated on is from Reaction ~ 1 + Days + (1|Subject) + (0 + Days|Subject).

I tried the same formulae with formulae, the two design matrices are the same:

from formulae import design_matrices
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data = pd.read_csv('../datasets/sleepstudy.csv')

f1 = 'Reaction ~ 1 + Days + (1 + Days | Subject)'
f2 = 'Reaction ~ 1 + Days + (1|Subject) + (0 + Days|Subject)'

fig, ax = plt.subplots(2, 1, figsize=(12, 6))
for i, formula in enumerate([f1, f2]):
    
    dm = design_matrices(formula, data)
    terms = dm.group.terms_info.keys()

    Zs = [dm.group[key] for key in dm.group.terms_info.keys()]
    Z = np.hstack(Zs)
    Z[Z ==0] = np.nan
    ax[i].imshow(Z.T)
    ax[i].set_title(formula)

Is this an intended behavior for formulae or is it a bug?

opened by huangziwei 4

create MANIFEST.in

The source tarball is broken b/c a required file in setup.py is not shipped with the source. This PR adds a MANIFEST.in file to ensure the necessary files will be packaged.

See https://github.com/conda-forge/staged-recipes/pull/14193 for an example of the problem with the current tarball.

opened by ocefpaf 4
Update documentation
This PR attempts to improve existing online documentation. Many classes, methods, and functions are currently documented, but this documentation is not shown on the webpage. The general aim is to have three broad sections.

Examples: A collection of examples using formulae to obtain design matrices and fit statistical models.

API Reference: The documentation for the objects that we expect the users to interact with.

Internals: A documentation made for people wanting to use formulae for their own libraries.

These tasks may be split into several PRs. The goal of this PR is to improve the Internals reference. Examples will come later.
opened by tomicapretto 3
Adding support for survival functions #79

This PR adds a transformation for survival type responses. It is similar to R's Surv.

Surv(time, status) ~ returns a 2xn matrix with the time and the event status. These can then be used in the model to handle censored and uncensored data differently.

#79

opened by ipa 2
Add contrasts
Add general contrasts so we can use other encodings apart from reference encoding.

Stop reshaping 1d arrays to column vectors. It's not really necessary.
opened by tomicapretto 2

Add basis spline stateful transformation

Adds an equivalent of the bs() function in the splines package in R. This is a stateful transformation by the way (it remember knots, bounds, and whether it has intercept or not).

Using the engine dataset from the R package gamair, we have:

import pandas as pd
from formulae import  design_matrices

data = pd.DataFrame({
    "size": [
        2.78, 2.43, 2.32, 2.43, 2.98, 2.32, 2.32, 2.32, 2.32, 2.13, 
        2.13, 2.13, 2.98, 1.99, 1.99, 1.99, 1.78, 1.58, 1.42
    ],
    "wear": [
        3, 3.1, 4.8, 3.3, 2.8, 2.9, 3.8, 3, 2.7, 3.2, 2.4, 2.6, 1.7,
        2.6, 2.8, 2.4, 2.5, 4.2, 4
    ]
})
design_matrices("wear ~ bs(size, 4)", data).common.design_matrix

array([[1.        , 0.00498077, 0.09481583, 0.56163869, 0.33856471],
       [1.        , 0.10358454, 0.429405  , 0.46238083, 0.00462963],
       [1.        , 0.17899408, 0.48816568, 0.33284024, 0.        ],
       [1.        , 0.10358454, 0.429405  , 0.46238083, 0.00462963],
       [1.        , 0.        , 0.        , 0.        , 1.        ],
       [1.        , 0.17899408, 0.48816568, 0.33284024, 0.        ],
       [1.        , 0.17899408, 0.48816568, 0.33284024, 0.        ],
       [1.        , 0.17899408, 0.48816568, 0.33284024, 0.        ],
       [1.        , 0.17899408, 0.48816568, 0.33284024, 0.        ],
       [1.        , 0.36011331, 0.46706614, 0.16341177, 0.        ],
       [1.        , 0.36011331, 0.46706614, 0.16341177, 0.        ],
       [1.        , 0.36011331, 0.46706614, 0.16341177, 0.        ],
       [1.        , 0.        , 0.        , 0.        , 1.        ],
       [1.        , 0.48758651, 0.37856345, 0.08455375, 0.        ],
       [1.        , 0.48758651, 0.37856345, 0.08455375, 0.        ],
       [1.        , 0.48758651, 0.37856345, 0.08455375, 0.        ],
       [1.        , 0.56530178, 0.19739645, 0.02130178, 0.        ],
       [1.        , 0.39454797, 0.04771909, 0.00187011, 0.        ],
       [1.        , 0.        , 0.        , 0.        , 0.        ]])

EDIT

By mistake, I ended up including here commits related to other topics. Now, it is also possible to pass response levels with index notation without having to use an explicit string. This doesn't work when the level contains spaces (must use a string there) or when the level is a number (use binary() function there). Examples:

This works:

"y['A'] ~ x1 + x2"
"y[A] ~ x1 + x2"
"y['My level contains spaces'] ~ x1 + x2"
"binary(y, 2)" ~ x1 + x2"

This does not work

"y[My level contains spaces] ~ x1 + x2"
"y[1] ~ x1 + x2"

opened by tomicapretto 2

Factor of group specific effect can be an interaction now

Originally, the group of a group-specific term could only be a single term, not an interaction. This PR aims to enable groups to be made of interactions. It looks like so far it is working, but there are some edge cases with the evaluation with new data that should be addressed.

This feature (problem) was requested (reported) here.

opened by tomicapretto 2
Nested stateful transformations don't work
We fail at parsing and resolving terms that are nested function calls. All the following alternatives work without any problem

design_matrices("y ~ f(x1)", data) design_matrices("y ~ f(arg = x1)", data) design_matrices("y ~ f(arg = 'blabla')", data) design_matrices("y ~ f(arg1= a, arg2 = "blabla")", data)

But nested function calls, like the following, don't work

design_matrices("y ~ f(g(x1))", data) design_matrices("y ~ f(arg = g(x1))", data) # ...etc

To solve this we need to update the Resolver.visitCallExpr() method as well as how the call term handles Expr.Call() instances, and maybe something else that I'm missing now.
bug
opened by tomicapretto 2
Parser generators

It looks like you're doing a lot of low-level parser generator work by hand. Have you considered using an existing pure Python parser package like rply? We've had a lot of success with that package in hy.

opened by brandonwillard 2

Enable deepcopy of bmb.Model

Hello !

I tried to deepcopy a bmb.Model and find out that we face a cannot pickle '_thread.lock' object. After investigation, this is due to saving of thread locked objects within formulae.Environment objects.

Indeed many, locked variables are fetched while calling Environment.capture(). My dummy workaround was to just replace the return cls([frame.f_locals, frame.f_globals])

        if isinstance(env, cls):
            return env
        elif isinstance(env, numbers.Integral):
            depth = env + reference
        else:
            raise TypeError("'env' must be either an integer or an instance of Environment.")
        frame = inspect.currentframe()
        try:
            for _ in range(depth + 1):
                if frame is None:
                    raise ValueError("call-stack is not that deep!")
                frame = frame.f_back
            # return cls([frame.f_locals, frame.f_globals])
            return cls([])
        finally:
            del frame

I'm not experienced enough on Bambi formulae, a good workaround would be to keep only what is necessary for formulae to work in output of the capture() function

I hope it would be helpful

opened by lt-brs 1

Support unknown test time group categories

I just tried a model with a different train/test split for the fit() and predict() functionality and hit the following error in the predict() call in bambi and hit the following error:

https://github.com/bambinos/formulae/blob/b926883a4ab744367953095a8540c8bf64ec2449/formulae/terms/variable.py#L223-L226

For a lot of real world use cases it's quite common for unknown categories to show up at inference time (e.g. new users, new customers etc) and it would be great to try and handle this natively in the predict() call such that a 'default' prediction can be made based on the mean across all groups rather than a specific group.

opened by markgoodhead 6
Version 0.3.3 crashes kernel with large dataset
Hi all,

Thank you for the work on this awesome package!

I am using bambi to fit a fairly complex model on a large dataset with a single group effect, which has many individuals. There are around 183,000 rows and 30,542 unique groups. Version 0.3.3 crashes Jupyter reliably when instantiating this design matrix. Interestingly, version 0.2.0 will instantiate it, with the caveat I can't include an interaction term I can in 0.3.3 within bambi (see bambi issue 495).

I've tried a few different approaches, and have found it will instantiate with about 25-50% of the data (using DataFrame.sample(frac=.25)), so it seems more an issue of sheer scale than anything else. I've also tried with Spyder, getting the same issue.

The code below will grab a modified dataset of the same size and structure I am using and set up the model design, which kills my kernel after a minute or two.

import pandas as pd from formulae import design_matrices trouble = pd.read_feather('https://osf.io/kw2xh/download') md = design_matrices('response ~ bs(time, degree=2, df=10) * state + state * level * trait + (1|pid)', data=trouble)

0.2.0 will return this after a little while, however. Any help is greatly appreciated, hopefully this issue isn't localised to my machine!
opened by alexjonesphd 7
Improve tests

Tests are quite complete, but they are messy and redundant as well. Also, there are a couple of things we're missing yet (see https://app.codecov.io/gh/bambinos/formulae/blob/master/formulae/terms/terms.py).
enhancement

opened by tomicapretto 0

Releases(v0.3.4)

v0.3.4(May 18, 2022)
Release done because we'll have a new Bambi release.

Fixed a bug in the levels of interaction terms involving numeric terms with multiple columns (b4a1f73)

Source code(tar.gz)
Source code(zip)
v0.3.3(Apr 14, 2022)

Fix order issues with categorical variables and custom encodings.
Source code(tar.gz)
Source code(zip)
v0.3.2(Apr 13, 2022)

Fix bug with categorical pandas series.
Source code(tar.gz)
Source code(zip)
v0.3.1(Apr 12, 2022)
Maintenance and fixes

Renamed ResponseVector to ResponseMatrix (#71)

Renamed design_vector to design_matrix in ResponseMatrix(#71)

Updated docstrings in formulae/matrices.py (#71)

Deprecation

Removed binary and success attributes from ResponseMatrix (#71)

Source code(tar.gz)
Source code(zip)
v0.3.0(Mar 11, 2022)
Changes

New features

We can create our own encodings such as Treatment or Sum encoding. These are subclasses of Encoding.

Added two aliases T and S that are shorthands of C(var, Treatment) and C(var, Sum) respectively.

DesignVector, CommonEffectsMatrix and GroupEffectsMatrix now retrieve their values when passed to np.array() and np.asarray().

Add poly() stateful transform.

na_action in design_matrices() can be "pass" which means not to drop or raise error about missing values. It just keeps them (#69)

design_matrices() gains a new argument extra_namespace. It allows us to pass a dictionary of transformations that are made available in the environment where the formula is evaluated (#70)

Maintenance and fixes

Fixed a bug in the addition of lower order terms when the original higher order term wasn't full-rank.

Columns for categorical terms are integer by default. They are casted to float only if there are other float-valued columns in the design matrix.

Updated str and repr methods of ResponseVector, CommonEffectsMatrix, and GroupEffectsMatrix.

Added str and repr methods for DesignMatrices.

Added get_item method for DesignMatrices.

Added support for comparison operators within function calls.

Source code(tar.gz)
Source code(zip)
v0.2.0(Sep 15, 2021)
Changes:

We have splines via bs().

We don't use Python's eval() anymore.

Minor bug fixes and internal changes.

Source code(tar.gz)
Source code(zip)
v0.1.4(Aug 9, 2021)

Release needed for Bambi 0.6.0
Source code(tar.gz)
Source code(zip)
v0.1.3(Aug 2, 2021)

Revert changes. This is the same as v0.1.1. The last release (v0.1.2) has some breaking changes that haven't been updated in Bambi yet.
Source code(tar.gz)
Source code(zip)
v0.1.2(Jul 30, 2021)

This release includes changes that will allow us to have binomial and multinomial families and offset terms.
Source code(tar.gz)
Source code(zip)
v0.1.1(Jun 18, 2021)

Fixes the shape of the group-specific effects matrix.
Source code(tar.gz)
Source code(zip)
v0.1.0(Jun 12, 2021)

Making a new release. I'm using 0.1.0 because, as far as I'm aware, there are no hidden bugs now. Also, I need this new release to merge a PR in Bambi that does fix tricky bugs.
Source code(tar.gz)
Source code(zip)
v0.0.10(May 10, 2021)

First release on Github. This release also tests if GitHub actions properly uploads the package to PyPI.
Source code(tar.gz)
Source code(zip)

Owner

GitHub Repository https://bambinos.github.io/formulae/

Bayesian Additive Regression Trees For Python

BartPy Introduction BartPy is a pure python implementation of the Bayesian additive regressions trees model of Chipman et al [1]. Reasons to use BART

187 Dec 16, 2022

ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions

A library for debugging/inspecting machine learning classifiers and explaining their predictions

154 Dec 17, 2022

Repositório para o #alurachallengedatascience1

1° Challenge de Dados - Alura A Alura Voz é uma empresa de telecomunicação que nos contratou para atuar como cientistas de dados na equipe de vendas.

16 Nov 10, 2022

Decision Weights in Prospect Theory

Decision Weights in Prospect Theory It's clear that humans are irrational, but how irrational are they? After some research into behavourial economics

32 Nov 08, 2021

Dual Adaptive Sampling for Machine Learning Interatomic potential.

DAS Dual Adaptive Sampling for Machine Learning Interatomic potential. How to cite If you use this code in your research, please cite this using: Hong

6 Jul 06, 2022

We have a dataset of user performances. The project is to develop a machine learning model that will predict the salaries of baseball players.

Salary-Prediction-with-Machine-Learning 1. Business Problem Can a machine learning project be implemented to estimate the salaries of baseball players

9 Oct 14, 2022

SmartSim makes it easier to use common Machine Learning (ML) libraries like PyTorch and TensorFlow

SmartSim makes it easier to use common Machine Learning (ML) libraries like PyTorch and TensorFlow, in High Performance Computing (HPC) simulations and workloads.

139 Jan 01, 2023

Conducted ANOVA and Logistic regression analysis using matplot library to visualize the result.

Intro-to-Data-Science Conducted ANOVA and Logistic regression analysis. Project ANOVA The main aim of this project is to perform One-Way ANOVA analysi

1 Feb 06, 2022

mlpack: a scalable C++ machine learning library --

4.2k Jan 01, 2023

Machine Learning for Time-Series with Python.Published by Packt

Machine-Learning-for-Time-Series-with-Python Become proficient in deriving insights from time-series data and analyzing a model’s performance Links Am

124 Dec 28, 2022

Tutorials, examples, collections, and everything else that falls into the categories: pattern classification, machine learning, and data mining

**Tutorials, examples, collections, and everything else that falls into the categories: pattern classification, machine learning, and data mining.** S

4k Dec 30, 2022

Projeto: Machine Learning: Linguagens de Programacao 2004-2001

Projeto: Machine Learning: Linguagens de Programacao 2004-2001 Projeto de Data Science e Machine Learning de análise de linguagens de programação de 2

0 Jun 29, 2021

Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices

Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices. It bridges the gap between any machine learning models you just trained and t

164 Jan 04, 2023

Management of exclusive GPU access for distributed machine learning workloads

TensorHive is an open source tool for managing computing resources used by multiple users across distributed hosts. It focuses on granting

131 Dec 12, 2022

A Python Module That Uses ANN To Predict A Stocks Price And Also Provides Accurate Technical Analysis With Many High Potential Implementations!

Stox A Module to predict the "close price" for the next day and give "technical analysis". It uses a Neural Network and the LSTM algorithm to predict

31 Dec 16, 2022

Automatically build ARIMA, SARIMAX, VAR, FB Prophet and XGBoost Models on Time Series data sets with a Single Line of Code. Now updated with Dask to handle millions of rows.

Auto_TS: Auto_TimeSeries Automatically build multiple Time Series models using a Single Line of Code. Now updated with Dask. Auto_timeseries is a comp

519 Jan 03, 2023

A simple and lightweight genetic algorithm for optimization of any machine learning model

geneticml This package contains a simple and lightweight genetic algorithm for optimization of any machine learning model. Installation Use pip to ins

8 Aug 10, 2022

A classification model capable of accurately predicting the price of secondhand cars

The purpose of this project is create a classification model capable of accurately predicting the price of secondhand cars. The data used for model building is open source and has been added to this

2 Sep 13, 2022

Contains an implementation (sklearn API) of the algorithm proposed in "GENDIS: GEnetic DIscovery of Shapelets" and code to reproduce all experiments.

GENDIS GENetic DIscovery of Shapelets In the time series classification domain, shapelets are small subseries that are discriminative for a certain cl

90 Oct 28, 2022

A linear equation solver using gaussian elimination. Implemented for fun and learning/teaching.

A linear equation solver using gaussian elimination. Implemented for fun and learning/teaching. The solver will solve equations of the type: A can be

3 Feb 15, 2022

Formulae is a Python library that implements Wilkinson's formulas for mixed-effects models.

Related tags

Overview

formulae

Installation

Documentation

Notes

Comments

Releases(v0.3.4)

v0.3.4(May 18, 2022)

v0.3.3(Apr 14, 2022)

v0.3.2(Apr 13, 2022)

v0.3.1(Apr 12, 2022)

Maintenance and fixes

Deprecation

v0.3.0(Mar 11, 2022)

Changes

New features