lightweight python wrapper for vowpal wabbit

Overview

vowpal_porpoise

Lightweight python wrapper for vowpal_wabbit.

Why: Scalable, blazingly fast machine learning.

Install

  1. Install vowpal_wabbit: clone the repository and run make.
  2. Install cython: pip install cython
  3. Clone vowpal_porpoise.
  4. Run: python setup.py install

Now you can do: import vowpal_porpoise from python.
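
Because the wrapper runs the vw binary as a subprocess, it can help to confirm that the binary is reachable before importing anything. The following is a minimal sketch, assuming Python 3.3+ and that the executable is named vw; the helper name find_vw_binary is hypothetical and not part of vowpal_porpoise.

```python
import shutil


def find_vw_binary(name="vw"):
    """Return the full path of the named executable, or None if it is not on PATH.

    The default name "vw" is an assumption; pass another name if your build
    installs the binary differently.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_vw_binary()
    if path is None:
        print("vw was not found on PATH; build vowpal_wabbit and add it to PATH")
    else:
        print("found vw at", path)
```

If this reports that vw is missing, starting a training or prediction block is likely to fail when the subprocess is launched.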

Examples

Standard Interface

Linear regression with l1 penalty:

from vowpal_porpoise import VW

# Initialize the model
vw = VW(moniker='test',    # a name for the model
        passes=10,         # vw arg: passes
        loss='quadratic',  # vw arg: loss
        learning_rate=10,  # vw arg: learning_rate
        l1=0.01)           # vw arg: l1

# Inside the with training() block a vw process will be 
# open to communication
with vw.training():
    for instance in ['1 |big red square',
                     '0 |small blue circle']:
        vw.push_instance(instance)

    # here stdin will close
# here the vw process will have finished

# Inside the with predicting() block we can stream instances and 
# acquire their labels
with vw.predicting():
    for instance in ['1 |large burnt sienna rhombus',
                     '0 |little teal oval']:
        vw.push_instance(instance)

# Read the predictions like this:
predictions = list(vw.read_predictions_())
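
The instances pushed above are plain strings in vw's text input format. As a rough sketch, a small helper can build such strings from a label and a list of feature tokens. The function to_vw_line is hypothetical and not part of vowpal_porpoise. Note that in vw's format a word written directly after the bar with no space is read as a namespace name, so this sketch leaves a space after the bar unless a namespace is given.

```python
def to_vw_line(label, features, namespace=""):
    """Build one line of vw text input: '<label> |<namespace> <feature> ...'.

    Hypothetical helper for illustration. Feature tokens containing
    whitespace, '|' or ':' are rejected because those characters have
    special meaning in the vw input format.
    """
    for token in features:
        if any(ch in token for ch in " |:\t\n"):
            raise ValueError("unsupported character in feature %r" % token)
    return "{} |{} {}".format(label, namespace, " ".join(features))


if __name__ == "__main__":
    print(to_vw_line(1, ["big", "red", "square"]))
    print(to_vw_line(0, ["small", "blue", "circle"], namespace="shape"))
```

The resulting strings can be passed to vw.push_instance inside a training() or predicting() block.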

L-BFGS with a rank-5 approximation:

from vowpal_porpoise import VW

# Initialize the model
vw = VW(moniker='test_lbfgs', # a name for the model
        passes=10,            # vw arg: passes
        lbfgs=True,           # turn on lbfgs
        mem=5)                # lbfgs rank

Latent Dirichlet Allocation with 100 topics:

from vowpal_porpoise import VW

# Initialize the model
vw = VW(moniker='test_lda',  # a name for the model
        passes=10,           # vw arg: passes
        lda=100,             # turn on lda
        minibatch=100)       # set the minibatch size

Scikit-learn Interface

vowpal_porpoise also ships with an interface into scikit-learn, which allows awesome experiment-level stuff like cross-validation:

from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score
from vowpal_porpoise.sklearn import VW_Classifier

# X_train, y_train and parameters (the grid to search) are defined
# elsewhere; see example_sklearn.py
GridSearchCV(
        VW_Classifier(loss='logistic', moniker='example_sklearn',
                      passes=10, silent=True, learning_rate=10),
        param_grid=parameters,
        score_func=f1_score,
        cv=StratifiedKFold(y_train),
).fit(X_train, y_train)

Check out example_sklearn.py for more details.
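
The grid search above takes a parameter grid, which maps argument names to the values to try. The sketch below shows, in plain Python, how such a grid expands into individual settings. The names l2 and learning_rate and their values are assumptions for illustration (check example_sklearn.py for the arguments VW_Classifier actually accepts), and expand_grid is a hypothetical helper, not scikit-learn's implementation.

```python
from itertools import product

# Hypothetical grid: argument names and values are illustrative only.
parameters = {
    "l2": [1e-5, 1e-4],
    "learning_rate": [1, 10],
}


def expand_grid(grid):
    """Yield one dict per combination of values in the grid."""
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    for setting in expand_grid(parameters):
        print(setting)
```

Each setting is trained and scored once per cross-validation fold, so the total number of vw runs grows with the product of the list lengths.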

Library Interface (DISABLED as of 2013-08-12)

Via the VW interface:

with vw.predicting_library():
    for instance in ['1 |large burnt sienna rhombus',
                     '1 |little teal oval']:
        prediction = vw.push_instance(instance)

Now the predictions are returned directly to the parent process, rather than having to read from disk. See examples/example1.py for more details.

Alternatively you can use the raw library interface:

import vw_c
vw = vw_c.VW("--loss=quadratic --l1=0.01 -f model")
vw.learn("1 |this is a positive example")
vw.learn("0 |this is a negative example")
vw.finish()

The raw library interface currently does not support passes due to some limitations in the underlying vw C code.

Need more examples?

  • example1.py: SimpleModel class wrapper around VP (both standard and library flavors)
  • example_library.py: Demonstrates the low-level vw library wrapper, classifying lines of Alice in Wonderland vs. Through the Looking-Glass.

Why

vowpal_wabbit is insanely fast and scalable. vowpal_porpoise is slower, but only during the initial training pass. Once the data has been properly cached it will idle while vowpal_wabbit does all the heavy lifting. Furthermore, vowpal_porpoise was designed to be lightweight and not to get in the way of vowpal_wabbit's scalability, e.g. it allows distributed learning via --nodes and does not require data to be batched in memory. In our research work we use vowpal_porpoise on an 80-node cluster running over multiple terabytes of data.
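
To illustrate the point about not batching data in memory, here is a minimal sketch that streams instances one at a time from a gzipped text file. The helpers stream_instances and push_all are hypothetical, and the file is assumed to hold one vw-format instance per line.

```python
import gzip


def stream_instances(path):
    """Lazily yield non-empty lines from a gzipped text file, one at a time."""
    with gzip.open(path, "rt", encoding="utf8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line


def push_all(path, push):
    """Call push(instance) for every instance in the file; return the count.

    `push` is any callable taking one string, for example vw.push_instance.
    """
    count = 0
    for instance in stream_instances(path):
        push(instance)
        count += 1
    return count
```

Inside a `with vw.training():` block this could be called as `push_all('train.vw.gz', vw.push_instance)`, where the file name is a placeholder. Only one line is held in memory at a time.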

The main benefit of vowpal_porpoise is allowing rapid prototyping of new models and feature extractors. We found that we had been doing this in an ad-hoc way using python scripts to shuffle around massive gzipped text files, so we just closed the loop and made vowpal_wabbit a python library.

How it works

vowpal_porpoise wraps the vw binary in a subprocess, using stdin to push data and temporary files to pull predictions. Why not use the prediction labels vw provides on stdout? It turns out that the python GIL basically makes streaming in and out of a process (even asynchronously) painfully difficult. If you know of a clever way to get around this, please email me. In other languages (e.g. in a forthcoming scala wrapper) this is not an issue.

Alternatively, you can use a pure API call (vw_c, wrapping libvw) for prediction.
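
The subprocess approach described in this section (write to stdin, read results back from a temporary file) can be sketched without vw itself. In the example below a tiny Python child process stands in for the vw binary; it writes one number per input line to the file named on its command line. Everything here, including run_and_collect and the stand-in child, is an illustrative assumption rather than vowpal_porpoise's actual code.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for the vw binary: for each stdin line, write the number of
# whitespace-separated tokens to the output file given as the first argument.
CHILD = (
    "import sys\n"
    "with open(sys.argv[1], 'w') as out:\n"
    "    for line in sys.stdin:\n"
    "        out.write('%d\\n' % len(line.split()))\n"
)


def run_and_collect(instances):
    """Push instances to a child over stdin, then read results from a temp file."""
    fd, pred_path = tempfile.mkstemp(suffix=".prediction")
    os.close(fd)
    try:
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, pred_path],
            stdin=subprocess.PIPE,
            universal_newlines=True,
        )
        for instance in instances:
            proc.stdin.write(instance + "\n")
        proc.stdin.close()  # signals end of input to the child
        if proc.wait() != 0:
            raise RuntimeError("child exited with code %d" % proc.returncode)
        # Only read the file once the child has finished writing it.
        with open(pred_path) as handle:
            return [float(line) for line in handle]
    finally:
        os.remove(pred_path)


if __name__ == "__main__":
    print(run_and_collect(["1 | a b", "0 | c"]))
```

Because the parent only writes to the pipe and never reads from it, it avoids the two-way streaming problem mentioned above; the cost is a round trip through disk.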

Contact

Joseph Reisinger @josephreisinger

Contributors

License

Apache 2.0

Comments
  • Issue with example1.py

    Hi, guys!

    When I run example1.py it raises an exception.

    """
    [email protected]:~/vowpal_porpoise/examples$ python example1.py
    example1: training
    [DEBUG] No existing model file or not options.incremental
    [DEBUG] Running command: "vw --learning_rate=15.000000 --power_t=1.000000 --passes 10 --cache_file /home/kolesman/vowpal_porpoise/examples/example1.cache -f /home/kolesman/vowpal_porpoise/examples/example1.model"
    done streaming.
    final_regressor = /home/kolesman/vowpal_porpoise/examples/example1.model
    Num weight bits = 18
    learning rate = 15
    initial_t = 0
    power_t = 1
    decay_learning_rate = 1
    creating cache_file = /home/kolesman/vowpal_porpoise/examples/example1.cache
    Reading datafile =
    num sources = 1
    average    since      example  example  current  current  current
    loss       last       counter  weight   label    predict  features
    0.360904   0.360904   3        3.0      1.0000   0.7933   5
    0.266263   0.171622   6        6.0      0.0000   0.2465   5
    -nan       -nan       11       11.0     0.0000   0.0000   5 h
    -nan       -nan       22       22.0     0.0000   0.0000   5 h
    -nan       -nan       44       44.0     1.0000   1.0000   5 h
    Traceback (most recent call last):
      File "example1.py", line 86, in <module>
        for (instance, prediction) in SimpleModel('example1').train(instances).predict(instances):
      File "example1.py", line 44, in train
        print 'done streaming.'
      File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
        self.gen.next()
      File "/usr/local/lib/python2.7/dist-packages/vowpal_porpoise-0.3-py2.7.egg/vowpal_porpoise/vw.py", line 167, in training
        self.close_process()
      File "/usr/local/lib/python2.7/dist-packages/vowpal_porpoise-0.3-py2.7.egg/vowpal_porpoise/vw.py", line 203, in close_process
        (self.vw_process.pid, self.vw_process.command, self.vw_process.returncode))
    Exception: vw_process 22007 (vw --learning_rate=15.000000 --power_t=1.000000 --passes 10 --cache_file /home/kolesman/vowpal_porpoise/examples/example1.cache -f /home/kolesman/vowpal_porpoise/examples/example1.model) exited abnormally with return code -11
    """

    Do you have any idea what the source of the problem is?

    opened by kolesman 2
  • Make tagged VW data work

    For whatever reason, when the VW data is tagged, the parser barfs on reading the prediction file because it gets the prediction value and the tag back. This fixes it for me.

    opened by mswimmer 1
  • Added support for nn (single layer) in sklearn interface

    Adding support for nn to be called from the wrapper.

    [DEBUG] Running command: "vw --learning_rate=5.000000 --l2=0.000010 --oaa=10 --nn=4 --passes 10 --cache_file /home/vvkulkarni/vowpal_porpoise/examples/example_sklearn.cache -f /home/vvkulkarni/vowpal_porpoise/examples/example_sklearn.model"
    [DEBUG] Running command: "vw --learning_rate=5.000000 --l2=0.000010 --oaa=10 --nn=4 -t -i /home/vvkulkarni/vowpal_porpoise/examples/example_sklearn.model -p /home/vvkulkarni/vowpal_porpoise/examples/example_sklearn.predictionecqcFA"
    Confusion Matrix:
    [[34  0  0  0  1  0  0  0  0  0]
     [ 0 29  0  0  0  0  0  0  0  7]
     [ 0  0 35  0  0  0  0  0  0  0]
     [ 0  0  0 24  0  4  0  3  6  0]
     [ 0  0  0  0 34  0  0  0  3  0]
     [ 0  0  0  0  0 37  0  0  0  0]
     [ 0  0  0  0  0  0 37  0  0  0]
     [ 0  0  0  1  0  0  0 32  2  1]
     [ 0  1  0  0  0  1  0  1 30  0]
     [ 0  0  0  0  0  2  0  1  3 31]]
    0.89717036724

    Adding @aboSamoor (as he is interested in this CL too)

    opened by viveksck 1
  • Encode Cython as a setup-time dependency of vowpal porpoise

    Encoding Cython as a setup-time dependency makes it much easier to use vowpal porpoise in nicely packaged distributions.

    Without Cython as a setup-time dependency, you might have a requirements.txt with these lines:

        Cython
        git+http://github.com/josephreisinger/vowpal_porpoise.git#egg=vowpal_porpoise

    and try to execute "pip install -r requirements.txt" (or, for instance, push to Heroku and expect it to do so).

    Unfortunately the installation process for Cython will not be completed before vowpal porpoise needs it. By specifying Cython as a setup-time dependency in the vowpal porpoise setup.py, Cython will be downloaded and available before it is needed, and you don't have to specify it as a dependency elsewhere. Using my modified setup.py, I can now run "pip install git+http://github.com/josephreisinger/vowpal_porpoise.git#egg=vowpal_porpoise" without any mention of Cython.

    opened by mattbornski 0
  • Update example_sklearn.py

    y must be a binary list; otherwise it results in an error:

    [DEBUG] No existing model file or not options.incremental
    [DEBUG] Running command: "vw --learning_rate=10.000000 --l2=0.000010 --loss_function=logistic --passes 10 --cache_file /Users/datle/Desktop/example_sklearn.cache -f /Users/datle/Desktop/example_sklearn.model"
    [DEBUG] Running command: "vw --learning_rate=10.000000 --l2=0.000010 --loss_function=logistic -t -i /Users/datle/Desktop/example_sklearn.model -p /Users/datle/Desktop/example_sklearn.predictiond9d1DV"
    Traceback (most recent call last):
      File "test.py", line 72, in <module>
        main()
      File "test.py", line 58, in main
        ).fit(X_train, y_train)
      File "/Library/Python/2.7/site-packages/sklearn/grid_search.py", line 732, in fit
        return self._fit(X, y, ParameterGrid(self.param_grid))
      File "/Library/Python/2.7/site-packages/sklearn/grid_search.py", line 505, in _fit
        for parameters in parameter_iterable
      File "/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py", line 659, in __call__
        self.dispatch(function, args, kwargs)
      File "/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py", line 406, in dispatch
        job = ImmediateApply(func, args, kwargs)
      File "/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py", line 140, in __init__
        self.results = func(*args, **kwargs)
      File "/Library/Python/2.7/site-packages/sklearn/cross_validation.py", line 1478, in _fit_and_score
        test_score = _score(estimator, X_test, y_test, scorer)
      File "/Library/Python/2.7/site-packages/sklearn/cross_validation.py", line 1534, in _score
        score = scorer(estimator, X_test, y_test)
      File "/Library/Python/2.7/site-packages/sklearn/metrics/scorer.py", line 201, in _passthrough_scorer
        return estimator.score(*args, **kwargs)
      File "/Library/Python/2.7/site-packages/sklearn/base.py", line 295, in score
        return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
      File "/Library/Python/2.7/site-packages/sklearn/metrics/classification.py", line 179, in accuracy_score
        y_type, y_true, y_pred = _check_targets(y_true, y_pred)
      File "/Library/Python/2.7/site-packages/sklearn/metrics/classification.py", line 84, in _check_targets
        "".format(type_true, type_pred))
    ValueError: Can't handle mix of binary and continuous
    
    opened by lenguyenthedat 0
  • Can't run example 1

    Hi, if I try to run example1 after installing everything, I get the following error:

    File "example1.py", line 86, in <module>
        for (instance, prediction) in SimpleModel('example1').train(instances).predict(instances):
      File "example1.py", line 37, in train
        with self.model.training():
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 17, in __enter__
        return self.gen.next()
      File "build/bdist.macosx-10.9-intel/egg/vowpal_porpoise/vw.py", line 168, in training
      File "build/bdist.macosx-10.9-intel/egg/vowpal_porpoise/vw.py", line 194, in start_training
      File "build/bdist.macosx-10.9-intel/egg/vowpal_porpoise/vw.py", line 266, in make_subprocess
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
        errread, errwrite)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1308, in _execute_child
        raise child_exception
    OSError: [Errno 2] No such file or directory
    

    I'd hazard a guess the cache file is not getting created. Please help?

    opened by Kaydeeb0y 0
  • Doesn't work on ipython notebook

    I'm trying to use vowpal porpoise from my IPython Notebook web interface. Running this code:

    from vowpal_porpoise import VW
    vw = VW(vw='vw_new', 
       passes=2,
       moniker='log_train.vw', 
       loss='logistic')
    with vw.training():
        pass
    

    I get this:

    ---------------------------------------------------------------------------
    UnsupportedOperation                      Traceback (most recent call last)
    <ipython-input-7-39be08ecca54> in <module>()
          3    moniker='log_train.vw',
          4    loss='logistic')
    ----> 5 with vw.training():
          6     pass
    
    /usr/lib/python2.7/contextlib.pyc in __enter__(self)
         15     def __enter__(self):
         16         try:
    ---> 17             return self.gen.next()
         18         except StopIteration:
         19             raise RuntimeError("generator didn't yield")
    
    /usr/local/lib/python2.7/dist-packages/vowpal_porpoise-0.3-py2.7.egg/vowpal_porpoise/vw.pyc in training(self)
        166     @contextmanager
        167     def training(self):
    --> 168         self.start_training()
        169         yield
        170         self.close_process()
    
    /usr/local/lib/python2.7/dist-packages/vowpal_porpoise-0.3-py2.7.egg/vowpal_porpoise/vw.pyc in start_training(self)
        192 
        193         # Run the actual training
    --> 194         self.vw_process = self.make_subprocess(self.vw_train_command(cache_file, model_file))
        195 
        196         # set the instance pusher
    
    /usr/local/lib/python2.7/dist-packages/vowpal_porpoise-0.3-py2.7.egg/vowpal_porpoise/vw.pyc in make_subprocess(self, command)
        264             stderr.write(command + '\n')
        265         self.log.debug('Running command: "%s"' % str(command))
    --> 266         result = subprocess.Popen(shlex.split(str(command)), stdin=subprocess.PIPE, stdout=stdout, stderr=stderr, close_fds=True, universal_newlines=True)
        267         result.command = command
        268         return result
    
    /usr/lib/python2.7/subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
        670         (p2cread, p2cwrite,
        671          c2pread, c2pwrite,
    --> 672          errread, errwrite) = self._get_handles(stdin, stdout, stderr)
        673 
        674         self._execute_child(args, executable, preexec_fn, close_fds,
    
    /usr/lib/python2.7/subprocess.pyc in _get_handles(self, stdin, stdout, stderr)
       1063             else:
       1064                 # Assuming file-like object
    -> 1065                 errwrite = stderr.fileno()
       1066 
       1067             return (p2cread, p2cwrite,
    
    /usr/local/lib/python2.7/dist-packages/IPython/kernel/zmq/iostream.pyc in fileno(self)
        192 
        193     def fileno(self):
    --> 194         raise UnsupportedOperation("IOStream has no fileno.")
        195 
        196     def write(self, string):
    
    UnsupportedOperation: IOStream has no fileno.
    
    
    opened by khalman-m 0
  • Make input format for cross validation consistent with that of VW

    First of all, this is a great wrapper! It was very nice to see the linear regression with l1 penalty example take input in the VW format. However, it would be great for beginners like me to have a similar example for getting the GridSearch to work with VW.

    opened by Legend 0
  • GridSearchCV with n_jobs > 1 (Parallelized) with VW classifier results in a Broken Pipe error

    127         with self.vw_.training():
    128             for instance in examples:
    129                 self.vw_.push_instance(instance) <-----
    130 
    131         # learning done after "with" statement
    132         return self
    133 
    

    ...........................................................................
    /usr/local/lib/python2.7/dist-packages/vowpal_porpoise-0.3-py2.7.egg/vowpal_porpoise/vw.pyc in push_instance_stdin(self=<vowpal_porpoise.vw.VW instance>, instance='2 | 42:2.000000 29:16.000000 60:13.000000 61:16....3:16.000000 52:16.000000 33:7.000000 37:16.000000')
        204         if self.vw_process.wait() != 0:
        205             raise Exception("vw_process %d (%s) exited abnormally with return code %d" %
        206                 (self.vw_process.pid, self.vw_process.command, self.vw_process.returncode))
        207 
        208     def push_instance_stdin(self, instance):
    --> 209         self.vw_process.stdin.write(('%s\n' % instance).encode('utf8'))
        210 
        211     def start_predicting(self):
        212         model_file = self.get_model_file()
        213         # Be sure that the prediction file has a unique filename, since many processes may try to

    IOError: [Errno 32] Broken pipe

    To reproduce: Just pass the parameter n_jobs = 10 to GridSearchCV in example_sklearn.py

    opened by viveksck 0
Releases(0.3)