A torch.Tensor-like DataFrame library supporting multiple execution runtimes and Arrow as a common memory format

Overview

TorchArrow (Warning: Unstable Prototype)

This is a prototype library currently under heavy development. It does not yet have stable releases, and it will likely change significantly in backwards-incompatible ways until the beta release (targeting early 2022). If you have suggestions on the API or on use cases you would like to see covered, please open a GitHub issue. We would love to hear your thoughts and feedback.

TorchArrow is a torch.Tensor-like Python DataFrame library for data preprocessing in deep learning. It supports multiple execution runtimes and Arrow as a common format.

It plans to provide:

  • A Python DataFrame library implementing a streaming-friendly subset of the Pandas API (a minimal usage sketch follows this list)
  • Seamless handoff to PyTorch and other model-authoring tools, such as Tensor collation and easy integration with PyTorch DataLoader and DataPipes
  • Zero copy for external readers via Arrow in-memory columnar format
  • High-performance CPU backend via Velox
  • GPU backend via libcudf
  • High-performance C++ UDF support with vectorization
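
To make the intended workflow concrete, here is a minimal sketch, assuming the torcharrow and pyarrow packages are installed; it only uses the documented ta.dataframe, ta.from_arrow, and column APIs, but since the library is an unstable prototype these names may change:

import pyarrow as pa
import torcharrow as ta

# Build a DataFrame directly from Python data ...
df = ta.dataframe({"a": [1, 2, 3], "b": [1.0, None, 3.0]})

# ... or hand off an Arrow table (zero copy where the memory layout allows it).
table = pa.table({"a": [1, 2, 3], "b": [1.0, None, 3.0]})
df2 = ta.from_arrow(table)

# Pandas-style, streaming-friendly column operations.
print(df["a"] + 1)
print(df2["b"].fill_null(0.0).mean())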

Installation

Binaries

Coming soon!

From Source

If you are installing from source, you will need Python 3.8 or later and a C++17 compiler. We also highly recommend installing a Miniconda environment.

Get the TorchArrow Source

git clone --recursive https://github.com/facebookresearch/torcharrow
cd torcharrow
# if you are updating an existing checkout
git submodule sync --recursive
git submodule update --init --recursive

Install Dependencies

On macOS

Homebrew is required to install development tools on macOS.

# Install dependencies from Brew
brew install --formula ninja cmake ccache protobuf icu4c boost gflags glog libevent lz4 lzo snappy xz zstd

# Build and install other dependencies
scripts/build_mac_dep.sh ranges_v3 googletest fmt double_conversion folly re2

On Ubuntu (20.04 or later)

# Install dependencies from APT
apt install -y g++ cmake ccache ninja-build checkinstall \
    libssl-dev libboost-all-dev libdouble-conversion-dev libgoogle-glog-dev \
    libbz2-dev libgflags-dev libgtest-dev libgmock-dev libevent-dev libfmt-dev \
    libprotobuf-dev liblz4-dev libzstd-dev libre2-dev libsnappy-dev liblzo2-dev \
    protobuf-compiler
# Build and install Folly
scripts/install_ubuntu_folly.sh

Install TorchArrow

For local development, you can build in debug mode:

DEBUG=1 python setup.py develop

Then run the unit tests with

python -m unittest -v

To install TorchArrow in release mode (WARNING: this may take a long time to build):

python setup.py install
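
As a quick sanity check after either build, the following minimal smoke test can be run (this is a suggestion, not part of the official instructions; it assumes only the documented ta.dataframe factory and basic column arithmetic):

# smoke_test.py (hypothetical file): verify that the build imports and basic ops work
import torcharrow as ta

df = ta.dataframe({"a": [1, 2, 3]})
assert list(df["a"] + 1) == [2, 3, 4]
print(df)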

Documentation

This 10-minute tutorial provides a short introduction to TorchArrow. More documentation on advanced topics is coming soon!

Future Plans

We hope to sufficiently expand the library, harden APIs, and gather feedback to enable a beta release at the time of the PyTorch 1.11 release (early 2022).

License

TorchArrow is BSD licensed, as found in the LICENSE file.

Comments
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/95b09c7bad6baa93d8f6add4562dfe0cdc8c26cd

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 80
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/0cb50b9fdfccbb277e62e1c2541ae084b29d6080

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 79
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/e432b1df0be62f65e0ba00b4fb966605bcb1443e

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 58
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/7673b382d909add4738240ae0157f2d5cafcf546

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 56
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/5e37e22c974fcd9caceb3dd97a0e84386d188474

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 55
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/08be6833961213b6679a7a7707ca53d486ff84df

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 46
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/41971b30c1cdd9f984018d6a496bc3b99afc7b45

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 36
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/4a36551237993f519dbf5bbae70a4ac9a660bf0d

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 35
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/0e4d3a5efece59b7d6a6a8f23ed7e4668d078e2e

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 33
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/fb7b62bede0beb66cf87fe888d71a9c366fe5ed6

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 33
  • Automated submodule update: velox

    This is an automated pull request to update the first-party submodule for facebookincubator/velox.

    New submodule commit: https://github.com/facebookincubator/velox/commit/b32878fb54eefb01c0c577439d0d6d61644dcff9

    Test Plan: Ensure that CI jobs succeed on GitHub before landing.

    CLA Signed 
    opened by facebook-github-bot 33
  • Stable Release Roadmap

    Hello, I see that the development of the library has slowed down a bit, hence I would like to ask if there exists a roadmap for the first stable release or if there's any other plan for TorchArrow. Thank you very much for your work!

    opened by mbignotti 0
  • `from_arrow` with `List` columns

    Summary: Adds some basic functionality to allow Arrow tables/arrays with List[primitive_type] columns to be converted to a ta.Dataframe.

    Implemented by converting the list column to a pylist and wrapping _from_pysequence. Not super efficient, but provides some functionality to unblock these columns.

    Tests: Modified previous test case that checked for unsupported type. python -m unittest -v

    ----------------------------------------------------------------------
    Ran 196 tests in 1.108s
    
    OK
    
    CLA Signed 
    opened by myzha0 0
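
    The pull request above describes converting the Arrow list column to Python lists and wrapping the internal _from_pysequence helper. As a rough, public-API-only illustration of that idea (a sketch assuming ta.column and torcharrow.dtypes.List behave as documented; this is not the actual patch):

    import pyarrow as pa
    import torcharrow as ta
    import torcharrow.dtypes as dt

    # An Arrow list<float32> array with a null entry.
    arr = pa.array([[1.0, 2.0], None, [3.0]], type=pa.list_(pa.float32()))

    # Not zero copy: materialize the Arrow data as nested Python lists ...
    pylist = arr.to_pylist()

    # ... and rebuild it as a TorchArrow column with an explicit list dtype.
    col = ta.column(pylist, dtype=dt.List(dt.float32, nullable=True))
    print(col)
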
  • Generalize Dispatcher class

    Summary: Generalizing it for reuse in different contexts. Also changing to a global-instance-as-singleton pattern, so that we can instantiate separate instances for different use cases without worrying about mis-sharing class-variable state.

    Differential Revision: D40188963

    CLA Signed fb-exported 
    opened by OswinC 1
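
    For readers unfamiliar with the pattern mentioned above, a generic sketch of "global instance as singleton" (hypothetical names, not the actual TorchArrow Dispatcher code) looks roughly like this:

    class Dispatcher:
        def __init__(self):
            # Per-instance registry instead of shared class variables.
            self._registry = {}

        def register(self, key, fn):
            self._registry[key] = fn

        def lookup(self, key):
            return self._registry[key]

    # One module-level instance serves as the default "singleton", while other
    # contexts can create their own Dispatcher without sharing its state.
    default_dispatcher = Dispatcher()
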
  • Support for arrays in torcharrow.from_arrow

    Hi guys! When trying to use ParquetDataFrameLoader I ran into a problem loading a Parquet file that has an array field. It looks like it comes down to torcharrow.from_arrow not supporting array columns, even though torcharrow itself already seems to support them. Are there any plans to implement this when loading from Parquet files, or are there problems that prevent it from being implemented?

    The error basically looks like this:

    NotImplementedError                       Traceback (most recent call last)
    Input In [25], in <cell line: 1>()
    ----> 1 next(iter(datapipe))
    
    File /opt/conda/lib/python3.8/site-packages/torch/utils/data/datapipes/_typing.py:514, in hook_iterator.<locals>.wrap_generator(*args, **kwargs)
        512         response = gen.send(None)
        513 else:
    --> 514     response = gen.send(None)
        516 while True:
        517     request = yield response
    
    File /opt/conda/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/combinatorics.py:127, in ShufflerIterDataPipe.__iter__(self)
        125 self._rng.seed(self._seed)
        126 self._seed = None
    --> 127 for x in self.datapipe:
        128     if len(self._buffer) == self.buffer_size:
        129         idx = self._rng.randint(0, len(self._buffer) - 1)
    
    File /opt/conda/lib/python3.8/site-packages/torch/utils/data/datapipes/_typing.py:514, in hook_iterator.<locals>.wrap_generator(*args, **kwargs)
        512         response = gen.send(None)
        513 else:
    --> 514     response = gen.send(None)
        516 while True:
        517     request = yield response
    
    File /opt/conda/lib/python3.8/site-packages/torchdata/datapipes/iter/util/dataframemaker.py:138, in ParquetDFLoaderIterDataPipe.__iter__(self)
        135 for i in range(num_row_groups):
        136     # TODO: More fine-grain control over the number of rows or row group per DataFrame
        137     row_group = parquet_file.read_row_group(i, columns=self.columns, use_threads=self.use_threads)
    --> 138     yield torcharrow.from_arrow(row_group, dtype=self.dtype)
    
    File /opt/conda/lib/python3.8/site-packages/torcharrow/interop.py:32, in from_arrow(data, dtype, device)
         30     return _from_arrow_array(data, dtype, device=device)
         31 elif isinstance(data, pa.Table):
    ---> 32     return _from_arrow_table(data, dtype, device=device)
         33 else:
         34     raise ValueError
    
    File /opt/conda/lib/python3.8/site-packages/torcharrow/interop_arrow.py:86, in _from_arrow_table(table, dtype, device)
         83     field = table.schema.field(i)
         85     assert len(table[i].chunks) == 1
    ---> 86     df_data[field.name] = _from_arrow_array(
         87         table[i].chunk(0),
         88         dtype=(
         89             # pyre-fixme[16]: `DType` has no attribute `get`.
         90             dtype.get(field.name)
         91             if dtype is not None
         92             else _arrowtype_to_dtype(field.type, field.nullable)
         93         ),
         94         device=device,
         95     )
         97 return dataframe(df_data, device=device)
    
    File /opt/conda/lib/python3.8/site-packages/torcharrow/interop_arrow.py:37, in _from_arrow_array(array, dtype, device)
         28 assert isinstance(array, pa.Array)
         30 # Using the most narrow type we can, we (i) don't restrict in any
         31 # way where it can be used (since we can pass a narrower typed
         32 # non-null column to a function expecting a nullable type, but not
       (...)
         35 # increase the amount of places we can use the from_arrow result
         36 # pyre-fixme[16]: `Array` has no attribute `type`.
    ---> 37 dtype_from_arrowtype = _arrowtype_to_dtype(array.type, array.null_count > 0)
         38 if dtype and (
         39     dt.get_underlying_dtype(dtype) != dt.get_underlying_dtype(dtype_from_arrowtype)
         40 ):
         41     raise NotImplementedError("Type casting is not supported")
    
    File /opt/conda/lib/python3.8/site-packages/torcharrow/_interop.py:205, in _arrowtype_to_dtype(t, nullable)
        199 if pa.types.is_struct(t):
        200     return dt.Struct(
        201         # pyre-fixme[16]: `DataType` has no attribute `__iter__`.
        202         [dt.Field(f.name, _arrowtype_to_dtype(f.type, f.nullable)) for f in t],
        203         nullable,
        204     )
    --> 205 raise NotImplementedError(f"Unsupported Arrow type: {str(t)}")
    
    NotImplementedError: Unsupported Arrow type: list<element: float>
    This exception is thrown by __iter__ of ParquetDFLoaderIterDataPipe()
    
    opened by grapefroot 2
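
    For reference, the limitation in the report above can be reproduced without Parquet at all. With a TorchArrow build from around the time of this issue (behavior may have changed since), a minimal sketch is:

    import pyarrow as pa
    import torcharrow as ta

    # A table with a list<float> column, the type rejected in the traceback above.
    table = pa.table({"x": pa.array([[1.0, 2.0], [3.0]], type=pa.list_(pa.float32()))})

    try:
        ta.from_arrow(table)
    except NotImplementedError as err:
        print(err)  # e.g. "Unsupported Arrow type: list<item: float>"
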
Releases (v0.1.0)
  • v0.1.0 (Jul 13, 2022)

    We are excited to release the very first Beta version of TorchArrow! TorchArrow is a machine learning preprocessing library for batch data, providing a performant, Pandas-style, easy-to-use API for model development.

    Highlights

    TorchArrow provides a Python DataFrame that allows extensible UDFs with Velox, with the following features:

    • Seamless handoff to PyTorch and other model-authoring tools, such as Tensor collation and easy integration with PyTorch DataLoader and DataPipes (a minimal hand-off sketch follows this list)
    • Zero copy for external readers via Arrow in-memory columnar format
    • Multiple execution runtimes support:
      • High-performance CPU backend via Velox
      • (Future Work) GPU backend via libcudf
    • High-performance C++ UDF support with vectorization
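
    As a minimal hand-off sketch (this deliberately uses plain torch.tensor over materialized columns rather than any dedicated collation API, and assumes only the documented ta.dataframe factory):

    import torch
    import torcharrow as ta

    # Preprocess in TorchArrow, then materialize columns as tensors for a model.
    df = ta.dataframe({"feature": [0.1, 0.2, 0.3], "label": [0, 1, 1]})
    features = torch.tensor(list(df["feature"]), dtype=torch.float32)
    labels = torch.tensor(list(df["label"]))
    print(features, labels)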

    Installation

    In this release we support installation via PyPI: pip install torcharrow.

    Documentation

    You can find the API documentation here.

    This 10-minute tutorial provides a short introduction to TorchArrow, and you can also try it in this Colab.

    Examples

    You can find an example of integrating a TorchRec-based training loop with TorchArrow's on-the-fly preprocessing here. More examples are coming soon!

    Future Plans

    We hope to continue to expand the library, harden the API, and gather feedback to enable future releases. Stay tuned!

    Beta Usage Note

    TorchArrow is currently in the Beta stage and does not have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you have suggestions on the API or use cases you would like to see covered, please open a GitHub issue. We would love to hear your thoughts and feedback.

    Source code(tar.gz)
    Source code(zip)
Owner
Facebook Research