Generating synthetic mobility data for a realistic population with RNNs to improve utility and privacy

Overview

lbs-data

Motivation

Location data is collected from the public by private firms via mobile devices. Can this data also be used to serve the public good while preserving privacy? Can we realize this goal by generating synthetic data for use instead of the real data? The synthetic data would need to balance utility and privacy.

Overview

What:

This project uses location based services (LBS) data provided by a location intelligence company in order to train a RNN model to generate synthetic location data. The goal is for the synthetic data to maintain the properties of the real data, at the individual and aggregate levels, in order to retain its utility. At the same time, the synthetic data should sufficiently differ from the real data at the individual level, in order to preserve user privacy.

Furthermore, the system uses home and work areas as labels and inputs in order to generate location data for synthetic users with the given home and work areas.
This addresses the issue of limited sample sizes. Population data, such as census data, can be used to create the input necessary to output a synthetic location dataset that represents the true population in size and distribution.

Data

/data/

ACS data

data/ACS/ma_acs_5_year_census_tract_2018/

Population data is sourced from the 2018 American Community Survey 5-year estimates.

LBS data

/data/mount/

Privately stored on a remote server.

Geography and time period

  • Geography: The region of study is limited to 3 counties surrounding Boston, MA.
  • Time period: The training and output data is for the first 5-day workweek of May 2018.

Data representation

The LBS data are provided as rows.

device ID, latitude, longitude, timestamp, dwelltime

The data are transformed into "stay trajectories", which are sequences where each index of a sequence represents a 1-hour time interval. Each stay trajectory represents the data for one user (device ID). The value at that index represents the location/area (census tract) where the user spent the most time during that 1-hour interval.

e.g.

[A,B,D,C,A,A,A,NULL,B...]

Where each letter represents a location. There are null values when no location data is reported in the time interval.

home and work locations are inferred for each user stay trajectory. stay trajectories are prefixed with the home and work locations. This home, work prefixes then serve as labels.

[home,work,A,B,D,C,A,A,A,NULL,B...]

Where home,work values are also elements (frequently) occuring in their associated stay trajectory (e.g. home=A).

These sequences are used to train the model and are also output by the model.

RNN

The RNN model developed in this work is meant to be simple and replicable. It was implemented via the open source textgenrnn library. https://github.com/minimaxir/textgenrnn.

Many models (>70) are trained with a variety of hyper parameter values. The models are each trained on the same training data and then use the same input (home, work labels) to generate output synthetic data. The output is evalued via a variety of utility and privacy metrics in order to determine the best model/parameters.

Pipeline

Preprocessing

Define geography / shapefiles

./shapefile_shaper.ipynb

Our study uses 3 counties surrounding Boston, MA: Middlesex, Norfolk, Suffolk counties.

shapefile_shaper prunes MA shapefiles for this geography.

Output files are in ./shapefiles/ma/

Census tracts are used as "areas"/locations in stay trajectories.

Data filtering

./preprocess_filtering.ipynb

The LBS data is sparse. Some users report just a few datapoints, while other users report many. In order to confidently infer home and work locations, and learn patterns, we only include data from devices with sufficient reporting.

./preprocess_filtering.ipynb filters the data accordingly. It pokes the data to try to determine what the right level of filtering is. It outputs saved files with filtered data. Namely, it saves a datafile with LBS data from devices that reported at least 3 days and 3 nights of data during the 1 workweek of the study period. This is the pruned dataset used in the following work.

Attach areas

/attach_areas.ipynb

Census areas are attached to LBS data rows.

Home, work inference

./infer_home_work.ipynb

Defines functions to infer home and work locations (census tracts ) for each device user, based on their LBS data. The home location is where the user spends most time in nighttime hours. The "work" location is where the user spends the most time in workday hours. These locations can be the same.

This file helps determine good hours to use for nighttime hours. Once the functions are defined, they are used to evaluate the data representativeness by comparing the inferred population statistics to ACS 2018 census data.

Saves a mapping of LBS user IDS to the inferred home,work locations.

Stay trajectories setup

./trajectory_synthesis/trajectory_synthesis_notebook.ipynb

Transforms preprocessed LBS data into prefixed stay trajectories.

And outputs files for model training, data generation, and comparison.

Note: for the purposes of model training and data generation, the area tokens within stay trajectories can be arbitrary. What is important for the model’s success is the relationship between them. In order to save the stay trajectories in this repository yet keep real data private, we do the following. We map real census areas to integers, and map areas in stay trajectories to the integers representing the areas. We use the transformed stay trajectories for model training and data generation. The mapping between real census areas and their integer representations is kept private. We can then map the integers in stay trajectories back to the real areas they represent when needed (such as when evaluating trip distance metrics).

Output files:

./data/relabeled_trajectories_1_workweek.txt: D: Full training set of 22704 trajectories

./data/relabeled_trajectories_1_workweek_prefixes_to_counts.json: Maps D home,work label prefixes to counts

./data/relabeled_trajectories_1_workweek_sample_2000.txt: S: Random sample of 2000 trajectories from D.

./data/relabeled_trajectories_1_workweek_prefixes_to_counts_sample_2000.json: Maps S home,work label prefixes to counts

  • This is used as the input for data generation so that the output sythetic sample, S', has a home,work label pair distribution that matches S.

Model training and data generation

./trajectory_synthesis/textgenrnn_generator/

Models with a variety of hyperparameter combinations were trained and then used to generate a synthetic sample.

The files model_trainer.py and generator.py are the templates for the scripts used to train and generate.

The model (hyper)parameter combinations were tracked in a spreadsheet. ./trajectory_synthesis/textgenrnn_generator/textgenrnn_model_parameters_.csv

Evaluation

./trajectory_synthesis/evaluation/evaluate_rnn.ipynb

A variety of utility and privacy evaluation tools and metrics were developed. Models were evaluated by their synthetic data outputs (S'). This was done in ./trajectory_synthesis/evaluation/evaluate_rnn.ipynb. The best model (i.e. best parameters) was determined by these evaluations. The results for this model are captured in trajectory_synthesis/evaluation/final_eval_plots.ipynb.

Owner
Alex
Systems Architect, product oriented Engineer, Hacker for the social good, Math Nerd that loves solving hard problems and working with great people.
Alex
A Re-implementation of the paper "A Deep Learning Framework for Character Motion Synthesis and Editing"

What is This This is a simple re-implementation of the paper "A Deep Learning Framework for Character Motion Synthesis and Editing"(1). Only Sections

102 Dec 14, 2022
Tensors and neural networks in Haskell

Hasktorch Hasktorch is a library for tensors and neural networks in Haskell. It is an independent open source community project which leverages the co

hasktorch 920 Jan 04, 2023
This repository contains the code needed to train Mega-NeRF models and generate the sparse voxel octrees

Mega-NeRF This repository contains the code needed to train Mega-NeRF models and generate the sparse voxel octrees used by the Mega-NeRF-Dynamic viewe

cmusatyalab 260 Dec 28, 2022
Hcaptcha-challenger - Gracefully face hCaptcha challenge with Yolov5(ONNX) embedded solution

hCaptcha Challenger 🚀 Gracefully face hCaptcha challenge with Yolov5(ONNX) embe

593 Jan 03, 2023
Code for "PV-RAFT: Point-Voxel Correlation Fields for Scene Flow Estimation of Point Clouds", CVPR 2021

PV-RAFT This repository contains the PyTorch implementation for paper "PV-RAFT: Point-Voxel Correlation Fields for Scene Flow Estimation of Point Clou

Yi Wei 43 Dec 05, 2022
Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch

Transformer in Transformer Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image c

Phil Wang 272 Dec 23, 2022
Code for "Diversity can be Transferred: Output Diversification for White- and Black-box Attacks"

Output Diversified Sampling (ODS) This is the github repository for the NeurIPS 2020 paper "Diversity can be Transferred: Output Diversification for W

50 Dec 11, 2022
A set of simple scripts to process the Imagenet-1K dataset as TFRecords and make index files for NVIDIA DALI.

Overview This is a set of simple scripts to process the Imagenet-1K dataset as TFRecords and make index files for NVIDIA DALI. Make TFRecords To run t

8 Nov 01, 2022
Expand human face editing via Global Direction of StyleCLIP, especially to maintain similarity during editing.

Oh-My-Face This project is based on StyleCLIP, RIFE, and encoder4editing, which aims to expand human face editing via Global Direction of StyleCLIP, e

AiLin Huang 51 Nov 17, 2022
Public Models considered for emotion estimation from EEG

Emotion-EEG Set of models for emotion estimation from EEG. Composed by the combination of two deep-learing models learning together (RNN and CNN) with

Victor Delvigne 21 Dec 23, 2022
Source code of our work: "Benchmarking Deep Models for Salient Object Detection"

SALOD Source code of our work: "Benchmarking Deep Models for Salient Object Detection". In this works, we propose a new benchmark for SALient Object D

22 Dec 30, 2022
Calculates JMA (Japan Meteorological Agency) seismic intensity (shindo) scale from acceleration data recorded in NumPy array

shindo.py Calculates JMA (Japan Meteorological Agency) seismic intensity (shindo) scale from acceleration data stored in NumPy array Introduction Japa

RR_Inyo 3 Sep 23, 2022
Classic Papers for Beginners and Impact Scope for Authors.

There have been billions of academic papers around the world. However, maybe only 0.0...01% among them are valuable or are worth reading. Since our limited life has never been forever, TopPaper provi

Qiulin Zhang 228 Dec 18, 2022
The codebase for Data-driven general-purpose voice activity detection.

Data driven GPVAD Repository for the work in TASLP 2021 Voice activity detection in the wild: A data-driven approach using teacher-student training. S

Heinrich Dinkel 75 Nov 27, 2022
Bayesian algorithm execution (BAX)

Bayesian Algorithm Execution (BAX) Code for the paper: Bayesian Algorithm Execution: Estimating Computable Properties of Black-box Functions Using Mut

Willie Neiswanger 38 Dec 08, 2022
Assginment for UofT CSC420: Intro to Image Understanding

Run the code Open edge_detection.ipynb in google colab. Upload image1.jpg,image2.jpg and my_image.jpg to '/content/drive/My Drive'. chooose 'Run all'

Ziyi-Zhou 1 Feb 24, 2022
A cross-lingual COVID-19 fake news dataset

CrossFake An English-Chinese COVID-19 fake&real news dataset from the ICDMW 2021 paper below: Cross-lingual COVID-19 Fake News Detection. Jiangshu Du,

Yingtong Dou 11 Dec 01, 2022
A collection of pre-trained StyleGAN2 models trained on different datasets at different resolution.

Awesome Pretrained StyleGAN2 A collection of pre-trained StyleGAN2 models trained on different datasets at different resolution. Note the readme is a

Justin 1.1k Dec 24, 2022
Official implementation of paper Gradient Matching for Domain Generalization

Gradient Matching for Domain Generalisation This is the official PyTorch implementation of Gradient Matching for Domain Generalisation. In our paper,

94 Dec 23, 2022
Ludwig is a toolbox that allows to train and evaluate deep learning models without the need to write code.

Translated in 🇰🇷 Korean/ Ludwig is a toolbox that allows users to train and test deep learning models without the need to write code. It is built on

Ludwig 8.7k Dec 31, 2022