Overview

pyspark-anonymizer

A Python library that dynamically masks or anonymizes data in a PySpark environment, using rules defined as a JSON string or a Python dict.

Installing

pip install pyspark-anonymizer

Usage

Before Masking

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")
df.limit(5).toPandas()

|   | marketplace | customer_id | review_id | product_id | product_parent | product_title | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | US | 51163966 | R2RX7KLOQQ5VBG | B00000JBAT | 738692522 | Diamond Rio Digital Player | 3 | 0 | 0 | N | N | Why just 30 minutes? | RIO is really great, but Diamond should increa... | 1999-06-22 | 1999 |
| 1 | US | 30050581 | RPHMRNCGZF2HN | B001BRPLZU | 197287809 | NG 283220 AC Adapter Power Supply for HP Pavil... | 5 | 0 | 0 | N | Y | Five Stars | Great quality for the price!!!! | 2014-11-17 | 2014 |
| 2 | US | 52246039 | R3PD79H9CTER8U | B00000JBAT | 738692522 | Diamond Rio Digital Player | 5 | 1 | 2 | N | N | The digital audio "killer app" | One of several first-generation portable MP3 p... | 1999-06-30 | 1999 |
| 3 | US | 16186332 | R3U6UVNH7HGDMS | B009CY43DK | 856142222 | HDE Mini Portable Capsule Travel Mobile Pocket... | 5 | 0 | 0 | N | Y | Five Stars | I like it, got some for the Grandchilren | 2014-11-17 | 2014 |
| 4 | US | 53068431 | R3SP31LN235GV3 | B00000JBSN | 670078724 | JVC FS-7000 Executive MicroSystem (Discontinue... | 3 | 5 | 5 | N | N | Design flaws ruined the better functions | I returned mine for a couple of reasons: The ... | 1999-07-13 | 1999 |

After Masking

In this example we will add the following data anonymizers:

  • drop_column on column "marketplace"
  • replace on column "customer_id", changing every value to "*"
  • replace_with_regex on column "review_id", replacing each match of "R\d" (an R followed by any digit) with "*"
  • sha256 on "product_id" column
  • filter_row with condition "product_parent != 738692522"
from pyspark.sql import SparkSession
import pyspark.sql.functions as spark_functions
import pyspark_anonymizer

spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")

dataframe_anonymizers = [
    {
        "method": "drop_column",
        "parameters": {
            "column_name": "marketplace"
        }
    },
    {
        "method": "replace",
        "parameters": {
            "column_name": "customer_id",
            "replace_to": "*"
        }
    },
    {
        "method": "replace_with_regex",
        "parameters": {
            "column_name": "review_id",
            # Raw string, so Python does not treat \d as an escape sequence
            "replace_from_regex": r"R\d",
            "replace_to": "*"
        }
    },
    {
        "method": "sha256",
        "parameters": {
            "column_name": "product_id"
        }
    },
    {
        "method": "filter_row",
        "parameters": {
            "where": "product_parent != 738692522"
        }
    }
]

df_parsed = pyspark_anonymizer.Parser(df, dataframe_anonymizers, spark_functions).parse()
df_parsed.limit(5).toPandas()

|   | customer_id | review_id | product_id | product_parent | product_title | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | * | RPHMRNCGZF2HN | 69031b13080f90ae3bbbb505f5f80716cd11c4eadd8d86... | 197287809 | NG 283220 AC Adapter Power Supply for HP Pavil... | 5 | 0 | 0 | N | Y | Five Stars | Great quality for the price!!!! | 2014-11-17 | 2014 |
| 1 | * | *U6UVNH7HGDMS | c99947c06f65c1398b39d092b50903986854c21fd1aeab... | 856142222 | HDE Mini Portable Capsule Travel Mobile Pocket... | 5 | 0 | 0 | N | Y | Five Stars | I like it, got some for the Grandchilren | 2014-11-17 | 2014 |
| 2 | * | *SP31LN235GV3 | eb6b489524a2fb1d2de5d2e869d600ee2663e952a4b252... | 670078724 | JVC FS-7000 Executive MicroSystem (Discontinue... | 3 | 5 | 5 | N | N | Design flaws ruined the better functions | I returned mine for a couple of reasons: The ... | 1999-07-13 | 1999 |
| 3 | * | *IYAZPPTRJF7E | 2a243d31915e78f260db520d9dcb9b16725191f55c54df... | 503838146 | BlueRigger High Speed HDMI Cable with Ethernet... | 3 | 0 | 0 | N | Y | Never got around to returning the 1 out of 2 ... | Never got around to returning the 1 out of 2 t... | 2014-11-17 | 2014 |
| 4 | * | *RDD9FILG1LSN | c1f5e54677bf48936fb1e9838869630e934d16ac653b15... | 587294791 | Brookstone 2.4GHz Wireless TV Headphones | 5 | 3 | 3 | N | Y | Saved my. marriage, I swear to god. | Saved my.marriage, I swear to god. | 2014-11-17 | 2014 |
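
Why "RPHMRNCGZF2HN" comes through unchanged while "R3U6UVNH7HGDMS" becomes "*U6UVNH7HGDMS" can be checked on single values. The sketch below is plain Python and is not the library's code: `mask_review_id` and `sha256_hex` are hypothetical helper names, and it assumes that the regex and hash behave on individual strings the same way as the Spark column functions do for this simple pattern.

```python
import hashlib
import re


def mask_review_id(value: str) -> str:
    """Replace every 'R' that is followed by a digit (pattern R\\d) with '*'."""
    return re.sub(r"R\d", "*", value)


def sha256_hex(value: str) -> str:
    """Return the hex-encoded SHA-256 digest of the UTF-8 bytes of value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# "RP..." has no digit right after the R, so nothing matches and the value stays as-is.
for review_id in ["RPHMRNCGZF2HN", "R3U6UVNH7HGDMS", "R3SP31LN235GV3"]:
    print(review_id, "->", mask_review_id(review_id))

# A SHA-256 hex digest is always 64 characters long, whatever the input.
print(len(sha256_hex("B001BRPLZU")))
```

Only the parts of a value that match the pattern are replaced, so a review ID without an "R + digit" pair is left readable. If every ID must be masked, a broader pattern or the sha256 method is the safer choice.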

Anonymizers from DynamoDB

You can also store anonymizers in DynamoDB.

Creating DynamoDB table

To create the table, either use the example script or follow these steps in the AWS console:

  • DynamoDB > Tables > Create table
  • Table name: "pyspark_anonymizer" (or any name you prefer)
  • Partition key: "dataframe_name"
  • Customize the remaining settings if needed
  • Create table
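
The console steps above can also be scripted with boto3. This is a minimal sketch, not the project's example script: `build_table_definition` and `create_table` are hypothetical names, the string type for the partition key is an assumption (the usage example passes a string for `dataframe_name`), and on-demand billing is chosen only to keep the definition short.

```python
def build_table_definition(table_name: str = "pyspark_anonymizer") -> dict:
    """Keyword arguments for DynamoDB create_table, mirroring the console steps."""
    return {
        "TableName": table_name,
        # Partition key "dataframe_name"; "HASH" is DynamoDB's term for a partition key.
        "KeySchema": [{"AttributeName": "dataframe_name", "KeyType": "HASH"}],
        # Assumption: the key is stored as a string ("S").
        "AttributeDefinitions": [
            {"AttributeName": "dataframe_name", "AttributeType": "S"}
        ],
        # Assumption: on-demand capacity, so no throughput settings are needed.
        "BillingMode": "PAY_PER_REQUEST",
    }


def create_table(table_name: str = "pyspark_anonymizer"):
    """Create the table and block until it is ready (needs AWS credentials)."""
    import boto3  # imported here so the definition can be inspected without boto3

    table = boto3.resource("dynamodb").create_table(
        **build_table_definition(table_name)
    )
    table.wait_until_exists()
    return table
```

Calling `create_table()` once from a machine with AWS credentials should leave a table that the DynamoDB parser example can read from.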

Writing Anonymizer on DynamoDB

You can run the example script, then edit your settings from there.

Parse from DynamoDB

from pyspark.sql import SparkSession
import pyspark.sql.functions as spark_functions
import pyspark_anonymizer
import boto3
from botocore.exceptions import ClientError as client_error

dynamo_table_name = "pyspark_anonymizer"
dataframe_name = "table_x"

# Table resource that holds the anonymizer settings
dynamo_table = boto3.resource('dynamodb').Table(dynamo_table_name)
spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")

df_parsed = pyspark_anonymizer.ParserFromDynamoDB(df, dataframe_name, dynamo_table, spark_functions, client_error).parse()

df_parsed.limit(5).toPandas()

The output is the same as in the previous example. The only difference is that the anonymization settings are read from DynamoDB.

Currently supported data masking/anonymization methods

  • Methods
    • drop_column - Drop a column.
    • replace - Replace every value in a column with a given string.
    • replace_with_regex - Replace the parts of a column's values that match a regex.
    • sha256 - Apply the SHA-256 hash function to a column.
    • filter_row - Keep only the rows that match a condition.
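
Rules that use these methods can also be kept as a JSON string and checked before they reach the parser. The sketch below is an illustration rather than part of the library: `load_rules` and `REQUIRED_PARAMETERS` are hypothetical names, and the required parameters per method are simply read off the usage example in this README, so the library may accept or require others.

```python
import json

# Assumption: parameter names per method, taken from the usage example above.
REQUIRED_PARAMETERS = {
    "drop_column": {"column_name"},
    "replace": {"column_name", "replace_to"},
    "replace_with_regex": {"column_name", "replace_from_regex", "replace_to"},
    "sha256": {"column_name"},
    "filter_row": {"where"},
}


def load_rules(rules_json: str) -> list:
    """Parse a JSON string into a list of rule dicts and validate each rule."""
    rules = json.loads(rules_json)
    for position, rule in enumerate(rules):
        method = rule.get("method")
        if method not in REQUIRED_PARAMETERS:
            raise ValueError(f"rule {position}: unsupported method {method!r}")
        missing = REQUIRED_PARAMETERS[method] - set(rule.get("parameters", {}))
        if missing:
            raise ValueError(f"rule {position}: missing parameters {sorted(missing)}")
    return rules


# In JSON text a backslash must be doubled, so the regex R\d is written "R\\d".
RULES_JSON = r"""
[
  {"method": "drop_column", "parameters": {"column_name": "marketplace"}},
  {"method": "replace_with_regex",
   "parameters": {"column_name": "review_id",
                  "replace_from_regex": "R\\d",
                  "replace_to": "*"}}
]
"""

rules = load_rules(RULES_JSON)
print(rules[1]["parameters"]["replace_from_regex"])
```

The resulting `rules` list has the same shape as `dataframe_anonymizers` in the usage example, so it could be passed to the parser in its place.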