PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

Overview


h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark



PySpark bindings for the H3 core library.

For available functions, please see the vanilla Python binding (h3-py) documentation.

Installation

From PyPI:

pip install h3-pyspark

From conda:

conda config --add channels conda-forge
conda install h3-pyspark

Usage

>>> from pyspark.sql import SparkSession, functions as F
>>> import h3_pyspark
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([{'lat': 37.769377, 'lng': -122.388903, 'resolution': 9}])
>>>
>>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))
>>> df.show()

+---------+-----------+----------+---------------+
|      lat|        lng|resolution|           h3_9|
+---------+-----------+----------+---------------+
|37.769377|-122.388903|         9|89283082e73ffff|
+---------+-----------+----------+---------------+
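
The extension functions follow the same column-in, column-out pattern. Below is a hedged sketch using index_shape (the polygon coordinates are illustrative; index_shape is assumed to take a GeoJSON string column and a resolution column, matching its signature in the issues below):

>>> import json
>>>
>>> polygon = json.dumps({
...     'type': 'Polygon',
...     'coordinates': [[[-122.39, 37.77], [-122.39, 37.76], [-122.38, 37.76],
...                      [-122.38, 37.77], [-122.39, 37.77]]],
... })
>>> df2 = spark.createDataFrame([{'geometry': polygon, 'resolution': 9}])
>>> # h3_cells is an array<string> column of the covering cells
>>> df2 = df2.withColumn('h3_cells', h3_pyspark.index_shape('geometry', 'resolution'))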

Publishing

  1. Bump version in setup.cfg
  2. Publish:
python3 -m build
python3 -m twine upload --repository pypi dist/*
Comments
  • 'TypeError: must be real number, not NoneType' when using h3_pyspark

    Hi, I have the following Spark dataframe, where the column of H3 indices is created by applying the lat/lng pairs and the resolution to the h3_pyspark.geo_to_h3(lat, lng, resolution) function. However, I encountered the following error when I tried to check whether there are any nulls in the index column. It's not only isNull() that fails: every other subsetting operation throws the same error. Could anyone provide some insight into what the issue might be and how to fix it? Thanks in advance!

    [screenshots of the dataframe and the error traceback omitted]
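
    This usually means null lat/lng values reached the underlying h3 call: Spark evaluates UDFs lazily, so the failure only surfaces when an action forces the indexed column to be computed, e.g. during the isNull() check. As a hedged workaround sketch (versions 1.2.4+ also handle null inputs inside the UDFs, per the releases below), guard the UDF call with a conditional so nulls never reach it:

    from pyspark.sql import functions as F

    # Index only rows with complete inputs; other rows get a null h3_9.
    complete = (F.col('lat').isNotNull()
                & F.col('lng').isNotNull()
                & F.col('resolution').isNotNull())
    df = df.withColumn('h3_9', F.when(complete, h3_pyspark.geo_to_h3('lat', 'lng', 'resolution')))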

    opened by Tingmi 5
  • Fix indexing for polygons and lines

    Catches some edge cases that h3_line and polyfill would miss. The result could be overbroad, which is why the docstrings now say "superset", but at least it should be complete.

    opened by rwaldman 1
  • Better error handling when null values are passed in

    Currently the behavior for all UDFs is that if any row in your dataframe has a null value, the entire job fails.

    This type of behavior would be better and more resilient:

    from pyspark.sql import functions as F, types as T

    @F.udf(T.ArrayType(T.StringType()))
    def index_shape(geometry, resolution):
        # Return null output for null input instead of failing the whole job.
        if geometry is None:
            return None
        return _index_shape(geometry, resolution)
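
    The same guard can be factored into a decorator applied beneath @F.udf. A minimal sketch (handle_nulls mirrors the name used in the multipolygon snippet below; this is an assumed implementation, not necessarily the library's):

    from functools import wraps

    def handle_nulls(func):
        # Return None if any argument is None, instead of raising in the executor.
        @wraps(func)
        def wrapper(*args):
            if any(arg is None for arg in args):
                return None
            return func(*args)
        return wrapper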
    
    opened by kevinschaich 1
  • Fix bug in index_shape function which missed hexes for long line segments

    Fixes #8

    [Before/after screenshots of the problematic line and polygon omitted; the new behavior fills in the previously missed hexes]

    cc: @deankieserman @rwaldman

    opened by kevinschaich 0
  • Bug in index_shape function which misses several hexes

    Reported by @rwaldman: in the worst case we can miss several hexes when a line's start and end points run east-to-west near the north or south edge of the hexes they pass through. [illustration omitted]

    The proposed solution for long line segments (length ≥ s, where s is the hex side length) is to interpolate several points along the line based on the selected resolution, so that we catch the hexes in between; a sketch of this idea follows. [illustration omitted]
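
    A hedged sketch of that interpolation with the vanilla h3 (v3) binding; sampling at half the hex edge length is an assumption for illustration, not necessarily the exact fix that landed:

    import math

    import h3  # vanilla Python binding, v3 API

    def _haversine_km(a, b):
        # Great-circle distance between two (lat, lng) points in kilometers.
        lat1, lng1, lat2, lng2 = map(math.radians, (*a, *b))
        x = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(x))

    def index_segment(start, end, resolution):
        # Sample points more finely than one hex so no cell is skipped.
        step_km = h3.edge_length(resolution, unit='km') / 2
        n = max(1, math.ceil(_haversine_km(start, end) / step_km))
        cells = set()
        for i in range(n + 1):
            t = i / n
            # Linear interpolation in lat/lng is rough but adequate at the
            # segment lengths where hexes get skipped.
            lat = start[0] + t * (end[0] - start[0])
            lng = start[1] + t * (end[1] - start[1])
            cells.add(h3.geo_to_h3(lat, lng, resolution))
        return cells
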
    opened by kevinschaich 0
  • polyfill fails with valid multipolygon geojson

    h3_pyspark.polyfill fails when a valid MultiPolygon GeoJSON is provided. This is expected behavior when using the native h3 library.

    However, I thought it would be helpful if this library could accept multipolygons. Could I get permission to push a PR?

    Implementation in src/h3_pyspark/__init__.py:

    @F.udf(returnType=T.ArrayType(T.StringType()))
    @handle_nulls
    def polyfill(polygons, res, geo_json_conformant):
        # NOTE: this behavior differs from the default h3 library:
        # h3-pyspark expects the `polygons` argument to be a valid GeoJSON string
        polygons = json.loads(polygons)
        type_ = polygons["type"].lower()
        if type_ == "multipolygon":
            output = []
            for i in polygons["coordinates"]:
                _polygon = {"type": "Polygon", "coordinates": i}
                output.extend(list(h3.polyfill(_polygon, res, geo_json_conformant)))
            return sanitize_types(output)
        return sanitize_types(h3.polyfill(polygons, res, geo_json_conformant))
    

    Test in tests/test_core.py:

    multipolygon = '{"type": "MultiPolygon","coordinates": [[[[108.98309290409088,13.240363245242063],[108.98343622684479,13.240363245242063],[108.98343622684479,13.240634779729014],[108.98309290409088,13.240634779729014],[108.98309290409088,13.240363245242063]]],[[[108.98349523544312,13.240002939397714],[108.98389220237732,13.240002939397714],[108.98389220237732,13.240269252464502],[108.98349523544312,13.240269252464502],[108.98349523544312,13.240002939397714]]]]}'
    
    def test_polyfill_multipolygon(self):
        h3_test_args, h3_pyspark_test_args = get_test_args(h3.polyfill)
        print(h3_pyspark_test_args)
        integer = 12
        data = {
            "res": integer,
            "geo_json_conformant": True,
            "geojson": multipolygon,
        }
        df = spark.createDataFrame([data])
        actual = df.withColumn("actual", h3_pyspark.polyfill(*h3_pyspark_test_args))
        actual = actual.collect()[0]["actual"]
        print(actual)
        expected = []
        for i in json.loads(multipolygon)["coordinates"]:
            _polygon = {"type": "Polygon", "coordinates": i}
            expected.extend(list(h3.polyfill(_polygon, integer, True)))
        expected = sanitize_types(expected)
        assert sort(actual) == sort(expected)
    
    opened by kangeugine 0
Releases (latest: 1.2.6)
  • 1.2.6 (Mar 10, 2022)

  • 1.2.4 (Mar 4, 2022)

    What's Changed

    • Handle null values in inputs to UDFs by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/10

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.3...1.2.4

  • 1.2.3 (Feb 24, 2022)

    What's Changed

    • Add error handling for bad geometries by @deankieserman in https://github.com/kevinschaich/h3-pyspark/pull/3
    • Fix bug in index_shape function which missed hexes for long line segments by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/9

    New Contributors

    • @deankieserman made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/3

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.2...1.2.3

  • 1.1.0 (Dec 8, 2021)

    What's Changed

    • Create LICENSE by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/1
    • Add extension functions (index_shape, k_ring_distinct) for spatial indexing & buffers by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/2

    New Contributors

    • @kevinschaich made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/1

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/commits/1.1.0

Owner
Kevin Schaich
Solving awesome problems @palantir. Part-time open source junkie. Purveyor of hot coffee and thoughtful photographs.