
tidypyspark

Make pyspark sing dplyr

Inspired by sparklyr and tidyverse

The tidypyspark Python package provides a minimal, pythonic wrapper around the pyspark SQL dataframe API in the tidyverse flavor.

  • With the ts accessor, apply tidypyspark methods whose input and output are, in most cases, pyspark dataframes.

  • Consistent ‘verbs’ (select, arrange, distinct, …)
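
To make the accessor idea concrete, here is a minimal pure-Python sketch of how a `ts`-style accessor can expose chainable verbs that take a table and return a table. It is not tidypyspark's implementation: `Table` and `TidyAccessor` are hypothetical names, the rows are toy data, and no pyspark is involved.

```python
# Toy illustration only: `Table` and `TidyAccessor` are hypothetical names,
# not part of tidypyspark or pyspark.

class TidyAccessor:
    """Holds a reference to a table and exposes verb-style methods."""

    def __init__(self, table):
        self._table = table

    def select(self, column_names):
        # Keep only the requested columns; return a new Table so calls chain.
        rows = [{name: row[name] for name in column_names}
                for row in self._table.rows]
        return Table(rows)

    def arrange(self, column_name):
        # Sort rows by one column; again return a new Table.
        return Table(sorted(self._table.rows, key=lambda row: row[column_name]))


class Table:
    """Minimal stand-in for a dataframe: a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    @property
    def ts(self):
        # Each access builds an accessor bound to this table.
        return TidyAccessor(self)


# Toy rows (made-up values).
table = Table([{"id": 2, "x": 5.0}, {"id": 1, "x": 9.0}])

# Every verb returns a Table, so the accessor can be used again on the result.
result = table.ts.arrange("id").ts.select(["x"])
```

Because each verb returns the same kind of object it received, calls can be chained through the accessor in the style of the Usage section.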

Also see tidypandas: A grammar of data manipulation for pandas inspired by tidyverse

Usage

# assumes an active pyspark session bound to the name `spark`
from tidypyspark import ts
import pyspark.sql.functions as F
from tidypyspark.datasets import get_penguins_path

pen = spark.read.csv(get_penguins_path(), header = True, inferSchema = True)

(pen.ts.add_row_number(order_by = 'bill_depth_mm')
    .ts.mutate({'cumsum_bl': F.sum('bill_length_mm')},
               by = 'species',
               order_by = ['bill_depth_mm', 'row_number'],
               range_between = (-float('inf'), 0)
               )
    .ts.select(['species', 'bill_length_mm', 'cumsum_bl'])
    ).show(5)
    
+-------+--------------+------------------+
|species|bill_length_mm|         cumsum_bl|
+-------+--------------+------------------+
| Adelie|          32.1|              32.1|
| Adelie|          35.2| 67.30000000000001|
| Adelie|          37.7|105.00000000000001|
| Adelie|          36.2|141.20000000000002|
| Adelie|          33.1|             174.3|
+-------+--------------+------------------+
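
For readers unfamiliar with window frames, the sketch below shows in plain Python what the grouped, ordered cumulative sum above computes: within each group, rows are sorted by the ordering columns and a running total is carried forward. The function `grouped_cumsum`, its parameters, and the toy rows are assumptions made for illustration. It covers only the case where the ordering is unique within a group, which the `row_number` tie-breaker provides in the example above.

```python
from itertools import groupby
from operator import itemgetter


def grouped_cumsum(rows, by, order_by, value, out):
    """Return new rows with a running total of `value` within each `by` group.

    Hypothetical helper for illustration; assumes the `order_by` columns
    order the rows of each group without ties.
    """
    def sort_key(row):
        return (row[by],) + tuple(row[col] for col in order_by)

    result = []
    # Sorting by the group column first lets groupby see each group once.
    for _, group in groupby(sorted(rows, key=sort_key), key=itemgetter(by)):
        running = 0.0
        for row in group:
            running += row[value]
            result.append({**row, out: running})
    return result


# Toy rows (made-up values, not taken from the penguins dataset).
rows = [
    {"species": "A", "depth": 2, "row_number": 2, "length": 10.0},
    {"species": "A", "depth": 1, "row_number": 1, "length": 5.0},
    {"species": "B", "depth": 3, "row_number": 3, "length": 7.0},
    {"species": "A", "depth": 2, "row_number": 4, "length": 1.0},
]

cumulative = grouped_cumsum(
    rows, by="species", order_by=["depth", "row_number"],
    value="length", out="cumsum_length",
)
```

The running total restarts for each group, which is the role `by = 'species'` plays in the `mutate` call.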

Example

  • tidypyspark code:

(pen.ts.select(['species','bill_length_mm','bill_depth_mm', 'flipper_length_mm'])
 .ts.pivot_longer('species', include = False)
 ).show(5)
 
+-------+-----------------+-----+
|species|             name|value|
+-------+-----------------+-----+
| Adelie|   bill_length_mm| 39.1|
| Adelie|    bill_depth_mm| 18.7|
| Adelie|flipper_length_mm|  181|
| Adelie|   bill_length_mm| 39.5|
| Adelie|    bill_depth_mm| 17.4|
+-------+-----------------+-----+

  • equivalent pyspark code:

stack_expr = '''
             stack(3, 'bill_length_mm', `bill_length_mm`,
                      'bill_depth_mm', `bill_depth_mm`,
                      'flipper_length_mm', `flipper_length_mm`)
                      as (`name`, `value`)
             '''
pen.select('species', F.expr(stack_expr)).show(5)
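
Both snippets perform the same wide-to-long reshape: every input row yields one (name, value) row per stacked column, with the identifier column repeated. A minimal pure-Python sketch of that reshape follows. The helper `pivot_longer_rows` is hypothetical and only illustrates the semantics. The first input row uses the values shown in the output above, while the `flipper_length_mm` value of the second row is illustrative.

```python
def pivot_longer_rows(rows, id_column, value_columns):
    """Turn each wide row into one long row per entry in `value_columns`.

    Hypothetical helper for illustration, not a tidypyspark function.
    """
    long_rows = []
    for row in rows:
        for name in value_columns:
            long_rows.append(
                {id_column: row[id_column], "name": name, "value": row[name]}
            )
    return long_rows


wide_rows = [
    {"species": "Adelie", "bill_length_mm": 39.1,
     "bill_depth_mm": 18.7, "flipper_length_mm": 181},
    # flipper_length_mm below is an illustrative value.
    {"species": "Adelie", "bill_length_mm": 39.5,
     "bill_depth_mm": 17.4, "flipper_length_mm": 186},
]

long_rows = pivot_longer_rows(
    wide_rows,
    id_column="species",
    value_columns=["bill_length_mm", "bill_depth_mm", "flipper_length_mm"],
)
```

Two wide rows with three value columns produce six long rows, in the same row-by-row order as the `show(5)` output above.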

tidypyspark relies on the amazing pyspark library and the Spark ecosystem.

Installation

pip install tidypyspark