Comparing Data Tools
Spark vs Pandas vs Polars vs SQL (SQLite)
For this exercise we will be using some data from two files: customers.csv and transactions.csv. For various actions we compare these popular methods for interacting with the data. Besides SQL, each tool has some setup involved, but the actual logic for the particular action in question is mostly very similar, however, sometimes it can be very different.
Customer Data
| customer_id | customer_name | phone | account_type | credit_limit | signup_date | |
|---|---|---|---|---|---|---|
| 1 | Customer 1 | customer1@email.com | 555-9768 | Premium | 36307.84 | 2023-10-04 |
| 2 | Customer 2 | customer2@email.com | 555-2590 | Basic | 32940.55 | 2021-09-01 |
| 3 | Customer 3 | customer3@email.com | Basic | 5857.91 | 2020-01-11 | |
| 4 | Customer 4 | customer4@email.com | 555-8829 | Premium | 4449.43 | 2021-01-20 |
| 5 | Customer 5 | customer5@email.com | 555-8271 | Premium | 14111.04 | 2020-12-02 |
| 6 | Customer 6 | customer6@email.com | Standard | 33991.89 | 2020-06-26 | |
| 7 | Customer 7 | customer7@email.com | 555-8719 | Premium | 5779.04 | 2021-04-16 |
| 8 | Customer 8 | customer8@email.com | 555-4043 | Basic | 27572.14 | 2022-04-20 |
| 9 | Customer 9 | customer9@email.com | 555-4276 | Basic | 19972.46 | 2021-01-15 |
| 10 | Customer 10 | customer10@email.com | 555-6148 | Premium | 16975.2 | 2021-08-11 |
| 11 | Customer 11 | customer11@email.com | 555-4570 | Standard | 25379.41 | 2021-08-17 |
| 12 | Customer 12 | customer12@email.com | Premium | 28596.82 | 2023-10-19 | |
| 13 | Customer 13 | customer13@email.com | 555-2595 | Premium | 28817.53 | 2022-06-17 |
| 14 | Customer 14 | customer14@email.com | 555-2175 | Basic | 20712.1 | 2020-12-20 |
| 15 | Customer 15 | customer15@email.com | Premium | 37689.63 | 2023-10-12 | |
| 16 | Customer 16 | customer16@email.com | 555-7266 | Premium | 38340.77 | 2022-05-15 |
| 17 | Customer 17 | customer17@email.com | 555-1367 | Premium | 3747.57 | 2023-04-05 |
| 18 | Customer 18 | customer18@email.com | 555-8288 | Premium | 12302.16 | 2020-11-18 |
| 19 | Customer 19 | customer19@email.com | 555-9901 | Standard | 49511.0 | 2022-12-23 |
| 20 | Customer 20 | customer20@email.com | Standard | 39116.11 | 2021-01-12 | |
| 21 | Customer 21 | customer21@email.com | 555-6673 | Basic | 41865.68 | 2022-12-05 |
| 22 | Customer 22 | customer22@email.com | 555-5575 | Basic | 48432.67 | 2021-11-08 |
| 23 | Customer 23 | customer23@email.com | 555-8618 | Standard | 20061.97 | 2021-11-13 |
| 24 | Customer 24 | customer24@email.com | 555-6487 | Basic | 39008.16 | 2023-02-25 |
| 25 | Customer 25 | customer25@email.com | 555-6855 | Premium | 12889.93 | 2021-08-11 |
| 26 | Customer 26 | customer26@email.com | 555-8298 | Standard | 41137.33 | 2023-03-27 |
| 27 | Customer 27 | customer27@email.com | 555-5321 | Premium | 22839.49 | 2020-07-12 |
| 28 | Customer 28 | customer28@email.com | 555-7884 | Standard | 7637.83 | 2023-09-20 |
| 29 | Customer 29 | customer29@email.com | Basic | 28992.53 | 2023-08-11 | |
| 30 | Customer 30 | customer30@email.com | 555-1906 | Standard | 26907.5 | 2020-09-04 |
| 31 | Customer 31 | 555-3466 | Standard | 4242.89 | 2022-10-23 | |
| 32 | Customer 32 | customer32@email.com | 555-9962 | Standard | 8682.17 | 2021-04-11 |
| 33 | Customer 33 | 555-1861 | Premium | 48181.48 | 2022-10-14 | |
| 34 | Customer 34 | customer34@email.com | 555-7099 | Standard | 7730.41 | 2022-11-14 |
| 35 | Customer 35 | customer35@email.com | 555-9947 | Basic | 2692.5 | 2020-02-22 |
| 36 | Customer 36 | customer36@email.com | 555-5152 | Premium | 16129.56 | 2020-06-11 |
| 37 | Customer 37 | customer37@email.com | 555-9281 | Basic | 4753.41 | 2023-08-11 |
| 38 | Customer 38 | customer38@email.com | 555-9244 | Standard | 42398.55 | 2023-04-23 |
| 39 | Customer 39 | customer39@email.com | 555-5626 | Standard | 21906.16 | 2022-06-11 |
| 40 | Customer 40 | customer40@email.com | 555-6182 | Premium | 28809.77 | 2020-07-09 |
| 41 | Customer 41 | customer41@email.com | 555-8629 | Standard | 45524.17 | 2023-04-30 |
| 42 | Customer 42 | customer42@email.com | 555-5345 | Standard | 9943.52 | 2023-08-16 |
| 43 | Customer 43 | customer43@email.com | 555-9488 | Standard | 10698.98 | 2023-03-22 |
| 44 | Customer 44 | 555-3388 | Basic | 3196.65 | 2021-07-07 | |
| 45 | Customer 45 | 555-4124 | Standard | 13501.6 | 2021-10-08 | |
| 46 | Customer 46 | customer46@email.com | 555-8605 | Basic | 36665.8 | 2023-04-04 |
| 47 | Customer 47 | customer47@email.com | 555-1325 | Standard | 43873.06 | 2020-02-08 |
| 48 | Customer 48 | 555-9298 | Standard | 27285.09 | 2020-08-08 | |
| 49 | Customer 49 | customer49@email.com | 555-1066 | Standard | 33620.84 | 2022-06-22 |
| 50 | Customer 50 | customer50@email.com | 555-6506 | Basic | 28081.52 | 2022-06-22 |
Transactions Data
| transaction_id | customer_id | transaction_date | amount | merchant | category | payment_method | fee | status | is_fraud |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 20 | 2024-09-23 | 368.57 | Amazon | Clothing | 9.21 | Failed | False | |
| 2 | 38 | 2024-03-04 | 1825.21 | Costco | Electronics | Credit Card | 45.63 | Completed | False |
| 3 | 3 | 2024-08-04 | 369.67 | Target | Clothing | 9.24 | Pending | True | |
| 4 | 13 | 2023-03-04 | 800.41 | Walmart | Debit Card | 20.01 | Completed | False | |
| 5 | 17 | 2024-01-12 | 3420.25 | Target | Clothing | Credit Card | 85.51 | Pending | False |
| 6 | 38 | 2023-07-11 | 425.04 | Entertainment | Credit Card | 10.63 | Pending | False | |
| 7 | 12 | 2023-02-17 | 3937.05 | 98.43 | Pending | False | |||
| 8 | 34 | 2023-10-09 | 4432.32 | Electronics | Credit Card | Failed | False | ||
| 9 | 45 | 2023-03-17 | 907.79 | Walmart | Electronics | PayPal | 22.69 | Completed | False |
| 10 | 17 | 2024-11-18 | 3243.31 | Walmart | Entertainment | Debit Card | 81.08 | Pending | False |
| 11 | 101 | 2024-05-26 | 847.24 | Best Buy | Clothing | PayPal | 21.18 | Failed | False |
| 12 | 12 | 2024-03-30 | 935.91 | Walmart | Clothing | PayPal | 23.4 | Pending | False |
| 13 | 3 | 2024-01-26 | 501.09 | Amazon | Electronics | Debit Card | 12.53 | Completed | False |
| 14 | 42 | 2024-11-08 | 614.13 | Groceries | Credit Card | 15.35 | Completed | False | |
| 15 | 29 | 2024-10-21 | 2427.34 | Target | PayPal | 60.68 | Pending | False | |
| 16 | 22 | 2024-06-13 | 2092.33 | Entertainment | Credit Card | 52.31 | Pending | False | |
| 17 | 9 | 2023-02-14 | 2632.23 | Best Buy | Electronics | Credit Card | Completed | False | |
| 18 | 38 | 2024-10-03 | 2366.26 | Target | Groceries | 59.16 | Completed | False | |
| 19 | 5 | 2023-03-27 | 2886.74 | Walmart | Groceries | Debit Card | 72.17 | Pending | False |
| 20 | 34 | 2023-01-01 | 16.89 | Costco | Entertainment | 0.42 | Failed | False | |
| 21 | 13 | 2024-12-11 | 2003.34 | Entertainment | PayPal | 50.08 | Pending | False | |
| 22 | 27 | 2024-10-31 | 4221.48 | Best Buy | Electronics | Credit Card | 105.54 | Failed | False |
| 23 | 41 | 2024-03-31 | 4081.89 | Target | Credit Card | 102.05 | Failed | False | |
| 24 | 1 | 2023-11-22 | 1898.95 | Costco | Entertainment | PayPal | Pending | False | |
| 25 | 5 | 2023-10-11 | 158.18 | Walmart | Entertainment | Debit Card | 3.95 | Failed | False |
| 26 | 11 | 2023-07-12 | 4671.51 | Target | Electronics | PayPal | Completed | False | |
| 27 | 13 | 2023-10-03 | 3692.06 | Target | Electronics | Debit Card | 92.3 | Pending | False |
| 28 | 5 | 2024-11-19 | 1771.78 | Amazon | Debit Card | 44.29 | Pending | False | |
| 29 | 7 | 2024-01-06 | 644.34 | Best Buy | Electronics | Credit Card | 16.11 | Pending | False |
| 30 | 105 | 2023-04-19 | 1177.46 | Costco | Groceries | Debit Card | Completed | False | |
| 31 | 108 | 2024-03-09 | 2896.3 | Walmart | Groceries | 72.41 | Pending | False | |
| 32 | 15 | 2023-03-17 | 2634.9 | Amazon | Electronics | Credit Card | 65.87 | Failed | False |
| 33 | 18 | 2024-07-08 | 1436.45 | Target | Electronics | PayPal | 35.91 | Failed | False |
| 34 | 44 | 2023-09-09 | 43.36 | Target | Debit Card | 1.08 | Completed | False | |
| 35 | 19 | 2024-08-01 | 461.43 | Target | Failed | False | |||
| 36 | 31 | 2024-01-23 | 1281.94 | Costco | Clothing | Debit Card | 32.05 | Failed | False |
| 37 | 43 | 2024-09-03 | 3309.17 | Walmart | Groceries | PayPal | 82.73 | Pending | False |
| 38 | 110 | 2023-07-15 | 4081.56 | Amazon | Clothing | PayPal | 102.04 | Pending | False |
| 39 | 5 | 2024-07-17 | 4117.67 | Costco | Groceries | Debit Card | 102.94 | Failed | False |
| 40 | 6 | 2024-03-18 | 1970.6 | Costco | Credit Card | 49.27 | Completed | False | |
| 41 | 44 | 2024-04-03 | 4009.64 | Walmart | Entertainment | Debit Card | Pending | False | |
| 42 | 40 | 2023-07-12 | 3238.13 | Target | Clothing | Credit Card | Completed | False | |
| 43 | 38 | 2024-01-23 | 3180.48 | Costco | Entertainment | Debit Card | Completed | False | |
| 44 | 20 | 2023-09-28 | 1402.16 | Best Buy | Entertainment | Debit Card | 35.05 | Failed | False |
| 45 | 4 | 2024-10-25 | 2827.95 | Walmart | Groceries | Debit Card | 70.7 | Completed | False |
| 46 | 27 | 2023-12-08 | 3174.68 | Amazon | PayPal | Pending | False | ||
| 47 | 23 | 2024-06-13 | 1374.58 | Costco | Clothing | Debit Card | 34.36 | Completed | False |
| 48 | 30 | 2024-12-07 | 1552.54 | Amazon | Groceries | 38.81 | Completed | False | |
| 49 | 30 | 2024-02-09 | 1041.88 | Debit Card | 26.05 | Pending | False | ||
| 50 | 16 | 2024-11-24 | 2543.75 | Walmart | Groceries | 63.59 | Pending | False |
To fully transparent, SQLite databases do not just come with data handmaid. So how do you get data into a SQLite database? There are several ways, here are a few.
Seeding Data
sh
.mode csv
.import data.csv tablenamepython
# Load the CSV data into a Pandas DataFrame
df = pd.read_csv('../data/customers.csv')
# Create a connection to a new or existing SQLite database
conn = sqlite3.connect('finance.db')
# Write the data
df.to_sql('customers', conn, if_exists='replace', index=False)
conn.commit()
conn.close()python
# Create SQLite engine
engine = create_engine("sqlite:///../finance.db")
# Load CSV data into Polars DataFrames
dataframe = pl.read_csv("../../data/customers.csv")
# Write the data to a SQLite table
dataframe.write_database(
table_name="customers", connection=engine, if_exists="replace"
)python
spark = (
SparkSession.builder.appName("IngestionApp")
.master("local[*]")
.config("spark.jars", "/opt/spark/jars/sqlite-jdbc-3.45.1.0.jar")
.getOrCreate()
)
df = spark.read.csv("../../data/customers.csv", header=True, inferSchema=True)
jdbc_url = "jdbc:sqlite:../../finance.db"
table_name = "customers"
properties = "org.sqlite.JDBC"
(
df.write.format("jdbc")
.option("url", jdbc_url)
.option("dbtable", table_name)
.option("driver", properties)
.mode("overwrite")
.save()
)
spark.stop()The sqlite-jdbc-3.45.1.0.jar can be found at the Maven Repository. I would suggest using the version number that matches the SQLite version.
Reading a File
sql
select * from customers;python
import pandas as pd
df = pd.read_csv("../data/customers.csv")
df.head()python
import polars as pl
df = pl.read_csv("../data/customers.csv")
df.head()python
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataProcessing").master("local[*]").getOrCreate()
df = spark.read.csv("../data/customers.csv", header=True, inferSchema=True)
df.show(n=50)