Skip to content

Comparing Data Tools

Spark vs Pandas vs Polars vs SQL (SQLite)

For this exercise we will be using some data from two files: customers.csv and transactions.csv. For various actions we compare these popular methods for interacting with the data. Besides SQL, each tool has some setup involved, but the actual logic for the particular action in question is mostly very similar, however, sometimes it can be very different.

Customer Data
customer_idcustomer_nameemailphoneaccount_typecredit_limitsignup_date
1Customer 1customer1@email.com555-9768Premium36307.842023-10-04
2Customer 2customer2@email.com555-2590Basic32940.552021-09-01
3Customer 3customer3@email.comBasic5857.912020-01-11
4Customer 4customer4@email.com555-8829Premium4449.432021-01-20
5Customer 5customer5@email.com555-8271Premium14111.042020-12-02
6Customer 6customer6@email.comStandard33991.892020-06-26
7Customer 7customer7@email.com555-8719Premium5779.042021-04-16
8Customer 8customer8@email.com555-4043Basic27572.142022-04-20
9Customer 9customer9@email.com555-4276Basic19972.462021-01-15
10Customer 10customer10@email.com555-6148Premium16975.22021-08-11
11Customer 11customer11@email.com555-4570Standard25379.412021-08-17
12Customer 12customer12@email.comPremium28596.822023-10-19
13Customer 13customer13@email.com555-2595Premium28817.532022-06-17
14Customer 14customer14@email.com555-2175Basic20712.12020-12-20
15Customer 15customer15@email.comPremium37689.632023-10-12
16Customer 16customer16@email.com555-7266Premium38340.772022-05-15
17Customer 17customer17@email.com555-1367Premium3747.572023-04-05
18Customer 18customer18@email.com555-8288Premium12302.162020-11-18
19Customer 19customer19@email.com555-9901Standard49511.02022-12-23
20Customer 20customer20@email.comStandard39116.112021-01-12
21Customer 21customer21@email.com555-6673Basic41865.682022-12-05
22Customer 22customer22@email.com555-5575Basic48432.672021-11-08
23Customer 23customer23@email.com555-8618Standard20061.972021-11-13
24Customer 24customer24@email.com555-6487Basic39008.162023-02-25
25Customer 25customer25@email.com555-6855Premium12889.932021-08-11
26Customer 26customer26@email.com555-8298Standard41137.332023-03-27
27Customer 27customer27@email.com555-5321Premium22839.492020-07-12
28Customer 28customer28@email.com555-7884Standard7637.832023-09-20
29Customer 29customer29@email.comBasic28992.532023-08-11
30Customer 30customer30@email.com555-1906Standard26907.52020-09-04
31Customer 31555-3466Standard4242.892022-10-23
32Customer 32customer32@email.com555-9962Standard8682.172021-04-11
33Customer 33555-1861Premium48181.482022-10-14
34Customer 34customer34@email.com555-7099Standard7730.412022-11-14
35Customer 35customer35@email.com555-9947Basic2692.52020-02-22
36Customer 36customer36@email.com555-5152Premium16129.562020-06-11
37Customer 37customer37@email.com555-9281Basic4753.412023-08-11
38Customer 38customer38@email.com555-9244Standard42398.552023-04-23
39Customer 39customer39@email.com555-5626Standard21906.162022-06-11
40Customer 40customer40@email.com555-6182Premium28809.772020-07-09
41Customer 41customer41@email.com555-8629Standard45524.172023-04-30
42Customer 42customer42@email.com555-5345Standard9943.522023-08-16
43Customer 43customer43@email.com555-9488Standard10698.982023-03-22
44Customer 44555-3388Basic3196.652021-07-07
45Customer 45555-4124Standard13501.62021-10-08
46Customer 46customer46@email.com555-8605Basic36665.82023-04-04
47Customer 47customer47@email.com555-1325Standard43873.062020-02-08
48Customer 48555-9298Standard27285.092020-08-08
49Customer 49customer49@email.com555-1066Standard33620.842022-06-22
50Customer 50customer50@email.com555-6506Basic28081.522022-06-22
Transactions Data
transaction_idcustomer_idtransaction_dateamountmerchantcategorypayment_methodfeestatusis_fraud
1202024-09-23368.57AmazonClothing9.21FailedFalse
2382024-03-041825.21CostcoElectronicsCredit Card45.63CompletedFalse
332024-08-04369.67TargetClothing9.24PendingTrue
4132023-03-04800.41WalmartDebit Card20.01CompletedFalse
5172024-01-123420.25TargetClothingCredit Card85.51PendingFalse
6382023-07-11425.04EntertainmentCredit Card10.63PendingFalse
7122023-02-173937.0598.43PendingFalse
8342023-10-094432.32ElectronicsCredit CardFailedFalse
9452023-03-17907.79WalmartElectronicsPayPal22.69CompletedFalse
10172024-11-183243.31WalmartEntertainmentDebit Card81.08PendingFalse
111012024-05-26847.24Best BuyClothingPayPal21.18FailedFalse
12122024-03-30935.91WalmartClothingPayPal23.4PendingFalse
1332024-01-26501.09AmazonElectronicsDebit Card12.53CompletedFalse
14422024-11-08614.13GroceriesCredit Card15.35CompletedFalse
15292024-10-212427.34TargetPayPal60.68PendingFalse
16222024-06-132092.33EntertainmentCredit Card52.31PendingFalse
1792023-02-142632.23Best BuyElectronicsCredit CardCompletedFalse
18382024-10-032366.26TargetGroceries59.16CompletedFalse
1952023-03-272886.74WalmartGroceriesDebit Card72.17PendingFalse
20342023-01-0116.89CostcoEntertainment0.42FailedFalse
21132024-12-112003.34EntertainmentPayPal50.08PendingFalse
22272024-10-314221.48Best BuyElectronicsCredit Card105.54FailedFalse
23412024-03-314081.89TargetCredit Card102.05FailedFalse
2412023-11-221898.95CostcoEntertainmentPayPalPendingFalse
2552023-10-11158.18WalmartEntertainmentDebit Card3.95FailedFalse
26112023-07-124671.51TargetElectronicsPayPalCompletedFalse
27132023-10-033692.06TargetElectronicsDebit Card92.3PendingFalse
2852024-11-191771.78AmazonDebit Card44.29PendingFalse
2972024-01-06644.34Best BuyElectronicsCredit Card16.11PendingFalse
301052023-04-191177.46CostcoGroceriesDebit CardCompletedFalse
311082024-03-092896.3WalmartGroceries72.41PendingFalse
32152023-03-172634.9AmazonElectronicsCredit Card65.87FailedFalse
33182024-07-081436.45TargetElectronicsPayPal35.91FailedFalse
34442023-09-0943.36TargetDebit Card1.08CompletedFalse
35192024-08-01461.43TargetFailedFalse
36312024-01-231281.94CostcoClothingDebit Card32.05FailedFalse
37432024-09-033309.17WalmartGroceriesPayPal82.73PendingFalse
381102023-07-154081.56AmazonClothingPayPal102.04PendingFalse
3952024-07-174117.67CostcoGroceriesDebit Card102.94FailedFalse
4062024-03-181970.6CostcoCredit Card49.27CompletedFalse
41442024-04-034009.64WalmartEntertainmentDebit CardPendingFalse
42402023-07-123238.13TargetClothingCredit CardCompletedFalse
43382024-01-233180.48CostcoEntertainmentDebit CardCompletedFalse
44202023-09-281402.16Best BuyEntertainmentDebit Card35.05FailedFalse
4542024-10-252827.95WalmartGroceriesDebit Card70.7CompletedFalse
46272023-12-083174.68AmazonPayPalPendingFalse
47232024-06-131374.58CostcoClothingDebit Card34.36CompletedFalse
48302024-12-071552.54AmazonGroceries38.81CompletedFalse
49302024-02-091041.88Debit Card26.05PendingFalse
50162024-11-242543.75WalmartGroceries63.59PendingFalse

To fully transparent, SQLite databases do not just come with data handmaid. So how do you get data into a SQLite database? There are several ways, here are a few.

Seeding Data

sh
.mode csv
.import data.csv tablename
python
# Load the CSV data into a Pandas DataFrame
df = pd.read_csv('../data/customers.csv')

# Create a connection to a new or existing SQLite database
conn = sqlite3.connect('finance.db')

# Write the data
df.to_sql('customers', conn, if_exists='replace', index=False)

conn.commit()
conn.close()
python
# Create SQLite engine
engine = create_engine("sqlite:///../finance.db")

# Load CSV data into Polars DataFrames
dataframe = pl.read_csv("../../data/customers.csv")

# Write the data to a SQLite table
dataframe.write_database(
    table_name="customers", connection=engine, if_exists="replace"
)
python
spark = (
    SparkSession.builder.appName("IngestionApp")
    .master("local[*]")
    .config("spark.jars", "/opt/spark/jars/sqlite-jdbc-3.45.1.0.jar")
    .getOrCreate()
)

df = spark.read.csv("../../data/customers.csv", header=True, inferSchema=True)

jdbc_url = "jdbc:sqlite:../../finance.db"
table_name = "customers"
properties = "org.sqlite.JDBC"

(
    df.write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", table_name)
    .option("driver", properties)
    .mode("overwrite")
    .save()
)

spark.stop()

The sqlite-jdbc-3.45.1.0.jar can be found at the Maven Repository. I would suggest using the version number that matches the SQLite version.

Reading a File

sql
select * from customers;
python
import pandas as pd

df = pd.read_csv("../data/customers.csv")
df.head()
python
import polars as pl

df = pl.read_csv("../data/customers.csv")
df.head()
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataProcessing").master("local[*]").getOrCreate()

df = spark.read.csv("../data/customers.csv", header=True, inferSchema=True)

df.show(n=50)