Apache Spark with Scala 101: Hands-on data analysis guide (2026)

This post originally appeared on the chaosgenius.io blog. Chaos Genius has been acquired by Flexera.

Apache Spark has revolutionized big data processing, providing fast, near real-time, horizontally scalable and distributed computing capabilities. It’s no wonder that Spark has become the preferred choice for enterprises seeking high-speed, efficient solutions for processing big datasets in parallel. What genuinely distinguishes Spark is its tight connection with Scala, the language it was originally built in. That relationship isn’t accidental. Scala’s seamless compatibility with Spark provides considerable benefits, including efficient code execution, extensive functional programming support and strong type safety—all of which improve performance and reliability.

In this article, we’ll cover everything you need to know about Apache Spark with Scala, why the two are a powerful combination for data processing and walk through a step-by-step hands-on tutorial to get you started with Scala APIs and Spark for data analysis.

What is Apache Spark?

Apache Spark is an open source, distributed computing system designed for fast and general-purpose data processing at scale. Spark started as a research project at UC Berkeley’s AMPLab in 2009, was open sourced in early 2010 under a BSD license and was donated to the Apache Software Foundation project in 2013. It became an Apache top-level project in February 2014.

Spark was created to address the limitations of Hadoop MapReduce, particularly its poor performance with iterative algorithms and interactive data analysis.

Apache Spark logo (Source: spark.apache.org)

Here are some key features of Apache Spark:

In-memory computing — Apache Spark can cache data in memory across operations, significantly speeding up iterative algorithms and interactive queries compared to disk-based systems.
Distributed processing — Apache Spark can distribute data and computations across clusters of computers, allowing it to process massive datasets efficiently.
Fault tolerance — Apache Spark’s core abstraction, Resilient Distributed Datasets (RDDs), lets it automatically recover from node failures by maintaining lineage information.
Lazy evaluation — Apache Spark uses lazy evaluation in its functional programming model, which allows for optimized execution plans
Unified engine — Apache Spark provides a consistent platform for various data processing tasks, including batch processing, interactive queries, streaming, machine learning and graph processing.
Rich ecosystem — Apache Spark ships with higher-level libraries for SQL (Spark SQL), machine learning (MLlib), graph processing (GraphX) and stream processing (Structured Streaming).
Polyglot nature — Apache Spark is primarily written in Scala but it also offers APIs in Java, Python and R, making it accessible to a wide range of developers and engineers.

TL;DR: Apache Spark isn’t just a computation engine. It’s a framework that makes distributed computing approachable and big data processing more efficient.

Why Spark?

Now that we understand what Spark is, let’s explore why it has become such a crucial tool in the big data landscape.

One reason Spark is widely used is that it’s fast, scalable and easy to use. It’s different from Hadoop MapReduce because it processes data in-memory, which reduces processing time. That makes it versatile enough to handle varied workloads on the same engine: machine learning, graph processing, batch jobs and real-time analytics. Whether you’re working with a huge amount of data for real-time analytics or running big batch jobs, Spark can handle it efficiently.

Apache Spark has a wide range of use cases across various industries and applications.

E-commerce: Real-time product recommendations, customer segmentation and inventory optimization
Finance: Fraud detection, risk assessment and algorithmic trading
Healthcare: Patient data analysis, disease prediction and drug discovery
IoT and telematics: Real-time sensor data analysis and predictive maintenance
Social media: Network analysis, sentiment analysis and trend prediction
Log analysis: Real-time monitoring and anomaly detection in system logs

Apache Spark’s power and adaptability make it a top tool for organizations facing big data challenges. It’s great for handling batch jobs, interactive queries and building ML models. Plus, it can process real-time data streams. What’s more, Spark offers one platform to take care of all these tasks.

Spark and Scala: A symbiotic relationship

Scala‘s a natural fit with Spark since it was built using the same language. Scala, which stands for “Scalable Language“, is a statically typed programming language that combines object-oriented and functional programming paradigms. The close integration between Scala and Spark has led to a symbiotic relationship, with Scala being the primary language for developing Spark applications.

Features and benefits of using Spark with Scala

Let’s dive into why Scala is Spark’s native language and the benefits this brings:

1. Performance — Scala compiles to Java bytecode and runs on the Java Virtual Machine (JVM), which provides strong performance and optimization capabilities. Its type system catches many errors at compile time rather than runtime, which matters in distributed systems where runtime errors are costly to debug.

2. Expressiveness — Scala’s syntax is concise yet powerful, letting developers express complex operations in a readable way. This is especially useful for defining data transformations and analyses in Spark.

3. Functional programming — Scala’s strong support for functional programming aligns well with Spark’s design, which heavily uses immutable data structures and higher-order functions.

4. Type safety — Scala’s strong static typing helps catch errors at compile-time, reducing runtime errors in distributed environments.

val data: RDD[Int] = sc.parallelize(1 to 10)
val result: RDD[String] = data.map(_.toString)

5. Performance optimizations — Scala’s compiler can perform various optimizations and its close alignment with Spark’s internals allows for efficient code execution. For example, Scala’s case classes work seamlessly with Spark’s serialization system.

6. Access to latest Spark features — Since Apache Spark is primarily developed in Scala, new features typically appear in the Scala API first before being ported to other language APIs.

7. Seamless Java interoperability — Scala’s interoperability with Java means you can easily use Java libraries in your Spark applications when needed.

Scala APIs vs PySpark, SparkR and Java APIs

PySpark is popular because of Python‘s ease of use and its rich machine-learning ecosystem, but Scala offers better performance and type safety. SparkR is useful for data scientists familiar with R, but it lacks the functional programming power and deep integration Scala provides. Java APIs are available but more verbose compared to Scala’s concise syntax.

TL;DR:

PySpark (Python API):
➤ Pros: Easier for Python developers, great for data science workflows, integrates well with NumPy, Pandas and similar libraries
➤ Cons: Generally slower than Scala due to serialization overhead, may lag behind in features

SparkR (R API):
➤ Pros: Familiar for R users, good for statistical computing and plotting
➤ Cons: Limited functionality compared to Scala API, performance overhead

Java API:
➤ Pros: Familiar for Java developers, good performance
➤ Cons: More verbose than Scala, lacks some of Scala’s functional programming features

Here’s a quick comparison of how a word count program might look in Spark Scala APIs and PySpark (Spark Python) APIs:

1) Scala:

val input = List("Apache Spark is great", "Scala is powerful", "Apache Spark With Scala", "Spark and Scala", "Spark Scala Tutorial", "Scala with Spark Tutorial", "Spark on Scala", "Apache Spark Architecture","Spark Scala Architecture", "Installing Apache Spark", "Scala Build Tool", "SBT" )

val wordCounts = sc.parallelize(input).flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

wordCounts.collect().foreach(println)

Comparison between Scala APIs vs PySpark – Apache Spark with Scala

2) PySpark:

input = ["Apache Spark is great", "Scala is powerful", "Apache Spark With Scala", "Spark and Scala", "Spark Scala Tutorial", "Scala with Spark Tutorial", "Spark on Scala", "Apache Spark Architecture","Spark Scala Architecture", "Installing Apache Spark", "Scala Build Tool", "SBT"]

word_counts = sc.parallelize(input).flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

print(word_counts.collect())

As you can see, the Scala version is concise and expressive, leveraging Scala’s functional programming features to produce readable and efficient code.

Apache Spark with Scala: Architecture overview

To use Apache Spark with Scala effectively, you need to understand Spark’s architecture and how Scala interacts with its components. Spark’s architecture is designed for distributed computing, allowing it to process large amounts of data across a cluster of machines.

Apache Spark operates on a leader-worker architecture, with a driver program that manages task execution across a cluster of worker nodes.

Apache Spark architecture relies on two main abstractions:

1) Driver program

The driver program is the entry point of a Spark application. It’s responsible for:

Creating a SparkContext (or SparkSession in modern Spark)
Defining the operations to be performed on data
Submitting jobs to the cluster manager
Coordinating the execution of tasks on executors

2) Cluster manager

The cluster manager handles resource allocation across the cluster. Spark supports several cluster managers:

Standalone: Spark’s built-in cluster manager
Hadoop YARN: The resource manager in Hadoop 3
Kubernetes: The current industry-standard for containerized workloads

Note: Apache Mesos was deprecated as a cluster manager in Spark 3.2.0 and was officially retired in late 2025. YARN and Kubernetes are the practical choices today.

3) Executors

Executors are worker nodes in the Spark cluster. They:

Run tasks assigned by the driver program
Store data in memory or on disk
Return results to the driver

Execution flow of a Spark application

➥ App submission

When a Spark application is submitted, the driver program is launched, which communicates with the cluster manager to request resources.

➥ Job creation and DAG formation

The driver translates user code into jobs, breaking them into stages, each further divided into tasks. A Directed Acyclic Graph (DAG) is created to represent the task and stage dependencies.

➥ Stage division and task scheduling

The DAG scheduler breaks down the DAG into stages, and the task scheduler assigns tasks to executors based on resource availability and data locality.

➥ Task execution on worker nodes

Executors process tasks on worker nodes, execute the computation and return results to the driver, which aggregates and presents the final output.

How does Scala interact with Spark?

Scala’s interaction with Apache Spark is efficient because both run on the JVM. Scala code doesn’t require additional conversions or inter-process communication, which is exactly the overhead that PySpark incurs through its Python-to-JVM bridge (Py4J).

Direct compatibility with JVM

Scala runs on the Java Virtual Machine (JVM), just like Spark, which means Scala code does not require additional conversions or inter-process communication. This direct compatibility ensures that Spark applications written in Scala are executed with minimal overhead, leveraging Spark’s native capabilities for performance optimization. Because both Spark and Scala are JVM-based, the processing is generally faster and more efficient than with non-JVM languages like Python or R, which require additional steps for execution (such as serialization/deserialization through PySpark).

Direct API access

Spark’s core APIs are designed with Scala in mind:

Native access: Scala developers interact directly with Spark’s internals without additional abstraction layers
Type safety: Scala’s type system catches errors at compile time, reducing runtime failures
Functional programming support: Spark’s API aligns naturally with Scala’s functional paradigm

Type-Safe data manipulation

Scala interacts with Spark’s structured data APIs through Datasets and DataFrames:

Datasets: Strongly typed, compile-time safe data manipulation. Scala’s Datasets offer type inference and static analysis, catching potential errors before runtime. They’re useful when working with domain-specific objects.
DataFrames: Technically, a DataFrame in Spark is an alias for Dataset[Row], where Row is a generic, untyped JVM object. This unification simplifies the API while retaining performance benefits.

Functional programming paradigm

Scala’s functional programming features align well with Spark’s distributed computing model:

Higher-order functions like map, filter and reduce are efficiently parallelized across Spark’s distributed environment.

val numbers = spark.sparkContext.parallelize(1 to 1000000)
val sumOfSquares = numbers.map(x => x * x).reduce(_ + _)

Immutability: Scala’s emphasis on immutable data structures complements Spark’s fault-tolerant, distributed processing paradigm.
Pattern matching: Enables concise and expressive data transformations on complex nested structures.

def processData(data: Any): String = data match {
	 case s: String => s"String: $s"	 case i: Int if i > 0 => s"Positive Int: $i"	case _ => "Unknown type"}

Advanced performance optimizations

Since Spark is designed with Scala in mind, using Scala lets you tap into Spark’s internal optimizations: Catalyst (query optimization) and Tungsten (memory and CPU efficiency).

Custom serialization

Scala objects can be efficiently serialized for distribution across Spark clusters using Kryo serialization, which is faster and more compact than Java’s default serialization. To enable Kryo serialization:

val conf = new SparkConf().setAppName("MyApp")
	.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
spark = SparkSession.builder().config(conf).getOrCreate()

REPL integration

Scala’s Read-Eval-Print Loop (REPL) integrates seamlessly with Spark through the Spark Scala Shell, allowing for interactive data exploration and rapid prototyping of Spark jobs.

Here’s a very simple example of how you can use Scala to interact with Spark:

1) Creating a SparkSession

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("Spark with Scala App").getOrCreate()

Creating a SparkSession – Apache Spark with Scala

2) Creating an RDD

val rdd = spark.sparkContext.parallelize(List(1, 2, 3, 4, 5))

Creating an RDD – Apache Spark with Scala

3) Using map and reduce to process the RDD

val result = rdd.map(x => x * 2).reduce((x, y) => x + y)

Using map and reduce to process the RDD

4) Printing the result

println(result)

Printing the result – Apache Spark with Scala

5) Creating a DataFrame

val df = spark.createDataFrame(Seq(
	(1, "Elon", 45),
	(2, "Jeff", 32),
	(3, "Larry", 20),
	 (4, "Mark", 30)
)).toDF("id", "name", "age")

Creating a DataFrame – Apache Spark with Scala

6) Using filter and map to process the DataFrame

val resultDF = df.filter($"age" >= 30).map(row => row.getAs[String]("name"))

Using filter and map to process the DataFrame – Apache Spark with Scala

7) Printing the final result

resultDF.show()

Printing the final result – Apache Spark with Scala

Using Scala to interact with Spark - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Using Scala to interact with Spark – Spark Scala architecture – Apache Spark architecture

This seamless interaction allows Scala developers to leverage Spark’s full potential, making it the most efficient language for Spark-based applications.

Step-by-step guide to installing Apache Spark with Scala (Windows, Mac, Linux)

Now that we understand Spark’s architecture and its relationship with Scala, let’s set up a Spark environment with Scala.

Version compatibility (2026)

Before we start, here’s what you need to know about version compatibility:

Spark version	Java	Scala	Status
Spark 4.x	17 or 21	2.13 only	Latest stable (4.1.x as of 2026)
Spark 3.5.x	8, 11 or 17	2.12 or 2.13	LTS (maintained through Nov 2027)

For this guide, we’ll use Spark 3.5.x with Scala 2.12 since it’s the LTS (long-term support) branch and the most widely deployed in production. We’ll note where Spark 4.x differs.

Heads up on Java 8: Java 8 prior to version 8u371 is deprecated as of Spark 3.5.0. If you’re using Java 8, make sure you’re on a recent update. For new projects, Java 17 is the recommended choice for Spark 3.5.x.

Prerequisites

Before we start, make sure you have the following installed:

Java Development Kit (JDK) 8 or 11 (Spark 3.x is compatible with both).
A text editor or IDE (VScode or IntelliJ IDEA with the Scala plugin is the most popular choice for Spark/Scala development)

1) Installing Apache Spark with Scala on Windows

Windows 10 Logo (Sources: Windows) – Installing Apache Spark on Windows

Step 1—Install Java Development Kit (JDK)

First, download the JDK from the Oracle website or use OpenJDK.

Installing Java Development Kit - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Installing Java Development kit – Installing Apache Spark – Installing Spark on Windows

Run the installer and follow the prompts to complete the installation.

Running Java Development Kit Installer - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Running Java JInstaller – Installing Apache Spark – Installing Spark on Windows

Now, set the JAVA_HOME environment variable:

Right-click on “My Computer “This PC” > Properties > Advanced system settings > Environment Variables
Add a new system variable JAVA_HOME and set it to your JDK installation path (e.g., C:Program FilesJavajdk….)

To verify the installation, open Command Prompt and type:

java --version

Step 2—Install Scala

Next, download Scala from the official Scala website, or use scoop or choco (Windows package managers) to install it.

Downloading and installing Scala - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Downloading and installing Scala – Installing Apache Spark – Installing Spark on Windows

Run the installer and type “Y” to add it to the PATH. And, to verify the installation, open the command prompt and type:

scala -version

Verifying Scala installation – Installing Apache Spark – Installing Apache Spark on Windows

Step 3—Install Apache Spark

Go to the Apache Spark download page and select a pre-built version for Hadoop. Download the .tgz file.

Downloading and installing Apache Spark - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Downloading & installing Spark – Installing Apache Spark – Installing Spark on Windows

Create a directory for Spark, e.g., C:\spark and extract the downloaded Spark file into this directory. Verify the extraction and check that the bin folder contains the Spark binaries.

Step 4—Install winutils

Download winutils.exe from the steveloughran/winutils repository, matching your Hadoop version (Hadoop 3 for Spark 3.5.x). Create a folder C:\winutils\bin and place winutils.exe inside it.

Set environment variables:

SPARK_HOME: Set it to the path of your extracted Spark folder, e.g., C:\spark\spark-3.5.4-bin-hadoop3
HADOOP_HOME: Set it to C:\winutils

Configuring Environment variables - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Configuring Environment variables – Installing Apache Spark – Installing Spark on Windows

Next, add HADOOP_HOME by creating another variable named HADOOP_HOME and set its value to C:winutils.

Once these variables are set, update the system’s Path variable by editing it and adding the following entries:

%SPARK_HOME%bin
%HADOOP_HOME%bin

Step 5—Verify Spark with Scala installation

To verify Spark with Scala installation, head over to your Command Prompt and type:

spark-shell

This command will start the Scala REPL with Spark. You should see the Spark logo and a Scala prompt.

Verifying Spark with Scala installation - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Verifying Spark with Scala installation – Installing Spark – Installing Spark on Windows

Test a basic Spark operation:

val data = spark.range(1, 100)
data.filter(_ % 2 == 0).count()

Testing a basic Apache Spark with Scala operation – Installing Apache Spark on Windows

Testing a basic Spark with Scala operation - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Testing a basic Spark with Scala operations – Installing Spark – Installing Spark on Windows

2. Installing Apache Spark with Scala on Mac

macOS Logo (Source: Apple) – Installing Apache Spark on Windows

Step 1—Install Homebrew (if not already installed)

Open your terminal and run the following command to install Homebrew:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Step 2—Install Java Development Kit (JDK)

Install OpenJDK using Homebrew:

brew install openjdk

Follow the instructions in the terminal to add Java to your PATH.

After installation, verify that Java is installed correctly by checking the version:

java -version

Step 3—Install Scala

Install Scala using Homebrew:

brew install scala

Verify the installation:

scala -version

Verifying Scala versions - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Verifying Scala versions – Installing Apache Spark – Installing Apache Spark on Windows

Step 4—Install Apache Spark

Install Spark using Homebrew:

brew install apache-spark

Step 5—Set environment variables

Open your shell configuration file (e.g., ~/.zshrc or ~/.bash_profile):

open -e ~/.bash_profile

Add the following configuration to the file. Note that Homebrew uses different paths on Apple Silicon (M1/M2/M3) Macs vs Intel Macs.

export SPARK_HOME=/usr/local/Cellar/apache-spark/3.x.x/libexec
export PATH=$PATH:$SPARK_HOME/bin

Replace 3.x.x with your installed Spark version.

Save the file and reload the environment:

source ~/.bash_profile

Step 6—Verify Apache Spark with Scala installation

Open a new terminal window and run:

spark-shell

This will start the Scala REPL with Spark.

Installing Apache Spark with Scala on Linux (Ubuntu, Debian, CentOS)

Linux Logo (Source: Linux) – Installing Spark with Linux

Step 1—Install Java (OpenJDK via the package manager)

Open a terminal and update the package manager and install Java runtime environment:

sudo apt update
sudo apt install default-jre

Verify Java installation:

java -version

Step 2—Install Scala

Install Scala using the package manager:

sudo apt install scala

Verify the installation:

scala -version

Note: The apt package for Scala may be outdated on some Linux distributions. If you need a specific Scala version (e.g., 2.12.18 for Spark 3.5.x compatibility), use SDKMAN! instead:

curl -s "https://get.sdkman.io" | bash

source "$HOME/.sdkman/bin/sdkman-init.sh"

sdk install scala 2.12.18

Step 3—Install Apache Spark

Download Spark:

wget https://downloads.apache.org/spark/spark-3.x.x/spark-3.x.x-bin-hadoop3.tgz

Replace 3.x.x with the latest version number.

Extract the Spark archive:

tar xvf spark-3.x.x-bin-hadoop3.tgz
sudo mv spark-3.x.x-bin-hadoop3 /opt/spark

Step 5—Set environment variables

Open your shell configuration file (e.g., ~/.bashrc). Add the following lines:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin

Save the file and reload it:

source ~/.bashrc

Step 6—Verify Apache Spark with Scala installation

To verify Spark with Scala installation, open a new Terminal window and run the following command:

spark-shell

This will start the Scala REPL with Spark.

Getting started with Apache Spark with Scala

Now that we have Spark and Scala installed, let’s dive into some basic operations to get you started with Spark programming in Scala.

Step 1—Create a SparkSession

SparkSession is the entry point for programming Spark with the Dataset and DataFrame API. Here’s how to create one:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
	 .appName("Getting Started")
	.master("local[*]")
	 .getOrCreate()

Step 2—Basic RDD operations

RDDs (Resilient Distributed Datasets) are the fundamental data structure of Spark. Let’s perform some basic operations:

1) Creating an RDD

val data = sc.parallelize(1 to 1000)

2) Transformation — Filtering all even numbers

val evenNumbers = data.filter(_ % 2 == 0)

3) Action — Counting elements

println(s"Number of even numbers: ${evenNumbers.count()}")

Performing basic RDD (Resilient Distributed Datasets) operations - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Performing basic RDD operations – Apache Spark with Scala – Spark Scala tutorial

4. Transformation — Squaring each number

val squared = evenNumbers.map(x => x * x)

5. Action — Collecting the results

val result = squared.take(10)
println(s"First 10 squared even numbers: ${result.mkString(", ")}")

Step 3—Introduction to DataFrames and Datasets

DataFrames and Datasets are distributed collections of data organized into named columns. They provide a domain-specific language for structured data manipulation:

First, let’s create a DataFrame from a range.

val df = spark.range(1, 1000).toDF("number")

Then, let’s perform operations using DataFrame API.

val evenDF = df.filter($"number" % 2 === 0)
evenDF.show(5)

Performing operations using DataFrame API in Spark with Scala - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Performing operations using DataFrame API in Spark- Spark with Scala – Spark Scala tutorial

Now, here is how you could create a Dataset on Spark with Scala.

case class Person(name: String, age: Int)
val peopleDS = Seq(Person("Elon", 25), Person("Mark", 30), Person("Jeff", 20), Person("Larry", 40), Person("David", 50), Person("George", 30)).toDS()
peopleDS.show()

Creating a Dataset on Spark with Scala - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Creating a Dataset on Spark with Scala – Apache Spark with Scala – Spark Scala tutorial

Step 4—Spark SQL basics

Spark SQL lets you write SQL queries against your DataFrames:

1) Registering the DataFrame as a SQL temporary view

df.createOrReplaceTempView("numbers")

2) Run a SQL query

val sqlDF = spark.sql("SELECT * FROM numbers WHERE number % 2 = 0 LIMIT 5")
sqlDF.show()

Registering the DataFrame as a SQL temporary view and running the query - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Registering the DataFrame as a SQL temp view and running the query – Spark with Scala

Step 5—Reading and writing data (various formats: CSV, JSON, Parquet)

Spark can read and write data in various formats. Let’s look at CSV, JSON and Parquet:

// Read CSV
val csvDF = spark.read
	 .option("header", "true")
	.csv("path/to/your/file.csv")

// Read JSON
val jsonDF = spark.read.json("path/to/your/file.json")

// Read Parquet
val parquetDF = spark.read.parquet("path/to/your/file.parquet")

// Write DataFrame to Parquet
evenDF.write.parquet("path/to/output/directory")

This concludes our initial exploration of Spark with Scala. Now, in the next section, we’ll dive into more detailed and more hands-on examples of data analysis using these concepts.

Hands-on example: Step-by-step data analysis using Spark with Scala

Now that we’ve covered the basics, let’s work through a practical example of data analysis using Spark with Scala. We’ll analyze a dataset of sales transactions to derive some insights.

Step 1—Set up your environment

First, make sure you have Spark with Scala installed as per the instructions in the previous section. Open your terminal and launch the spark-shell:

spark-shell

This command starts an interactive Scala shell with sc (SparkContext) and spark (SparkSession) pre-configured for immediate use.

Step 2—Create a new Scala project

In spark-shell, you already have a SparkSession instance (spark) automatically created for you. If you’re working in a different environment, you can initiate a Spark session using the following code:

val spark = SparkSession.builder
	.appName("Spark with Scala Example")
	.master("local[*]")
	 .getOrCreate()

The master(“local[*]”) setting tells Spark to run locally using all available CPU cores.

To load the Scala file or want to execute the block of code in the Spark shell, you can load the entire file and execute it to do so, you can type in the following command:

:load <your-file-name>.scala

Step 3—Load data into Spark

Let’s load some sample data representing sales information for a retail store:

import spark.implicits._
import org.apache.spark.sql.functions._

val data = Seq(

  ("Elon Musk",                    52, "Technology",         2200),

  ("Jeff Bezos",                   59, "E-commerce",         2000),

  ("Bernard Arnault",              74, "Luxury Goods",       1800),

  ("Bill Gates",                   68, "Philanthropy",       1500),

  ("Warren Buffett",               93, "Investments",        1300),

  ("Larry Ellison",                79, "Software",           1600),

  ("Larry Page",                   50, "Technology",         1400),

  ("Sergey Brin",                  49, "Technology",         1400),

  ("Mark Zuckerberg",              39, "Social Media",       1000),

  ("Steve Ballmer",                67, "Technology",          900),

  ("Carlos Slim",                  83, "Telecommunications", 1100),

  ("Mukesh Ambani",                66, "Energy",             1200),

  ("Francoise Bettencourt Meyers", 70, "Cosmetics",           800),

  ("Amancio Ortega",               87, "Fashion",             950),

  ("Gautam Adani",                 61, "Energy",             1300)

)


val columns = Seq("Name", "Age", "Category", "Amount")
val df = data.toDF(columns: _*)
df.show()

Here, each tuple represents a customer’s name, age, category (the department they bought from) and total amount spent. The toDF method converts this into a DataFrame with defined column names.

Note: The import spark.implicits._ is required for the $”column” column notation and .toDF() conversions. The import org.apache.spark.sql.functions._ brings in Spark SQL functions like when, avg and sum. Always include these imports at the top of your script.

When you run df.show(), Spark prints the following output:

Loading data into Spark – Apache Spark With Scala – Data Analysis With Scala

Printing and displaying the result - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Printing and displaying the result – Spark with Scala – Data analysis with Scala

Step 4—Data exploration and cleaning

Before any analysis, filter out rows where the Amountis below a threshold. Here we’ll keep only customers who spent 1000 or more

val filteredDF = df.filter($"Amount" >= 1000)
filteredDF.show()

As you can see this code filters out customers who spent less than 1000.

Data exploration and data cleaning - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Data exploration and data cleaning – Apache Spark with Scala – Data Analysis with Scala

Step 5—Prepare and transform data

Next, let’s transform the dataset. We’ll add a new column called DiscountedAmount, which applies a 10% discount for amounts greater than 1100.

val transformedDF = filteredDF.withColumn("DiscountedAmount", 
	when($"Amount" > 1100, $"Amount" * 0.9).otherwise($"Amount"))
transformedDF.show()

As you can see, this creates a new column with discounted values where applicable:

Preparing and transforming data - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Preparing and transforming data – Apache Spark with Scala – Data analysis with Scala

Step 6—Perform grouped analysis

Let’s group the data by Category and calculate the average amount spent per category:

val avgAmountDF = transformedDF.groupBy("Category").agg(avg("Amount").as("AvgAmount"))
avgAmountDF.show()

The output would look like this:

Performing grouped analysis - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Performing grouped analysis – Apache Spark with Scala – Data analysis with Scala

This provides insight into the spending behavior across different categories.

Step 7—Perform analysis with Spark SQL

You can also register the DataFrame as a temporary view and use SQL to query the data. This is especially useful for more complex queries.

df.createOrReplaceTempView("sales")
val sqlResult = spark.sql("SELECT Category, COUNT(*) AS TotalCustomers, SUM(Amount) AS TotalSales FROM sales GROUP BY Category")
sqlResult.show()

This SQL query counts the total number of customers and sums the total sales per category:

Performing analysis with Spark SQL - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Performing analysis with Spark SQL – Apache Spark with Scala – Data Analysis with Scala

Step 8—Save processed data (Parquet, CSV, JSON)

Finally, you can save the processed DataFrame in multiple formats. For example, let’s save the final DataFrame in Parquet format:

transformedDF.write.parquet("path/to/output/output_parquet")

Bonus: Configuring a Spark Project with sbt (Scala Build Tool)

Scala Build Tool (sbt) is the de facto build tool for Scala projects. It’s particularly useful for managing dependencies and building Spark applications. Let’s walk through the process of setting up a Spark project with sbt.

What is sbt (Scala Build Tool)?

sbt (Scala Build Tool) is a powerful and fully open source build tool for Scala and Java projects. It lets developers manage dependencies, compile code, run tests and package applications efficiently through a simple configuration file (build.sbt).

sbt is similar in concept to tools like Maven and Gradle. However, sbt is specifically optimized for Scala’s requirements, making it the default tool for Scala developers. Its key features include:

sbt (Scala Build Tool) simplifies managing dependencies, including external libraries like Spark, Hadoop and more.
sbt (Scala Build Tool) continuously compiles and tests your code as you make changes, making development more efficient.
sbt (Scala Build Tool) allows easy packaging of your application into JAR files, which you can deploy in different environments (e.g., local, cluster).
sbt (Scala Build Tool) integrates smoothly with most major IDEs, like VScode, IntelliJ IDEA and more, enhancing the development experience.

Prerequisites

Make sure you have the following installed:

Java 17 JDK or later (Spark relies on the JVM)
Scala 2.12.x (for Spark 3.5.x compatibility; use 2.13.x for Spark 4.x)
sbt (latest stable version)
Apache Spark 3.5.x (compatible with Scala 2.12.x)
IntelliJ IDEA with the Scala plugin (recommended)

Optionally, you can use Docker for a consistent development environment, especially if you need to switch between different versions of Spark, Scala, or Java.

Step 1—Install sbt

First, start by installing sbt. Use the following commands based on your operating system:

For macOS (using SDKMAN! is recommended):

SDKMAN!

curl -s "https://get.sdkman.io" | bash

source "$HOME/.sdkman/bin/sdkman-init.sh"

sdk install sbt

Homebrew:

brew install sbt

For Windows:

Download the installer files:

Chocolatey:

> choco install sbt

Scoop:

> scoop install sbt

For Linux:

Linux (deb):

echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | sudo tee /etc/apt/sources.list.d/sbt.list
curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | sudo apt-key add
sudo apt-get update
sudo apt-get install sbt

Linux (rpm):

curl -sL https://www.scala-sbt.org/sbt-rpm.repo | sudo tee /etc/yum.repos.d/sbt-rpm.repo

sudo yum install sbt

Verify the installations by running the following commands in your terminal:

java -version
scala -version
sbt -version

Verifying Java, Scala and sbt installations - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Verifying Java, Scala and sbt installations – Scala Build Tool – sbt for Scala

Step 2—Setting up the project structure

Now, let’s start by creating the basic structure for the Spark project. Open a terminal and create a new directory for the project:

mkdir SparkWithScalaProject
cd SparkWithScalaProject

Inside this directory, you will need to create a typical sbt project structure:

mkdir -p src/main/scala
mkdir -p src/test/scala
mkdir -p src/main/resources

This creates a standard Scala project structure with separate directories for main and test code.

Step 3—Create the build.sbt file

In the root of the ‘SparkWithScalaProject’ directory, create a file named build.sbt and add the following content:

name := "SparkWithScalaProject"

organization := "com.example"

version := "0.1"

scalaVersion := "2.12.18"

val sparkVersion = "3.5.4"

libraryDependencies ++= Seq(

  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",

  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"

)

// Fork a new JVM when running Spark jobs to avoid classpath conflicts

fork := true

// JVM options for running Spark locally

javaOptions ++= Seq(

  "-Xms512m",

  "-Xmx2048m",

  "-XX:+UseG1GC"

)

A few things to note here. The % “provided” scope marks Spark dependencies as “provided” because Spark is already on the classpath at runtime in a cluster environment. This keeps your fat JAR lean. When testing locally with sbt run, you’ll need to remove % “provided” or configure sbt to include provided dependencies on the run classpath. We’ll address that in the assembly step.

Step 4—Create build.properties file

Create a project/build.properties file to specify the sbt version:

mkdir project
echo 'sbt.version=1.10.2' > project/build.properties

Step 5—Write your first Scala code

Create a simple Scala file named SparkWithScalaMain.scala in src/main/scala/com/example with the following content:

object SparkWithScalaMain extends App {
	 println("Spark with Scala!")
}

Step 6—Compile and run the project locally

Start the sbt console:

sbt

In the sbt console, compile your project:

compile

Run your main object:

run

Compiling and running the project locally using sbt - Apache Spark With Scala - Spark With Scala - Spark and Scala - Spark Scala Tutorial - Spark on Scala - Spark Architecture - Apache Spark Architecture - Spark Scala Architecture - Installing Apache Spark - Installing Spark on Mac - Installing Spark on Windows - Installing Spark on Linux - Scala Build Tool - SBT for Scala - SBT build - Data Analysis With Scala — Compiling and running the project locally using sbt – Scala Build Tool – sbt for Scala

Exit the sbt console with:

exit

Step 7—Add data for processing

Now, we will create sales data to analyze sales data for a store. Create a CSV file named sales_data.csv inside a new folder called data in your project directory:

mkdir data

Add the following content to sales_data.csv:

TransactionID,Product,Quantity,Price,Date
1,Smartphone,2,699.99,2024-01-01
2,Washing Machine,1,499.99,2024-01-02
3,Laptop,1,999.99,2024-01-03
4,Tablet,3,299.99,2024-01-04
5,Headphones,5,99.99,2024-01-05
6,Smartwatch,2,249.99,2024-01-06
7,Refrigerator,1,899.99,2024-01-07
8,Television,2,1099.99,2024-01-08
9,Vacuum Cleaner,3,149.99,2024-01-09
10,Microwave,2,199.99,2024-01-10
11,Blender,4,59.99,2024-01-11
12,Air Conditioner,1,799.99,2024-01-12
13,Camera,1,549.99,2024-01-13
14,Speaker,3,199.99,2024-01-14
15,Smartphone,1,699.99,2024-01-15
16,Washing Machine,1,499.99,2024-01-16
17,Television,1,1099.99,2024-01-17
18,Laptop,1,1999.99,2024-01-18
19,Microwave,2,199.99,2024-01-19
20,Blender,3,59.99,2024-01-20
21,Camera,2,549.99,2024-01-21
22,Smartwatch,1,249.99,2024-01-22
23,Headphones,4,99.99,2024-01-23
24,Air Conditioner,1,799.99,2024-01-24
25,Vacuum Cleaner,2,149.99,2024-01-25
26,Smartphone,1,699.99,2024-01-26
27,Laptop,1,999.99,2024-01-27

Step 8—Update build.sbt for Spark dependencies

Make sure your build.sbt includes Spark dependencies if not already done:

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.0"libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0"

Step 9—Implement your Spark job logic

Create a package object by adding a file named package.scala in src/main/scala/com/example/sparktutorial with functions to create a Spark session and parse command-line arguments.

package com.example 

import org.apache.spark.sql.SparkSession 

package object sparktutorial { 
	 def createSparkSession(appName: String): SparkSession = { 
		 SparkSession.builder()
			 .appName(appName)
			.master("local[*]")
			 .config("spark.sql.caseSensitive", value = true)
			 .config("spark.sql.session.timeZone", value = "UTC")
			 .getOrCreate()
	 }

	def parseArgs(args: Array[String]): (String) = { 
		 require(args.length == 1,"Expecting input path as argument") 
		val inputPath = args(0) 
		println(s"Input path: $inputPath") 
		 inputPath 
	 }
}

Then create an analysis file named SalesAnalysis.scala in the same package with the following code to analyze the sales data:

package com.example.sparktutorial 

import org.apache.spark.sql.{DataFrame,SparkSession} 
import org.apache.spark.sql.functions._ 

object Analysis { 
	 def analyzeSalesData(spark: SparkSession,inputPath: String): DataFrame = { 
		// Read sales data from CSV file into DataFrame 
		val salesDF = spark.read.option("header","true").csv(inputPath) 
		
		// Calculate total sales per product and add it as a new column 'TotalSales' 
		 val resultDF = salesDF.withColumn("TotalSales",col("Quantity").cast("int") * col("Price").cast("double")) 
			 .groupBy("Product") 
			 .agg(sum("TotalSales").alias("TotalSales")) 
			 .orderBy(desc("TotalSales")) 
		
		resultDF.show() // Display result in console 
		 
		 // Return result DataFrame for further processing if needed 
		 resultDF 
	} 
}

Step 10—Update the main object to call the analysis logic

Update your SparkWithScalaMain.scala to include logic to call the analysis function:

package com.example.sparktutorial 

import org.apache.spark.sql.SparkSession 

object SparkWithScalaMain extends App { 
	 // Create Spark session 
	 val spark = createSparkSession("Spark Sales Analysis") 
	
	// Specify input path for sales data CSV file 
	val inputPath = args(0) // Pass input path as command line argument 
	
	// Call analysis function 
	Analysis.analyzeSalesData(spark,inputPath) 
	 
	 // Stop the Spark session 
	 spark.stop() 
}

Step 11—Compile and run Your Spark job locally

Compile and run the project using sbt, passing the input path of the CSV file:

sbt "run data/sales_data.csv"

This command will read from sales_data.csv, perform the analysis defined in SalesAnalysis.scala and display results in the console.

Step 12—Configure logging

Create a log4j2.properties file in src/main/resources to manage logging output:

status = error 

name = PropertiesConfig 

appender.console.type = Console 

appender.console.name = ConsoleAppender 

appender.console.layout.type = PatternLayout 

appender.console.layout.pattern = %d{HH:mm:ss.SSS} [%t] %-5level %msg%n 

rootLogger.level = INFO 

rootLogger.appenderRefs = console 

rootLogger.appenderRef.console.ref = ConsoleAppender

Step 13—Package your application as a JAR

Use the following SBT command to package your project into a JAR file:

sbt package

This will generate a JAR file in the target/scala-2.13/ directory (depending on your Scala version). However, this JAR will not include any dependencies.

Step 14—Packaging using sbt-assembly to create a fat JAR

To package your project along with its dependencies into a single standalone JAR (often referred to as a fat JAR), you’ll need the sbt-assembly plugin.

Add the sbt-assembly plugin to your project by creating a project/plugins.sbt file with the following content:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.1.0")

Next, configure sbt-assembly in your build.sbt file:

assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

mainClass in assembly := Some("com.example.sparktutorial.SparkWithScalaMain")

Finally, run the following command to generate a fat JAR:

sbt assembly

This will produce a JAR file with all dependencies bundled, which you can run or distribute as needed.

Step 15—Submit your JAR to Spark cluster

Use the following command to submit your JAR file to your Spark cluster (replace <master-url> with your actual master URL):

spark-submit 
--class com.example.sparktutorial.SparkWithScalaMain 
--master <master-url> 
target/scala-2.12/sparkexample-assembly-0.1.jar 
data/sales_data.csv

Replace <master-url> with your Spark cluster’s master URL (e.g., yarn for YARN cluster, spark://host:port for standalone cluster). And, make sure sparkexample-assembly-0.1.jar matches the exact artifact name generated by sbt-assembly.

Follow these steps and you’ll be all set to create, develop and run Spark projects in Scala using sbt.

Conclusion

Apache Spark with Scala offers a powerful toolset for big data processing and analysis. Throughout this guide, we’ve explored the symbiotic relationship between Spark and Scala, delved into Spark’s architecture and walked through the process of setting up a Spark environment and building Spark applications using Scala.

In this article, we’ve covered:

Fundamentals of Apache Spark and its ecosystem
Why Spark has become a go-to solution for big data processing
The benefits of using Scala with Spark
Setting up a Spark environment on Windows, Mac and Linux
Basic and advanced Spark operations using Scala
A hands-on data analysis example using Spark with Scala
Configuring and managing Spark projects using sbt

… and so much more!

Want to learn more? Reach out for a chat

FAQs

What is Apache Spark?

Apache Spark is an open-source, distributed computing system designed for fast and general-purpose data processing at scale. It was developed to overcome the limitations of Hadoop MapReduce, with a focus on in-memory processing for iterative algorithms and interactive queries.

How does Apache Spark achieve fault tolerance?

Spark uses Resilient Distributed Datasets (RDDs), which maintain lineage information that allows the system to automatically reconstruct lost data partitions after node failures.

What is lazy evaluation in Spark?

Lazy evaluation means Spark doesn’t execute operations until an action is called. This lets Spark build an optimized execution plan before running any computation, reducing unnecessary work.

What role does the driver program play in a Spark application?

The driver program creates a SparkContext, defines operations on data, submits jobs to the cluster manager and coordinates task execution on executors.

How do executors function within a Spark cluster?

Executors are processes running on worker nodes that execute tasks assigned by the driver, store data in memory or on disk and return results to the driver.

What is Scala with Spark?

Scala is the native language for Spark. It offers deep integration with Spark’s internals, compile-time type safety and performance optimizations that other languages can’t match without additional overhead.

How do I run Scala in Spark?

You can use spark-shell for interactive exploration or create a full project with sbt for production-level code, then submit it with spark-submit.

How do you verify your installation of Spark with Scala?

Run spark-shell in your command line. You should see the Spark logo and a scala> prompt.

What’s the difference between using Scala APIs and PySpark?

Scala APIs offer better performance and type safety because both Scala and Spark run on the JVM with no serialization overhead. PySpark is easier for Python developers but slower due to the Py4J bridge between Python and the JVM.

Can you use Scala 3 with Spark?

Spark 3.5.x officially supports Scala 2.12 and 2.13. Spark 4.x uses Scala 2.13. Full Scala 3 support is not yet available as of 2026. Check the Apache Spark downloads page for the latest compatibility information before choosing a Scala version.

What Spark version should I use in 2026?

For new production work, Spark 3.5.x remains the recommended LTS choice (maintained through November 2027). Spark 4.x is the latest stable release and requires Java 17 or 21 and Scala 2.13. If you’re starting a brand new project and don’t need LTS guarantees, Spark 4.x is worth considering.

Is Scala better than Python for Spark?

Scala provides better raw performance and type safety. Python has a richer data science library ecosystem, making it more popular for exploratory work. For production pipelines where performance matters, Scala is the stronger choice.

Request a demo

FinOps

Apache Spark with Scala 101: Hands-on data analysis guide (2026)

What is Apache Spark?

Why Spark?

Spark and Scala: A symbiotic relationship

Features and benefits of using Spark with Scala

Scala APIs vs PySpark, SparkR and Java APIs

Apache Spark with Scala: Architecture overview

How does Scala interact with Spark?

Direct compatibility with JVM

Direct API access

Type-Safe data manipulation

Functional programming paradigm

Advanced performance optimizations

Custom serialization

REPL integration

Step-by-step guide to installing Apache Spark with Scala (Windows, Mac, Linux)

Prerequisites

1) Installing Apache Spark with Scala on Windows

Step 1—Install Java Development Kit (JDK)

Step 2—Install Scala

Step 3—Install Apache Spark

Step 4—Install winutils

Step 5—Verify Spark with Scala installation

2. Installing Apache Spark with Scala on Mac

Step 1—Install Homebrew (if not already installed)

Step 2—Install Java Development Kit (JDK)

Step 3—Install Scala

Step 4—Install Apache Spark

Step 5—Set environment variables

Step 6—Verify Apache Spark with Scala installation

Installing Apache Spark with Scala on Linux (Ubuntu, Debian, CentOS)

Step 1—Install Java (OpenJDK via the package manager)

Step 2—Install Scala

Step 3—Install Apache Spark

Step 5—Set environment variables

Step 6—Verify Apache Spark with Scala installation

Getting started with Apache Spark with Scala

Step 1—Create a SparkSession

Step 2—Basic RDD operations

Step 3—Introduction to DataFrames and Datasets

Step 4—Spark SQL basics

Step 5—Reading and writing data (various formats: CSV, JSON, Parquet)

Hands-on example: Step-by-step data analysis using Spark with Scala

Step 1—Set up your environment

Step 2—Create a new Scala project

Step 3—Load data into Spark

Step 4—Data exploration and cleaning

Step 5—Prepare and transform data

Step 6—Perform grouped analysis

Step 7—Perform analysis with Spark SQL

Step 8—Save processed data (Parquet, CSV, JSON)

Bonus: Configuring a Spark Project with sbt (Scala Build Tool)

What is sbt (Scala Build Tool)?

Step 1—Install sbt

Step 2—Setting up the project structure

Step 3—Create the build.sbt file

Step 4—Create build.properties file

Step 5—Write your first Scala code

Step 6—Compile and run the project locally

Step 7—Add data for processing

Step 8—Update build.sbt for Spark dependencies

Step 9—Implement your Spark job logic

Step 10—Update the main object to call the analysis logic

Step 11—Compile and run Your Spark job locally

Step 12—Configure logging

Step 13—Package your application as a JAR

Step 14—Packaging using sbt-assembly to create a fat JAR

Step 15—Submit your JAR to Spark cluster

Further reading

Conclusion

FAQs

What is Apache Spark?

How does Apache Spark achieve fault tolerance?

What is lazy evaluation in Spark?

What role does the driver program play in a Spark application?

How do executors function within a Spark cluster?

What is Scala with Spark?

How do I run Scala in Spark?

How do you verify your installation of Spark with Scala?