Spark Scala + IntelliJ on Windows: Step-by-Step Guide to Writing to Hive
Apache Spark is a powerful distributed computing framework, and when combined with Scala, it provides a robust environment for big data processing. This guide will walk you through setting up Spark with Scala in IntelliJ on Windows and writing data to Apache Hive.
Watch on YouTube: Spark Scala + IntelliJ on Windows
Prerequisites
Before proceeding, ensure you have the following installed:
- Java (JDK 8 or later) – Required for Spark execution
- Scala (2.12.x, matching the Spark 3.2.x build used below) – Used for Spark programming
- Apache Spark – A powerful data processing engine
- IntelliJ IDEA (Community or Ultimate Edition) – IDE for Scala development
- SBT (Scala Build Tool) – To manage dependencies
- Hadoop & Hive – Hadoop for the underlying filesystem utilities, Hive for SQL-like querying
Step 1: Install and Configure Java & Scala
- Download and install Java JDK 8 or later from Oracle’s website.
- Set `JAVA_HOME` in your environment variables.
- Download and install Scala from Scala's official website.
- Verify the installation by running:

```
java -version
scala -version
```
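If both commands print version information, the toolchain is in place. As an optional extra check of my own, you can paste a one-liner into the Scala REPL (started with the `scala` command):

```scala
// Confirms Scala and the underlying JVM are wired together correctly.
println(s"Scala ${util.Properties.versionString} on Java ${System.getProperty("java.version")}")
```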
Step 2: Install and Set Up Apache Spark
- Download Apache Spark from Spark’s official website.
- Extract the Spark folder and set environment variables:
  - `SPARK_HOME` → path to the extracted Spark folder
  - Add `%SPARK_HOME%\bin` to the system `PATH`
- Verify the Spark installation:

```
spark-shell
```
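When spark-shell starts, it creates a `spark` session for you. A quick way to confirm that jobs actually execute is a one-liner of my own, run inside the shell:

```scala
// Run inside spark-shell: executes a real local job; the sum of 0..99 is 4950.
println(spark.range(100).selectExpr("sum(id)").first().getLong(0))
```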
Step 3: Install IntelliJ IDEA and Set Up Scala Plugin
- Download and install IntelliJ IDEA from JetBrains.
- Open IntelliJ, navigate to File → Settings → Plugins, search for Scala, and install it.
Step 4: Create a Spark Scala Project with SBT
- Open IntelliJ and select New Project.
- Choose Scala and select SBT as the build tool.
- Set Project SDK to JDK 8 or later.
- Click Finish to create the project.
- Modify `build.sbt` to include the Spark dependencies (note that `spark-hive` is needed for `enableHiveSupport()` in the code later):

```scala
name := "SparkHiveExample"
version := "1.0"
scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.1",
  "org.apache.spark" %% "spark-sql"  % "3.2.1",
  "org.apache.spark" %% "spark-hive" % "3.2.1", // required for enableHiveSupport()
  "org.apache.hive"  %  "hive-jdbc"  % "3.1.2"
)
```
- Click Refresh on the SBT panel to download dependencies.
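Before adding Hive to the mix, it can help to confirm that the dependencies resolved and Spark runs inside the project. The sketch below is my own smoke test, not part of the final example; the object name and `local[*]` master are arbitrary choices:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical smoke test: verifies spark-core and spark-sql resolved via SBT.
object SparkSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Smoke Test")
      .master("local[*]") // run in-process; no cluster required
      .getOrCreate()

    // Sum of 0..99; printing 4950 means the job ran end to end.
    println(spark.range(100).selectExpr("sum(id)").first().getLong(0))

    spark.stop()
  }
}
```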
Step 5: Configure Hive and Spark Integration
- Install Hadoop and Hive:
- Download Hadoop and Hive binaries.
- Set `HADOOP_HOME` and `HIVE_HOME` in your environment variables.
- Configure `hive-site.xml` inside Spark's `conf/` directory:

```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```
- Start the Hive metastore in its own terminal and leave it running (the trailing `&` backgrounds the process only on Unix-style shells, not in the Windows command prompt):

```
hive --service metastore &
```
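With the metastore up, you can verify connectivity from Spark before writing any data. This is a sketch of my own, assuming the thrift URI from `hive-site.xml` above; setting it programmatically also guards against the file not being picked up:

```scala
import org.apache.spark.sql.SparkSession

// Connectivity check only: lists the databases the Hive metastore knows about.
object HiveConnectivityCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Hive Connectivity Check")
      .master("local[*]")
      .config("hive.metastore.uris", "thrift://localhost:9083") // assumed URI from hive-site.xml
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show() // should list at least `default`
    spark.stop()
  }
}
```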
Step 6: Write Data to Hive Using Spark
Create a Scala object `SparkHiveExample` in a file named `SparkHiveExample.scala` (under `src/main/scala`) and add the following code:
```scala
import org.apache.spark.sql.SparkSession

object SparkHiveExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Spark Hive Example")
      .master("local[*]") // needed when launching with `sbt run` instead of spark-submit
      .config("spark.sql.catalogImplementation", "hive")
      .enableHiveSupport()
      .getOrCreate()

    import spark.implicits._

    // Create a DataFrame
    val df = Seq((1, "Alice"), (2, "Bob"), (3, "Charlie"))
      .toDF("id", "name")

    // Write the DataFrame to a Hive table
    df.write.mode("overwrite").saveAsTable("users")

    // Read it back from Hive
    val dfRead = spark.sql("SELECT * FROM users")
    dfRead.show()

    spark.stop()
  }
}
```
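One detail worth noting: `mode("overwrite")` replaces the table on every run. If you want reruns to accumulate rows instead, a small variation works (same session and implicits as above; the extra row is just for illustration):

```scala
// Append instead of overwrite, so repeated runs add rows rather than replace the table.
val moreUsers = Seq((4, "Dana")).toDF("id", "name")
moreUsers.write.mode("append").saveAsTable("users")
```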
Step 7: Run the Spark Hive Application
- Open a terminal in IntelliJ and navigate to the project directory.
- Compile and run the project:

```
sbt package
sbt run
```
- Open Hive and verify the table:

```sql
SELECT * FROM users;
```
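If you prefer to verify from Spark rather than the Hive CLI, the same check can run through `spark.sql` in the session from Step 6:

```scala
// Equivalent verification from Spark; the example wrote three rows.
spark.sql("SELECT COUNT(*) FROM users").show()
```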
Conclusion
You’ve successfully set up Spark with Scala in IntelliJ on Windows and written data to Hive. This setup enables you to perform big data processing and SQL-based querying efficiently. 🚀