PySpark setup, common errors Py4JError and StorageUtils$

As a Java developer involved in a Python project, it is funny to see errors generated by the JVM in the console.

These errors could scare a Python developer, but they are kind of comforting for a Java developer, who realizes that the different framework is only apparent.

PySpark forwards instructions from Python to Spark's Scala/Java code through Py4J and runs them in a JVM runtime.

Common errors

StorageUtils$ error

If you get this error:

java.lang.NoClassDefFoundError: Could not initialize class$ 

your Java version is probably not supported by Spark.

Spark 3.2.0 doesn't officially support Java 17. Official support only comes with Spark 3.3.0.
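A quick way to catch this before Spark starts is to check the Java version from Python. The sketch below is only an illustration: the function names and the `SUPPORTED_JAVA` table are my own assumptions (the table reflects Java 8/11 for Spark 3.2 and Java 8/11/17 for Spark 3.3, so verify it against the documentation of your Spark release).

```python
import re
import subprocess

# Assumption: Java major versions supported per Spark minor release.
# Check the documentation of your own Spark version before relying on it.
SUPPORTED_JAVA = {
    (3, 2): {8, 11},
    (3, 3): {8, 11, 17},
}


def parse_java_major(version_output: str) -> int:
    """Extract the Java major version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    major = int(match.group(1))
    # Java 8 and older report themselves as 1.8, 1.7, ...
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_is_supported(spark_version: str, java_major: int) -> bool:
    """Return True if the table above lists java_major for this Spark release."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    # Spark releases missing from the table are reported as unsupported.
    return java_major in SUPPORTED_JAVA.get((major, minor), set())


def installed_java_major() -> int:
    """Ask the `java` found on the PATH for its version."""
    # `java -version` writes its output to stderr, not stdout.
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_major(result.stderr)
```

With this, `java_is_supported("3.2.0", installed_java_major())` would return `False` on a machine where Java 17 is the default.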

Here you can see the changes made in the Spark project for Java 17 compatibility, and their status ... and even with Spark 3.3 you may still run into some issues when running your program.

To avoid the runtime issues, the Spark team added a launcher to use with Java 17:

In some forums, developers tried to fix the module issue by adding this instruction to the JVM options:

--add-exports java.base/ 

In our case, because of the critical importance of the application, we preferred to be safe and stayed with Java 11 as the runtime.

We won't change until there is a widely adopted or official solution.
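If several JDKs are installed on the same machine, one way to stay on Java 11 is to set `JAVA_HOME` before the Spark session is created, since the Spark launch scripts use `JAVA_HOME` when it is set. This is a minimal sketch: the helper name `env_with_java_home` and the JDK path in the usage note are hypothetical.

```python
import os


def env_with_java_home(java_home: str, base_env=None) -> dict:
    """Return a copy of the environment that points at the given JDK."""
    env = dict(os.environ if base_env is None else base_env)
    env["JAVA_HOME"] = java_home
    # Put the JDK's bin directory first so that `java` resolves to it.
    java_bin = os.path.join(java_home, "bin")
    existing_path = env.get("PATH", "")
    env["PATH"] = java_bin + os.pathsep + existing_path if existing_path else java_bin
    return env
```

Calling `os.environ.update(env_with_java_home("/path/to/jdk-11"))` before building the `SparkSession` makes the JVM started by PySpark use that JDK; replace the placeholder path with the location of your own Java 11 installation.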


Another issue we found is a version mismatch between PySpark and Spark.

py4j.protocol.Py4JError: An error occurred while calling Trace: 

In this case we had to make sure that PySpark and Spark use the same version, 3.2.2 in our case.

To check your PySpark version you can use the command:

pyspark --version 

The result should be similar to this one:

>pyspark --version 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.2 
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.18 
Branch HEAD 
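The same check can be scripted. The sketch below reads the Spark version from a banner like the one above and compares it with the version of the `pyspark` package installed in the Python environment. The function names are my own, and the banner text is assumed to be captured already (for example by redirecting the command output to a file).

```python
import re
from importlib import metadata

# Banner copied from the output above.
SAMPLE_BANNER = r"""
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.2
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.18
Branch HEAD
"""


def spark_version_from_banner(banner: str) -> str:
    """Return the Spark version printed next to the ASCII logo."""
    for line in banner.splitlines():
        # Skip the line that reports the Scala version.
        if "Scala" in line:
            continue
        match = re.search(r"\bversion\s+(\d+(?:\.\d+)+)", line)
        if match:
            return match.group(1)
    raise ValueError("no Spark version found in the banner")


def installed_pyspark_version() -> str:
    """Version of the pyspark package installed in this Python environment."""
    return metadata.version("pyspark")


def versions_match(pyspark_version: str, spark_version: str) -> bool:
    """PySpark and Spark should be exactly the same release."""
    return pyspark_version.strip() == spark_version.strip()
```

For example, `versions_match(installed_pyspark_version(), spark_version_from_banner(SAMPLE_BANNER))` tells you whether the Python package and the Spark installation agree.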

Start the standalone Spark cluster on Localhost

Spark is designed as a distributed system. For local development and testing you can start a standalone master with the command:

spark-class org.apache.spark.deploy.master.Master 

If everything works correctly, the log should show the Spark version, e.g.:

23/04/23 17:10:12 INFO Utils: Successfully started service 'sparkMaster' on port 7077. 
23/04/23 17:10:12 INFO Master: Starting Spark master at spark:// 
23/04/23 17:10:12 INFO Master: Running Spark version 3.2.2 
23/04/23 17:10:13 INFO Utils: Successfully started service 'MasterUI' on port 8080. 
23/04/23 17:10:13 INFO MasterWebUI: Bound MasterWebUI to, and started at http://host.docker.internal:8080 
23/04/23 17:10:13 INFO Master: I have been elected leader! New state: ALIVE 
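Once the master is up, you can verify from Python that it is reachable and then connect to it. This is a sketch under some assumptions: the host `localhost` is a guess for a local setup (use the `spark://` address printed in your own log), port 7077 comes from the log above, and the helper names are hypothetical.

```python
import socket


def master_url(host: str = "localhost", port: int = 7077) -> str:
    """Build the URL expected by SparkSession.builder.master()."""
    return f"spark://{host}:{port}"


def master_is_listening(
    host: str = "localhost", port: int = 7077, timeout: float = 2.0
) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def connect(url: str, app_name: str = "local-test"):
    """Create a SparkSession attached to the standalone master."""
    # Imported here so the helpers above work without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master(url).appName(app_name).getOrCreate()
```

A typical use would be to call `connect(master_url())` only when `master_is_listening()` returns `True`, so that a master that was never started is reported early instead of after a long connection timeout.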
