PySpark setup: the common errors Py4JError and StorageUtils$
As a Java developer involved in a Python project, it is funny to see errors generated by the JVM in the console.
These errors could scare a Python developer, but they are rather comforting for a Java developer, who realizes that he is only apparently working in a different framework.
PySpark forwards instructions from Python to Spark's Scala code through Py4J and runs them in a JVM.
If you get this error:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$
your Java version is probably not supported by your Spark version.
Spark 3.2.0 doesn't officially support Java 17. Official support only arrives with Spark 3.3.0.
The Spark project tracks the changes made for Java 17 compatibility and their status, and even with Spark 3.3 you risk running into some issues when running your program.
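A quick way to rule out this mismatch is to check the Java version before starting Spark. The sketch below is only an illustration: the function names and the small support table are made up for this example, and the table only encodes what is stated above (Java 17 from Spark 3.3) plus the assumption that both release lines run on Java 8 and 11.

```python
import re
import subprocess

# Assumption: supported Java majors per Spark release line (illustrative, not exhaustive).
SUPPORTED_JAVA = {"3.2": {8, 11}, "3.3": {8, 11, 17}}


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from the output of `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    major = int(match.group(1))
    # Java 8 and older report themselves as 1.x, e.g. "1.8.0_362".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> int:
    """Ask the `java` binary on the PATH for its version."""
    # `java -version` writes its output to stderr, not stdout.
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_major(result.stderr)


def java_supported(spark_version: str, java_major: int) -> bool:
    """Look up the Spark release line (e.g. "3.2") in the table above."""
    release_line = ".".join(spark_version.split(".")[:2])
    return java_major in SUPPORTED_JAVA.get(release_line, set())
```

Calling java_supported("3.2.2", installed_java_major()) before creating the Spark session would flag the Java 17 case early, instead of failing later inside the JVM.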
To avoid these runtime issues, the Spark team added a launcher to use with Java 17:
In some forums, developers tried to fix the module issue by adding this instruction to the JVM options:
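As an illustration of how such a JVM option could be passed from PySpark, here is a minimal sketch. It assumes the option in question is an --add-opens flag for the sun.nio.ch package (the package that StorageUtils$ typically fails to access on Java 17); the helper names are hypothetical, and the flag may differ from the one used in the forums.

```python
def add_opens_options(packages):
    """Build JVM flags that open JDK-internal packages to unnamed modules."""
    return " ".join(f"--add-opens={pkg}=ALL-UNNAMED" for pkg in packages)


# Assumption: sun.nio.ch is the package behind the StorageUtils$ failure.
JAVA17_OPTIONS = add_opens_options(["java.base/sun.nio.ch"])


def build_session(app_name="java17-demo"):
    """Create a SparkSession whose JVMs are started with the extra flags."""
    # Imported lazily so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        # The options only take effect if no JVM has been started yet.
        .config("spark.driver.extraJavaOptions", JAVA17_OPTIONS)
        .config("spark.executor.extraJavaOptions", JAVA17_OPTIONS)
        .getOrCreate()
    )
```

This is a workaround rather than official support, so it may hide one module error only to reveal the next one.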
In our case, because of the critical importance of the application, we preferred to be safe and stayed with Java 11 as the runtime.
We won't change until there is a widely adopted or official solution.
Another issue we found is a version mismatch between PySpark and Spark.
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.SparkSession. Trace:
In this case we had to make sure that PySpark and Spark were the same version, 3.2.2 in our case.
To check your PySpark version, you can use the command pyspark --version.
The result should be similar to this one:
>pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.2
      /_/

Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.18
Branch HEAD
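The same comparison can be scripted. The following is a minimal sketch with hypothetical helper names: one function pulls the version number out of a banner like the one above, the other checks that two version strings name the same release.

```python
import re


def banner_version(banner: str):
    """Return the first x.y.z number that follows the word "version", or None."""
    match = re.search(r"version (\d+\.\d+\.\d+)", banner)
    return match.group(1) if match else None


def same_release(pyspark_version: str, spark_version: str) -> bool:
    """True when both strings name the same major.minor.patch release."""
    def release(version):
        return tuple(version.strip().split(".")[:3])

    return release(pyspark_version) == release(spark_version)
```

In a real project the two inputs could come from pyspark.__version__ on the Python side and from the version attribute of a running SparkSession on the JVM side.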
Start the standalone Spark cluster on localhost
Spark is designed as a distributed system. For local development and testing, you can start a standalone server with the command:
If everything works correctly, you should see the Spark version in the log output, e.g.:
23/04/23 17:10:12 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
23/04/23 17:10:12 INFO Master: Starting Spark master at spark://192.168.1.1:7077
23/04/23 17:10:12 INFO Master: Running Spark version 3.2.2
23/04/23 17:10:13 INFO Utils: Successfully started service 'MasterUI' on port 8080.
23/04/23 17:10:13 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://host.docker.internal:8080
23/04/23 17:10:13 INFO Master: I have been elected leader! New state: ALIVE
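The master URL printed in this log is what a PySpark application needs in order to connect to the standalone server. As a small sketch (the function name is hypothetical, and the log format is assumed to match the lines above), the URL can be extracted like this:

```python
import re


def find_master_url(log_text: str):
    """Return the spark://host:port URL announced by the master, or None."""
    match = re.search(r"Starting Spark master at (spark://\S+)", log_text)
    return match.group(1) if match else None
```

The resulting value, spark://192.168.1.1:7077 in the log above, can then be passed to SparkSession.builder.master(...) when creating the session.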