PySpark connection to PostgreSQL ... errors and solutions

Updated: 2023-01-03

Marco Molteni

Connect PySpark to Postgres

The goal is to connect the spark session to an instance of PostgreSQL and return some data.

It's possible to set the configuration in the configuration of the environment.
I solved the issue directly in the .ipynb.

To create the connection you need:

the jdbc driver accessible, you can donwload the driver directly from the PostgreSQL website
a lot of memory

java.lang.OutOfMemoryError during the execution of queries

I had issues to initialize the Spark Session and to retrieve the data from Postgres because I didn't assign enough resources.

This error showed itself pretty often:

[Stage 7:>                                                          (0 + 1) / 1] 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "refresh progress" 
 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "executor-kill-mark-cleanup" 
 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Idle Worker Monitor for python3" 
 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "QueryStageCreator-1" 
 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RemoteBlock-temp-file-clean-thread"

As solution, I added the configuration directly in the instantiation of the session:
spark.driver.memory, spark.executor.memory, spark.driver.maxResultSize.

from pyspark import SparkContext 
from pyspark.sql import SparkSession 
 
spark = SparkSession \ 
    .builder \ 
    .appName("Python Spark SQL basic example") \ 
    .config("spark.jars", "/Users/marco/dev/jdbc/postgresql-42.5.1.jar") \ 
    .config("spark.driver.memory","4g")\ 
    .config("spark.executor.memory","2g")\ 
    .config("spark.driver.maxResultSize","1g")\ 
    .getOrCreate()

I declared the url to the database in a variable:

url = 'jdbc:postgresql://192.168.1.111:5432/exampledb'

I added the properties of the connection to a dictionary. I had to explicitly declare the driver used.

Py4JJavaError: An error occurred while calling o34.jdbc. : java.sql.SQLException: No suitable driver

Without "driver":"org.postgresql.Driver" the following error was shown:

Py4JJavaError: An error occurred while calling o34.jdbc. 
: java.sql.SQLException: No suitable driver 
	at java.sql/java.sql.DriverManager.getDriver(DriverManager.java:300) 
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$2(JDBCOptions.scala:107) 
	at scala.Option.getOrElse(Option.scala:189) 
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:107) 
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:39)

Here an example of definition of the properties that worked in my case:

properties_local = { 
    'user': 'me', 
    'password': 'my_password', 
    "driver": "org.postgresql.Driver" 
}

To retrieve the data from a table in the database I used a PySpark DataTable and passed the parameters previously defined:

df_data = spark.read.jdbc( 
    url=url, 
    table="table_name_in_the_database", 
    properties=properties)

Next round of errors

I will update the page with the next issues that I will discover. I guess that if you use MySQL you will encounter the same issues.