
Databricks Wheel Job

Recently I successfully deployed my Python wheel to a Databricks cluster. Here are some tips if you plan to deploy a PySpark project the same way.

  • pyspark project
  • pytest

My previous Spark project was Scala based, and I used IntelliJ IDEA to compile and test it conveniently.

😄😄😄

The nice Databricks Jobs UI saves you the time of creating a JAR job.

This is the official guide: Databricks Wheel Job

What I did:

  1. Initialize a Python project

    Terminal window
    # create a python virtual environment
    python -m venv pyspark_venv
    # activate your venv
    source pyspark_venv/bin/activate
    # check which python you are using now
    which python
    # install python libs (build is needed later for `python -m build`)
    pip install uv ruff pyspark pytest wheel build
    ## if pip fails with a proxy error,
    ## add your proxy:
    ## --proxy http://proxy:port
    # create your project
    uv init --package <your package name>

    After the uv command completes, a nice Python project is created.

    Terminal window
    pyspark-app
    ├── README.md
    ├── pyproject.toml
    └── src
        └── pyspark_app
            └── __init__.py
  2. ❗ PySpark entry point

    • add a file __main__.py under pyspark_app
    • modify [project.scripts] in pyproject.toml; this is the entry point of the Databricks job (see the sketch after the tree below)

    Now the project looks like this:

    Terminal window
    pyspark-app
    ├── README.md
    ├── pyproject.toml
    └── src
        └── pyspark_app
            ├── __init__.py
            └── __main__.py
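
    For reference, here is a minimal sketch of what __main__.py could look like. The function name main and everything inside it are my own placeholders, not something uv generates for you; replace the body with your actual Spark logic.

    __main__.py
    import sys

    from pyspark.sql import SparkSession


    def main() -> None:
        # On a Databricks cluster a SparkSession already exists;
        # getOrCreate() attaches to it instead of starting a new one.
        spark = SparkSession.builder.getOrCreate()
        # Databricks job parameters arrive as ordinary command-line arguments.
        print(f"job arguments: {sys.argv[1:]}")
        spark.range(10).show()


    if __name__ == "__main__":
        main()

    The matching pyproject.toml entry would then be a single line such as pyspark-app = "pyspark_app.__main__:main" under [project.scripts] (the script name pyspark-app is just an example). The package name and entry point you later configure in the Databricks wheel task must point at exactly this function.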

Make sure pytest is installed. Let's create a new test package:

Terminal window
pyspark-app
├── README.md
├── pyproject.toml
└── src
    └── pyspark_app
        ├── __init__.py
        ├── __main__.py
        └── test
            ├── __init__.py
            ├── conftest.py
            └── test_spark.py
test_spark.py
def test_spark(init_spark):
    spark = init_spark
    df = spark.range(10)
    df.show()

""" output
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/01 20:59:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
PASSED [100%]
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
"""

Now you can work on your Spark application with tests.

The final step is building the wheel file.

Terminal window
# 1. change your working directory to the folder containing pyproject.toml
# 2. run the command below
python -m build --wheel
# the project now looks like this
pyspark-app
├── README.md
├── build
│   ├── bdist.macosx-12.0-x86_64
│   └── lib
│       └── pyspark_app
│           ├── __init__.py
│           ├── __main__.py
│           └── test
│               ├── __init__.py
│               ├── conftest.py
│               └── test_spark.py
├── dist
│   └── pyspark_app-0.1.0-py3-none-any.whl
├── pyproject.toml
└── src
    ├── pyspark_app
    │   ├── __init__.py
    │   ├── __main__.py
    │   └── test
    │       ├── __init__.py
    │       ├── conftest.py
    │       └── test_spark.py
    └── pyspark_app.egg-info
        ├── PKG-INFO
        ├── SOURCES.txt
        ├── dependency_links.txt
        ├── entry_points.txt
        └── top_level.txt

Your wheel file is at dist/pyspark_app-0.1.0-py3-none-any.whl.
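
If you want to double-check what actually went into the wheel before uploading it to Databricks, a wheel is just a zip archive, so Python's standard zipfile module can list its contents (the file name below matches the dist output above; adjust it to your version):

import zipfile

# List every file that was packaged into the wheel.
with zipfile.ZipFile("dist/pyspark_app-0.1.0-py3-none-any.whl") as whl:
    for name in whl.namelist():
        print(name)

You should see pyspark_app/__main__.py in the listing, plus a *.dist-info/entry_points.txt containing the script you declared in [project.scripts].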

See more at: Project template