
Databricks Wheel Job

Recently I successfully deployed my Python wheel to a Databricks cluster. Here are some tips if you plan to deploy a PySpark project the same way.

  • pyspark project
  • pytest

My previous Spark project was Scala based, and I used IntelliJ IDEA to compile and test it conveniently.

😄😄😄

The nice Databricks Jobs UI saves you the time of creating a JAR job.

This is the official guide: Databricks Wheel Job

What I did:

  1. Initialize a Python project

    Terminal window
    # create a python virtual environment
    python -m venv pyspark_venv
    # activate your venv
    source pyspark_venv/bin/activate
    # check which python you are using now
    which python
    # install python libs (build is needed later for `python -m build`)
    pip install uv ruff pyspark pytest wheel build
    ## if pip fails with a proxy error,
    ## add your proxy:
    ## --proxy http://proxy:port
    # create your project
    uv init --package <your package name>

    After the uv command completes, a nice Python project is created.

    Terminal window
    pyspark-app
    ├── README.md
    ├── pyproject.toml
    └── src
        └── pyspark_app
            └── __init__.py
  2. ❗ PySpark entry point

    • add a file __main__.py under pyspark_app
    • modify [project.scripts] in pyproject.toml; this is the entry point of the Databricks job (see the sketch after the tree below)

    Now the project looks like this:

    Terminal window
    pyspark-app
    ├── README.md
    ├── pyproject.toml
    └── src
        └── pyspark_app
            ├── __init__.py
            └── __main__.py
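
    For reference, here is a minimal sketch of what __main__.py could look like. The function name main and everything inside it are my own placeholders, not something uv generates for you; replace the body with your actual Spark logic.

    __main__.py
    import sys

    from pyspark.sql import SparkSession


    def main() -> None:
        # On a Databricks cluster a SparkSession already exists;
        # getOrCreate() attaches to it instead of starting a new one.
        spark = SparkSession.builder.getOrCreate()
        # Databricks job parameters arrive as ordinary command-line arguments.
        print(f"job arguments: {sys.argv[1:]}")
        spark.range(10).show()


    if __name__ == "__main__":
        main()

    The matching pyproject.toml entry would then be a single line such as pyspark-app = "pyspark_app.__main__:main" under [project.scripts] (the script name pyspark-app is just an example). The package name and entry point you later configure in the Databricks wheel task must point at exactly this function.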

Make sure pytest is installed. Let's create a new test package:

Terminal window
pyspark-app
├── README.md
├── pyproject.toml
└── src
    └── pyspark_app
        ├── __init__.py
        ├── __main__.py
        └── test
            ├── __init__.py
            ├── conftest.py
            └── test_spark.py
test_spark.py
def test_spark(init_spark):
    spark = init_spark
    df = spark.range(10)
    df.show()

""" output
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/01 20:59:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
PASSED [100%]
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
"""

Now you can work on your Spark application with tests.

The final step is building the wheel file.

Terminal window
# 1. change your working directory to the folder containing pyproject.toml
# 2. run the command below
python -m build --wheel
# the project now looks like this
pyspark-app
├── README.md
├── build
│   ├── bdist.macosx-12.0-x86_64
│   └── lib
│       └── pyspark_app
│           ├── __init__.py
│           ├── __main__.py
│           └── test
│               ├── __init__.py
│               ├── conftest.py
│               └── test_spark.py
├── dist
│   └── pyspark_app-0.1.0-py3-none-any.whl
├── pyproject.toml
└── src
    ├── pyspark_app
    │   ├── __init__.py
    │   ├── __main__.py
    │   └── test
    │       ├── __init__.py
    │       ├── conftest.py
    │       └── test_spark.py
    └── pyspark_app.egg-info
        ├── PKG-INFO
        ├── SOURCES.txt
        ├── dependency_links.txt
        ├── entry_points.txt
        └── top_level.txt

Your wheel file is at dist/pyspark_app-0.1.0-py3-none-any.whl.
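
If you want to double-check what actually went into the wheel before uploading it to Databricks, a wheel is just a zip archive, so Python's standard zipfile module can list its contents (the file name below matches the dist output above; adjust it to your version):

import zipfile

# List every file that was packaged into the wheel.
with zipfile.ZipFile("dist/pyspark_app-0.1.0-py3-none-any.whl") as whl:
    for name in whl.namelist():
        print(name)

You should see pyspark_app/__main__.py in the listing, plus a *.dist-info/entry_points.txt containing the script you declared in [project.scripts].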

See more at: Project template