Spark Structured Streaming
I recently read the blog post Structured Streaming in PySpark, which is implemented on the Databricks platform. I then tried to reproduce it on my local Spark installation, and some tricky issues came up along the way.
Reading Data
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import TimestampType, StringType, StructType, StructField

spark = SparkSession.builder.appName("Test Streaming").enableHiveSupport().getOrCreate()

json_schema = StructType([
    StructField("time", TimestampType(), True),
    StructField("customer", StringType(), True),
    StructField("action", StringType(), True),
    StructField("device", StringType(), True)
])

file_path = "file:///..."  # local path to the JSON files
```
Read JSON the same way as in the blog
```python
input = spark.read.schema(json_schema).json(file_path)
input.show()
# +----+--------+------+------+
# |time|customer|action|device|
# +----+--------+------+------+
# |null|    null|  null|  null|
# |null|    null|  null|  null|
# |null|    null|  null|  null|
# ... (all 20 displayed rows are null)
# +----+--------+------+------+

input.count()
# 20000
```
All values are null, yet the count is right. Spark has read all the data, but the schema is not mapped correctly.
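To see what the JSON reader actually receives, a quick debugging step (a sketch, reusing the file_path from above) is to load the files as plain text:

```python
# Read the raw file contents as unparsed text lines.
raw = spark.read.text(file_path)
raw.show(3, truncate=False)
```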
Read a single JSON file to check the schema
```python
input = spark.read.schema(json_schema).json(file_path + '/1.json')
input.show()
# +----+--------+------+------+
# |time|customer|action|device|
# +----+--------+------+------+
# |null|    null|  null|  null|
# ... (same as before: every value is null)
# +----+--------+------+------+

# Same error. Drop the explicit schema and let Spark infer it instead.
input = spark.read.json(file_path + '/1.json')
input.show()
# +--------------------+-----------+-----------------+--------------------+---------------+
# |     _corrupt_record|     action|         customer|              device|           time|
# +--------------------+-----------+-----------------+--------------------+---------------+
# |[{"time":"3:57:09...|       null|             null|                null|           null|
# |                null|  power off|Nicolle Pargetter| August Doorbell Cam| 1:29:05.000 AM|
# |                null|   power on|   Concordia Muck|Footbot Air Quali...| 6:02:06.000 AM|
# |                null|  power off| Kippar McCaughen|             ecobee4| 5:40:19.000 PM|
# ... (the remaining rows parse normally)
# +--------------------+-----------+-----------------+--------------------+---------------+
```
A weird _corrupt_record column appears, and its first value is [{"time":"3:57:09…. Going back to the source file, I noticed that the whole file is a JSON array of objects.
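For reference, each source file looks roughly like this (reconstructed from the output above; field values come from the first two rows). The entire file is one JSON array, while Spark's default JSON reader expects exactly one object per line:

```json
[{"time": "3:57:09.000 PM", "customer": "Alexi Barts", "action": "power off", "device": "GreenIQ Controller"},
 {"time": "1:29:05.000 AM", "customer": "Nicolle Pargetter", "action": "power off", "device": "August Doorbell Cam"}]
```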
Remove [ and ] in source file
```python
input = spark.read.json(file_path + '/1.json')
input.show()
# +-----------+-----------------+--------------------+---------------+
# |     action|         customer|              device|           time|
# +-----------+-----------------+--------------------+---------------+
# |  power off|      Alexi Barts| GreenIQ Controller| 3:57:09.000 PM|
# |  power off|Nicolle Pargetter| August Doorbell Cam| 1:29:05.000 AM|
# |   power on|   Concordia Muck|Footbot Air Quali...| 6:02:06.000 AM|
# ... (all rows now parse correctly)
# +-----------+-----------------+--------------------+---------------+
```
Woo, the DataFrame is correct. Let's check the schema:
```python
input.printSchema()
# root
#  |-- action: string (nullable = true)
#  |-- customer: string (nullable = true)
#  |-- device: string (nullable = true)
#  |-- time: string (nullable = true)
```
So far I have manually modified the source file and dropped the external schema to obtain a correct DataFrame. Is there any way to read these files without those steps?
Add one option: multiLine
Read the file without a schema, but add the multiLine option:
```python
input = spark.read.json("file:///path/pyspark_test_data", multiLine=True)
# or equivalently:
# input = spark.read.option('multiLine', True).json("file:///path/pyspark_test_data")

input.show()
# +-----------+--------------------+--------------------+---------------+
# |     action|            customer|              device|           time|
# +-----------+--------------------+--------------------+---------------+
# |   power on|     Raynor Blaskett|Nest T3021US Ther...| 3:35:09.000 AM|
# |   power on|Stafford Blakebrough| GreenIQ Controller|10:59:46.000 AM|
# |   power on|      Alex Woolcocks|Nest T3021US Ther...| 6:26:36.000 PM|
# ... (the remaining rows parse correctly)
# +-----------+--------------------+--------------------+---------------+

input.printSchema()
# root
#  |-- action: string (nullable = true)
#  |-- customer: string (nullable = true)
#  |-- device: string (nullable = true)
#  |-- time: string (nullable = true)
```
Change the schema
Set time to StringType:
```python
json_schema = StructType([
    StructField("time", StringType(), True),
    StructField("customer", StringType(), True),
    StructField("action", StringType(), True),
    StructField("device", StringType(), True)
])

input = spark.read.schema(json_schema).json("file:///path/pyspark_test_data", multiLine=True)
input.show()
# +---------------+--------------------+-----------+--------------------+
# |           time|            customer|     action|              device|
# +---------------+--------------------+-----------+--------------------+
# | 3:35:09.000 AM|     Raynor Blaskett|   power on|Nest T3021US Ther...|
# |10:59:46.000 AM|Stafford Blakebrough|   power on| GreenIQ Controller|
# | 6:26:36.000 PM|      Alex Woolcocks|   power on|Nest T3021US Ther...|
# ... (the remaining rows load correctly)
# +---------------+--------------------+-----------+--------------------+
```
PySpark can load the JSON files successfully without TimestampType. But how do we handle the timestamp issue in this job?
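One alternative, shown here only as a sketch, is to keep time as a string at read time and parse it afterwards with to_timestamp (the column name time_ts is made up for illustration; in Spark 3 the AM/PM pattern letter is a single "a"):

```python
from pyspark.sql import functions as F

# Parse the string time-of-day into a proper timestamp column.
parsed = input.withColumn("time_ts", F.to_timestamp("time", "h:mm:ss.SSS a"))
parsed.printSchema()  # time_ts will be of timestamp type
```

The next section uses the reader's built-in option instead.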
TimestampType
In the official documentation, the class pyspark.sql.DataFrameReader has one parameter:
- timestampFormat: sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type. If None is set, it uses the default value, yyyy-MM-dd'T'HH:mm:ss.SSSXXX.
```python
# Note: json_schema must define time as TimestampType here, as in the
# first section, for timestampFormat to take effect.
input = spark.read.schema(json_schema).option("multiLine", True) \
    .json("file:///path/pyspark_test_data", timestampFormat="h:mm:ss.SSS aa")
```
```python
input.show()
# +-------------------+--------------------+-----------+--------------------+
# |               time|            customer|     action|              device|
# +-------------------+--------------------+-----------+--------------------+
# |1970-01-01 03:35:09|     Raynor Blaskett|   power on|Nest T3021US Ther...|
# |1970-01-01 10:59:46|Stafford Blakebrough|   power on| GreenIQ Controller|
# |1970-01-01 18:26:36|      Alex Woolcocks|   power on|Nest T3021US Ther...|
# ... (all rows parse, with every date defaulting to 1970-01-01)
# +-------------------+--------------------+-----------+--------------------+
```
Every date is 1970-01-01 because the source files contain only an hh:mm:ss time of day; with no date component, Spark falls back to the epoch date. The source files were simply generated in this incomplete format on Windows.
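If the true event date were known, a hypothetical workaround is to graft it onto the time-of-day string before parsing. This assumes time was read as a StringType (as in the previous section); the date literal below is invented for illustration:

```python
from pyspark.sql import functions as F

# Prepend a known event date to the bare time-of-day string, then parse
# the combined string into a full timestamp.
event_date = "2019-01-01"  # assumption: the actual date of these events
fixed = input.withColumn(
    "time",
    F.to_timestamp(F.concat(F.lit(event_date + " "), F.col("time")),
                   "yyyy-MM-dd h:mm:ss.SSS a"))
```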
Streaming Our Data
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import TimestampType, StringType, StructType, StructField

spark = SparkSession.builder.appName("Test Streaming").enableHiveSupport().getOrCreate()

json_schema = StructType([
    StructField("time", StringType(), True),
    StructField("customer", StringType(), True),
    StructField("action", StringType(), True),
    StructField("device", StringType(), True)
])

streamingDF = spark.readStream.schema(json_schema) \
    .option("maxFilesPerTrigger", 1) \
    .option("multiLine", True) \
    .json("file:///path/pyspark_test_data")
```
```python
streamingActionCountsDF = streamingDF.groupBy('action').count()
# streamingActionCountsDF.isStreaming  # True for a streaming DataFrame

spark.conf.set("spark.sql.shuffle.partitions", "2")
```
```python
# View the stream in real time with the memory sink:
# query = streamingActionCountsDF.writeStream \
#     .format("memory").queryName("counts").outputMode("complete").start()
```
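With the memory sink, the running aggregation is written to an in-memory table named by queryName(), which can be polled with ordinary SQL while the stream runs. A small sketch, assuming the commented-out query above has been started:

```python
import time

# Poll the in-memory "counts" table a few times as the stream updates it.
for _ in range(3):
    spark.sql("SELECT * FROM counts").show()
    time.sleep(5)
```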
```python
# Sink format choices: parquet, kafka, console, memory.
# query = streamingActionCountsDF.writeStream \
#     .format("console").queryName("counts").outputMode("complete").start()

# Output mode choices: append, complete, update.
query = streamingActionCountsDF.writeStream.format("console") \
    .queryName("counts").outputMode("complete").start()
# Keep the StreamingQuery handle; awaitTermination(timeout=10) blocks for
# at most 10 seconds while micro-batches run. (Chaining it directly onto
# start() would overwrite `query` with a boolean.)
query.awaitTermination(timeout=10)
```
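The StreamingQuery handle can also be inspected and stopped explicitly; a short sketch of the standard calls:

```python
print(query.status)        # current state of the stream
print(query.lastProgress)  # metrics for the most recent micro-batch
query.stop()               # shut the streaming query down
```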