Spark SQL Dialect

The Spark SQL dialect extends the base SQL dialect with backtick-quoted column identifiers, matching Apache Spark's SQL syntax.

Dependency

libraryDependencies += "com.eff3ct" %% "criteria4s-sparksql" % "1.0.0"
note

Apache Spark itself currently requires Scala 2.13, while the criteria4s-sparksql module compiles with Scala 3. This works because the module does not depend on Spark at runtime: it only generates SQL expression strings, which you then pass to Spark's where() / filter() methods.

Import Pattern

import com.eff3ct.criteria4s.core.*
import com.eff3ct.criteria4s.dialect.sparksql.{*, given}
import com.eff3ct.criteria4s.functions as F
import com.eff3ct.criteria4s.extensions.*

Column Quoting

Spark SQL uses backtick-quoted identifiers, identical to MySQL:

val column = summon[Show[Column, SparkSQL]]
// column: Show[Column, SparkSQL] = com.eff3ct.criteria4s.core.Show$$$Lambda$2369/0x00007ff230779010@8e68a8b
column.show(Column("user_name"))
// res0: String = "`user_name`"
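The quoting rule itself is easy to picture. As an illustrative sketch (this is not the library's actual implementation), a Spark-style quoter wraps the identifier in backticks and escapes any embedded backtick by doubling it:

```scala
// Illustrative sketch, not criteria4s' real code: quote an identifier
// the way Spark SQL expects, doubling any backtick inside the name.
def quoteSparkIdentifier(name: String): String =
  "`" + name.replace("`", "``") + "`"

@main def demoQuoting(): Unit =
  println(quoteSparkIdentifier("user_name")) // `user_name`
  println(quoteSparkIdentifier("odd`name"))  // `odd``name`
```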

All predicates render with backtick-quoted column names:

F.===[SparkSQL, Column, Int](F.col("age"), F.lit(30)).value
// res1: String = "`age` = 30"

Inherited Operations

SparkSQL inherits every operation from the base SQL dialect. All predicates, conjunctions, transforms, ordering, LIMIT/OFFSET, and CASE WHEN work identically — only the column quoting differs.

Predicate Examples

// Comparison
F.gt[SparkSQL, Column, Int](F.col("age"), F.lit(18)).value
// res2: String = "`age` > 18"

// Pattern matching
F.like[SparkSQL, Column, String](F.col("name"), F.lit("%John%")).value
// res3: String = "`name` LIKE %John%"

// Set membership
F.in[SparkSQL, Column, Seq[String]](
  F.col("status"), F.array[SparkSQL, String]("active", "pending")
).value
// res4: String = "`status` IN (active, pending)"

// Null checks
F.isNull[SparkSQL, Column](F.col("partition_date")).value
// res5: String = "`partition_date` IS NULL"

// Range
F.between[SparkSQL, Column, (Int, Int)](
  F.col("score"), F.range[SparkSQL, Int](70, 100)
).value
// res6: String = "`score` BETWEEN 70 AND 100"
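The IN and BETWEEN payloads above follow a simple rendering scheme: elements joined with `, ` inside parentheses, and a `lo AND hi` pair. A standalone sketch of that scheme (the helper names `renderIn` and `renderRange` are hypothetical, not criteria4s API):

```scala
// Hypothetical helpers mirroring the rendered output above;
// not part of the criteria4s API.
def renderIn(values: Seq[String]): String = values.mkString("(", ", ", ")")
def renderRange(lo: Int, hi: Int): String = s"$lo AND $hi"

@main def demoRender(): Unit =
  println("`status` IN " + renderIn(Seq("active", "pending")))
  // `status` IN (active, pending)
  println("`score` BETWEEN " + renderRange(70, 100))
  // `score` BETWEEN 70 AND 100
```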

Practical Examples

DataFrame filter expression

criteria4s generates the filter string, which you then pass to Spark's string-based filter API:

val sparkFilter = F.col[SparkSQL]("age")
.geq(F.lit[SparkSQL, Int](18))
.and(F.col[SparkSQL]("country") === F.lit[SparkSQL, String]("US"))
// sparkFilter: Criteria[SparkSQL] = (`age` >= 18) AND (`country` = US)

sparkFilter.value
// res7: String = "(`age` >= 18) AND (`country` = US)"
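The parenthesised AND above shows how criteria compose: each side is wrapped in parentheses before joining, so precedence stays explicit in the generated string. A minimal standalone sketch of that pattern (assumed, not the library's actual Criteria type):

```scala
// Standalone sketch of the composition pattern; criteria4s' real
// Criteria type differs, but renders the same shape of output.
final case class Expr(value: String):
  def and(other: Expr): Expr = Expr(s"(${value}) AND (${other.value})")
  def or(other: Expr): Expr  = Expr(s"(${value}) OR (${other.value})")

@main def demoCompose(): Unit =
  val adult = Expr("`age` >= 18")
  val us    = Expr("`country` = US")
  println(adult.and(us).value) // (`age` >= 18) AND (`country` = US)
```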

In your Scala 2.13 Spark project, pass the rendered string to the DataFrame API:

// df: org.apache.spark.sql.DataFrame
val filterStr: String = sparkFilter.value
df.where(filterStr)

ETL partition filter

val partitionFilter = F.col[SparkSQL]("partition_date")
.between(F.range[SparkSQL, String]("2025-01-01", "2025-03-31"))
.and(F.col[SparkSQL]("region").in(F.array[SparkSQL, String]("us-east-1", "eu-west-1")))
// partitionFilter: Criteria[SparkSQL] = (`partition_date` BETWEEN 2025-01-01 AND 2025-03-31) AND (`region` IN (us-east-1, eu-west-1))

partitionFilter.value
// res8: String = "(`partition_date` BETWEEN 2025-01-01 AND 2025-03-31) AND (`region` IN (us-east-1, eu-west-1))"

Data quality check

val qualityCheck = F.col[SparkSQL]("email").isNotNull
.and(F.col[SparkSQL]("email").like(F.lit[SparkSQL, String]("%@%")))
.and(F.col[SparkSQL]("age") :> F.lit[SparkSQL, Int](0))
// qualityCheck: Criteria[SparkSQL] = ((`email` IS NOT NULL) AND (`email` LIKE %@%)) AND (`age` > 0)

qualityCheck.value
// res9: String = "((`email` IS NOT NULL) AND (`email` LIKE %@%)) AND (`age` > 0)"