Scala Snacks: Part 1 - Sets are invariantly interesting

As software engineers we often stumble upon interesting technical curiosities relating to the languages, libraries and infrastructure we employ in our systems. Exploring these novel behaviours can give us a better understanding of the deeper complexities of those technologies. This is the first post of a series which will explore some such informative nuggets we've come across while building our Scala services here at Football Radar. Here's our first Scala snack.

Setting the scene

I was recently attempting to refactor the code around data being consumed and produced by certain APIs between services. We'll use the typical inheritance model of animals and mammals to illustrate the example. Implementation details aren't important.

sealed trait Animal  
sealed trait Mammal extends Animal  
sealed trait Bird extends Animal  

Now let's suppose we're consuming some API which sends us animals in string format, for example some JSON off a Kafka queue. We have a requirement that all animals must be of the same type in the consumed data, and this validation has already been done. The function below deserialises the input.

def deserialiseAnimals(jsonBlob: String, animalType: Int): Set[Animal] = animalType match {  
  ...
  case Mammal.id =>
    val mammals: Set[Mammal] = deserialiseMammals(jsonBlob)
    mammals.map(x => mammalWithName(x))
}

Not too nice. Straight off there should probably a wildcard or bound generic type parameter in the signature, for example return a Set[T] where T <: Animal. This would encode the API's requirement at compile-time and avoid the pattern matching. We'll come back to that at the end. The Object in which this resides is bloated due to many similar deserialisation functions. One thing we could do is move functions to the companion objects of the animals, for example:

case object Mammal {  
  def deserialise(jsonBlob: String): Set[Mammal] = { ... }
}

We would then have something like

def deserialiseAnimals(jsonBlob: String, animalType: Int): Set[Animal] = animalType match {  
  ...
  case Mammal.id => Mammal.deserialise(jsonBlob)
}

You may or may not be surprised that this doesn't compile. Interesting! There are two independent questions here: why doesn't it compile now, and why did it compile before?

Why doesn't it compile?

Sets in Scala are invariant. Specifically, the signature of the Set trait is defined here as trait Set[A]. This means that, using Scala type parameter syntax, if A >: B holds then both Set[A] >: Set[B] and Set[B] >: Set[A] do not hold. Hence we cannot return a Set[Mammal] or Set[Bird] from this function. Why is that? The developers of Scala rarely do anything without good reason. I dug a bit deeper.

The Scala List[+A] class is covariant. This seems intuitive, since we expect to be able to assign a List[Mammal] to List[Animal]. Well, maybe that claim itself is debatable but I'll leave that one for another time. On the REPL:

scala> trait Animal  
defined trait Animal

scala> trait Mammal extends Animal  
defined trait Mammal

scala> val animals: List[Animal] = List[Mammal](new Mammal{})  
animals: List[Animal] = List($anon$1@73b10975)  

Actually this is only safe because Scala lists are immutable. Hence if we try to do the following, we get an error.

scala> trait Bird extends Animal  
defined trait Bird

scala> animals(0) = new Bird {}  
<console>:14: error: value update is not a member of List[Animal]  
       animals(0) = new Bird {}
       ^

Well that's a relief! If that weren't the case we could end up trying to wedge a Bird into a List[Mammal] and it would only blow up at runtime. Scala does provide a MutableList as well, but that's invariant so the compiler will stop us:

scala> import scala.collection.mutable._  
import scala.collection.mutable._

scala> val animals: MutableList[Animal] = new MutableList[Mammal]()  
<console>:22: error: type mismatch;  
 found   : scala.collection.mutable.MutableList[Mammal]
 required: scala.collection.mutable.MutableList[Animal]
Note: Mammal <: Animal, but class MutableList is invariant in type A.  

We're covered either way. This is a pitfall of Java and C# arrays. They are both covariant, for historical reasons. Lists in both languages are invariant, because they are not immutable (by default). Put succinctly by C# guru Jon Skeet:

No, a List<Dog> is not a List<Animal>. Consider what you can do with a List<Animal> - you can add any animal to it... including a cat. Now, can you logically add a cat to a litter of puppies? Absolutely not.

I digress. So why are sets in Scala invariant, despite being immutable? Martin Odersky assigns the invariance of sets to their implementation details:

On the issue of sets, I believe the non-variance stems also from the implementations. Common sets are implemented as hashtables, which are non-variant arrays of the key type. I agree it's a slightly annoying irregularity.

That's sad, but probably true. An abstraction being defined based on common implementations.

I prefer to believe there could be deeper reasoning. I like this justification. We can think of a Set[A] as simply a mapping - for every candidate element A, is the element present in the set or not? Here's an interesting question on that topic. This representation of a set treats it as a characteristic (indicator) function. I'm on board with that notion.

So in explicit terms a Set[A] can be thought of as a Scala function of type Function[A, Boolean]. It turns out scala.collection.Set extends (A => Boolean) so it literally is one of those! Clearly I'm not the only one who likes the idea of sets as functions.

This stackoverflow answer does, however, skirt over something I had difficulty convincing myself of:

due to the contravariance of functions

Specifically, functions are contravariant in their input parameters. This might be obvious to some - it's something we take advantage of every day. Indeed the Scala doc of function types indicate as much. I'll leave that discussion for another time. For now, let's take it as a given. To state what that means explicitly: if A >: B then Function[A, Boolean] <: Function[B, Boolean].

I think we could now construct a pseudo-proof of Scala Set invariance. Forgive my questionable use of symbols in the following:

Let A >: B. Treating sets as functions, we can say Set[A] = Function[A, Boolean] and Set[B] = Function[B, Boolean]1. But since functions are contravariant, it follows that Set[A] <: Set[B]. This means sets are at best contravariant (at best meaning they could also be invariant, since subtyping is an identity relation2). Now we can make an assumption that sets are covariant, that is Set[A] >: Set[B]. This is a contradiction, since covariance and contravariance are mutually exclusive. We've proved by contradiction that sets cannot be covariant. What about contravariance? This also seems counterintuitive in a data structure sense. It means we could make the assignment val bs: Set[B] = Set[A](new C{}) but when we try to read out the element as a B we would get a class cast exception. We can reinforce this intuition more rigorously - a set has the requirement of implementing Iterable which is covariant. This rules out contravariance. The best we can settle for is invariance.

Ok, now I'm convinced that Scala sets (and set implementations in general) should probably be invariant. In terms of the animal example, as mentioned previously a nice solution might be to use an explicit bound type parameter to enforce the requirement of a single type of animal. The signature would be - def deserialiseAnimals[T <: Animal](...): Set[T]. That's assuming we can't restructure the API in a better way, which was impractical in this particular case.

Anyway, onwards and upwards.

Why did it compile before?

Recall that we had:

    val mammals: Set[Mammal] = deserialiseMammals(jsonBlob)
    mammals.map(x => mammalWithName(x))

When I refactored this code, it no longer compiled. What's the difference? The map function seems to be the only difference. To the REPL:

scala> val animal: Set[Animal] = Set[Mammal](new Mammal{})  
<console>:13: error: type mismatch;  
 found   : scala.collection.immutable.Set[Mammal]
 required: Set[Animal]
Note: Mammal <: Animal, but trait Set is invariant in type A.  

Ok fair, we expect that now. We know Scala sets are invariant. What about:

scala> val animal: Set[Animal] = Set[Mammal](new Mammal{}).map(x => x)  
animal: Set[Animal] = Set($anon$1@6793f752)  

Aha! The map function appears to "break" the invariance of sets in Scala. Those better versed in Scala than I will immediately know the reason for this. Let's take a look at the signature the map used by Set:

 def map[B, That](f: A => B)(implicit bf: CanBuildFrom[This, B, That]): That

A cheeky implicit. This blog post by a Scala library developer gives a nice overview of how CanBuildFrom works, and why they're engineering it out of Scala 2.13. In summary, it provides a way for the compiler to "choose" which type to return from a function.

In our case, A is Mammal, B is Mammal, This is Set[Mammal] and That is Set[Animal]. Our function f is of type (Mammal => Mammal). Here's the source code for that map function. The function f is iteratively applied to get us some Bs and the bf creates a builder which takes those elements of type B and builds a That out of them. Let's convince ourselves that this is what's actually happening in the REPL - you never know with implicits.

First we need to get our hands on a CanBuildFrom instance. It's not totally obvious where the compiler is finding the in-scope implicit bf. After a bit of digging it can be traced up through the Set's trait hierarchy. We end up finding an instance exposed through a public function in scala.collection.generic.GenTraversableFactory3:

def ReusableCBF: GenericCanBuildFrom[Nothing] = ...  

This is used in various places to create CanBuildFrom instances, for example on the GenTraversable companion object4:

implicit def canBuildFrom[A] = ReusableCBF.asInstanceOf[GenericCanBuildFrom[A]]  

This looks like what we need in the REPL. Let's try to hack an instance together and explicitly call map with the CanBuildFrom instance:

scala> import scala.collection._  
import scala.collection._

scala> import scala.collection.generic._  
import scala.collection.generic._

scala> val animalCbf = GenIterable.ReusableCBF.asInstanceOf[CanBuildFrom[Set[Mammal], Mammal, Set[Animal]]]  
animalCbf: scala.collection.generic.CanBuildFrom[scala.collection.Set[Mammal],Mammal,scala.collection.Set[Animal]] = scala.collection.generic.GenTraversableFactory$$anon$1@1132baa3

scala> val animals: Set[Animal] = Set[Mammal](new Mammal {}).map(x => x)(animalCbf)  
animals: scala.collection.Set[Animal] = Set($anon$1@7d15c513)  

Nice. Presumably that's similar to what's happening at runtime in the real code, except that the implicit is resolved by the compiler beforehand. Now let's fudge the CanBuildFrom by changing the That to Set[Mammal]:

scala> val mammalCbf = GenIterable.ReusableCBF.asInstanceOf[CanBuildFrom[Set[Mammal], Mammal, Set[Mammal]]]  
mammalCbf: scala.collection.generic.CanBuildFrom[scala.collection.Set[Mammal],Mammal,scala.collection.Set[Mammal]] = scala.collection.generic.GenTraversableFactory$$anon$1@1132baa3

scala> val animals: Set[Animal] = Set[Mammal](new Mammal {}).map(x => x)(mammalCbf)  
<console>:26: error: type mismatch;  
 found   : scala.collection.generic.CanBuildFrom[scala.collection.Set[Mammal],Mammal,scala.collection.Set[Mammal]]
 required: scala.collection.generic.CanBuildFrom[scala.collection.Set[Mammal],Mammal,scala.collection.Set[Animal]]

That proves it! The reason the code compiled before is that the compiler was resolving an implicit CanBuildFrom variable somewhere in scope, which enabled it to choose the parent type Animal in place of its subtype Mammal. If we explicitly call map with a different CanBuildFrom the code doesn't compile.

It's interesting to get a glimpse into what the Scala compiler does behind the scenes. I'd be curious to know whether the author of the code I was refactoring knew that there was an implicit being pulled in, without which their code would not compile. As of Scala 2.13, code like that will continue to compile, but the means by which it's achieved will be completely different. I'll point you again to this blog post for more details on that subject.


Well, that was fun. Forgive me that this particular Scala snack was not so much a snack as a hearty breakfast. At least I learnt some things.

Here are a couple more of our Scala blogs for those interested:

Citations:

  1. Probably what this "=" really means is something like sets and functions are isomorphic, meaning that there's a bijection between the set of all sets and the set of all functions. For each set there exists a function that can entirely describe the contents of the set, and vice versa. Any proper Mathematicians out there can validate whether that statement is remotely correct!

  2. For example see this formal mathematical definition.

  3. Source code: https://github.com/scala/scala/blob/v2.12.1/src/library/scala/collection/generic/GenTraversableFactory.scala#L45

  4. Source code: https://github.com/scala/scala/blob/v2.12.1/src/library/scala/collection/GenTraversable.scala#L31