Prometheus at Football Radar

The success of Football Radar depends heavily on the availability and performance of the systems built and maintained by the Engineering team.

In recent years we feel we’ve been pretty successful building out our microservices architecture using Scala, Docker and the awesome tag team of Marathon and Mesos. However, until recently we weren’t fully satisfied with our monitoring and alerting solutions.

We became aware of Prometheus while researching how we could augment or replace our existing collection of tools. Prometheus has received considerable praise both in print and online, and it seemed like a good fit for our environment.

It is worth noting that Prometheus uses a pull-based approach, the pros and cons of which are well explained here. However, for infrequent or short-running services the Prometheus Pushgateway ensures that the platform can still support these kinds of tasks.
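To illustrate the push side, a short-lived job can register its metrics and push them to the gateway when it finishes. The sketch below is not from our codebase: it assumes the Prometheus Java simpleclient and its Pushgateway module (io.prometheus:simpleclient and io.prometheus:simpleclient_pushgateway) are on the classpath, and the job name, metric and gateway address are made up for the example.

import io.prometheus.client.{CollectorRegistry, Gauge}
import io.prometheus.client.exporter.PushGateway

object NightlyImportJob {
  def main(args: Array[String]): Unit = {
    val registry = new CollectorRegistry()
    val duration = Gauge.build()
      .name("nightly_import_duration_seconds")
      .help("Time taken by the nightly import job.")
      .register(registry)

    // Do the short-lived work and record how long it took.
    val start = System.nanoTime()
    runImport()
    duration.set((System.nanoTime() - start) / 1e9)

    // Push the collected metrics once the job finishes; Prometheus then
    // scrapes them from the Pushgateway on its normal schedule.
    new PushGateway("pushgateway.service:9091").pushAdd(registry, "nightly_import")
  }

  // Placeholder for the actual job logic.
  private def runImport(): Unit = ()
}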

Architecture

High Level Monitoring Architecture

At Football Radar we run one dockerised Prometheus instance per monitored service. As we use Marathon and Mesos there is minimal overhead in spinning up and managing the extra containers that come with this approach.

We have found that having service-specific instances brings a number of benefits. It allows us to store and maintain our alerting rules alongside our code in the repository. Restricting the work expected of each Prometheus instance keeps performance concerns to a minimum. Finally, individual instances can be worked on and maintained without impacting the rest of the monitoring system.

We use the Prometheus Alertmanager to send both email and Slack messages whenever problems are detected. Grafana is used for charts and visualisation; combined with support for DNS service discovery (using SRV records published by mesos-dns), this allows metrics to be automatically split by service instance.

Grafana Example
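For reference, the scrape configuration for one of these per-service Prometheus instances might look roughly like the sketch below. This is illustrative rather than our actual config: the job name, SRV record name and refresh interval are placeholders, with the SRV record assumed to be published by mesos-dns.

scrape_configs:
  - job_name: 'prometheus_demo'
    dns_sd_configs:
      - names: ['_prometheus-demo._tcp.marathon.mesos']
        type: 'SRV'
        refresh_interval: 30s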

Implementation

Almost all of our Scala services use TwitterServer, which provides all sorts of lovely functionality out of the box.

Unfortunately, TwitterServer doesn't expose its metrics in a format that Prometheus supports. However, as you can see in the code sample below, this problem is easily resolved by extending our simple service with a custom metrics exporter.

import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response, Status}
import com.twitter.finagle.stats.LoadedStatsReceiver
import com.twitter.server.TwitterServer
import com.twitter.util.{Await, Future}

object Daemon extends TwitterServer with PrometheusMetricsExporter {

  val receiver = LoadedStatsReceiver.scope("prometheus_demo")
  val requests = receiver.counter("http_requests")

  val httpService = new Service[Request, Response] {
    def apply(request: Request): Future[Response] = {
      requests.incr()
      val response = Response(request.version, Status.Ok)
      response.contentString = "Football Radar!"
      Future.value(response)
    }
  }

...

  def main(): Unit = {
    val server = Http.serve(":8888", httpService)
    // Block on TwitterServer's admin HTTP server to keep the process running.
    Await.ready(adminHttpServer)
  }
}
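The PrometheusMetricsExporter trait itself isn't shown above. Purely as a sketch of the idea rather than our actual implementation, an exporter mixed into a TwitterServer app could serve a metrics snapshot in the Prometheus text exposition format on a dedicated port, along these lines (the snapshot hook and port are assumptions for the example):

import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response, Status}
import com.twitter.util.Future

trait PrometheusMetricsExporter {
  // Assumed hook: in practice this would read a snapshot of the underlying
  // metrics registry; the wiring is omitted here as it depends on the
  // Finagle/TwitterServer version in use.
  protected def metricsSnapshot(): Map[String, Double] = Map.empty

  private val metricsService = new Service[Request, Response] {
    def apply(request: Request): Future[Response] = {
      // Prometheus metric names may only contain [a-zA-Z0-9_:], so
      // TwitterServer-style names such as "prometheus_demo/http_requests"
      // are normalised before being rendered one per line.
      val body = metricsSnapshot()
        .map { case (name, value) => s"${name.replaceAll("[^a-zA-Z0-9_:]", "_")} $value" }
        .mkString("\n")
      val response = Response(request.version, Status.Ok)
      response.contentString = body + "\n"
      Future.value(response)
    }
  }

  // Prometheus scrapes this endpoint; the port is an arbitrary choice here.
  val metricsServer = Http.serve(":9095", metricsService)
}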

For each service we have a .rules file that controls what our Prometheus instance should be on the lookout for and how it should respond. As well as simple rules around availability and memory usage, it is surprisingly easy to define more complex alerting criteria around custom statistics (an example of the latter follows the rule below).

ALERT InstanceDown  
    IF up == 0
    FOR 5m
    LABELS { severity = 'critical', team = 'football_radar' }
    ANNOTATIONS {
        summary = "Instance {{ $labels.instance }} Down",
        description = "{{ $labels.instance }}-{{ $labels.job }} down for 5m"
    }
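As an example of the latter (illustrative only, not one of our production rules), an alert on the demo counter from the code sample above might look like this, assuming the TwitterServer metric prometheus_demo/http_requests is exported to Prometheus as prometheus_demo_http_requests:

ALERT LowRequestRate
    IF rate(prometheus_demo_http_requests[5m]) < 1
    FOR 10m
    LABELS { severity = 'warning', team = 'football_radar' }
    ANNOTATIONS {
        summary = "Low request rate on {{ $labels.instance }}",
        description = "{{ $labels.instance }}-{{ $labels.job }} has served fewer than one request per second for 10m"
    }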

Conclusions

Prometheus has become a well-established part of our tech stack and looks set to stay that way for a long time to come. We'd certainly recommend it to other engineering teams looking to improve their monitoring, especially if they love any of the following:

  • Simple and precise alerting that enables a quick response to signs of trouble
  • Detailed metrics for both firefighting and postmortems
  • Great integrations with other tools
  • All alerting config in one place
  • Simple to test, deploy and maintain

A big thanks to Mike Cripps for driving the adoption of Prometheus at Football Radar.