More on Microservices

Key Characteristics of a Microservice

In the first blog in this series, I listed several important requirements of a microservice. Let’s take a deep dive into each of them here.

A microservice is an independently deployable and runnable unit.

One of the biggest advantages of a microservice architecture is that you can build, deploy and run each service independently of other services. Each should have its own CI (continuous integration) and deployment pipeline. For most of our microservices, the code for each service is in its own GitHub repository, and there is a dedicated CI and deployment pipeline for the code in that repo. That pipeline results in a single artifact that is deployed into our production environment. You could have a single code repo for all your services (we could debate endlessly over many repos vs. a single one), but if you do, each service should still have its own build, its own deployment pipeline and its own deployable artifact. We prefer the one-service-to-one-repo model, which allows for an internal open source model of submitters and committers.

It’s scoped to a functional area or business capability.

One of the most difficult parts of a microservice architecture is defining the “bounded context” for each of your services. Defining the bounded context needs to take into account both business and technical capabilities.

Any large software system has multiple models. When code from multiple models becomes combined and intertwined, things can get messy—resulting in more defects and slower delivery. Defining explicit boundaries between your models helps keep them cleaner, simpler and easier to understand.

The above diagram shows two bounded contexts for features within the CA Agile Central product. When we were designing the new Capacity Planning feature, we decided that its model was best kept separate from the existing ALM model since it had a well-defined bounded context. So we created a new microservice for the Capacity Planning feature. This allowed us to have a separate team dedicated to building the Capacity Planning service with little or no interruption to the ALM service team, and it allowed the PCP team to go faster. Also, the Capacity Planning service has very different scalability and capacity requirements from the ALM service. This was another reason to create it as a separate microservice.

It provides the exclusive path to its stored data via published APIs, either internal or external.

A microservice should be the exclusive reader and writer of its data source. Having more than one service manipulate the same data is a bad pattern. Why? Your stored data is no longer encapsulated; you’ve exposed it to everyone. If you change your data model, you will break other services that directly access that data. And if many other services directly access your service’s data, you won’t be able to change your data model without a large, coordinated effort.

Since a microservice is the exclusive path to its stored data, it needs to supply a well-defined API for that data. Other microservices should use this API to access your service’s data. You should design these APIs from the “outside-in,” not from the “inside-out.” That is, design your API for the services or customers that consume it, rather than simply exposing the insides of your service’s data model. Design and document your API as if you were releasing it to paying customers, even if it is restricted to internal use only.
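
To make “outside-in” concrete, here’s a minimal sketch (in Java) of what such an API could look like. The interface, method names and types are hypothetical, chosen to show operations named for what consumers want to do rather than for the tables behind the service:

```java
import java.util.List;
import java.util.UUID;

// Hypothetical "outside-in" API for a service like Capacity Planning:
// operations reflect consumer intent, not the internal data model.
public interface CapacityPlanningApi {
    /** Plans visible to the caller, not raw rows from an internal PLANS table. */
    List<PlanSummary> findPlansForProject(UUID projectId);

    /** One coarse-grained read instead of exposing internal joins. */
    PlanDetails getPlan(UUID planId);

    record PlanSummary(UUID id, String name) {}
    record PlanDetails(UUID id, String name, int teamCapacity) {}
}
```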

As far as storing data for a service, you’ve got a couple of options. One is to store your data in its own persistent-store instance (such as a database), separate from other services. Another option is to store your service’s data in the same instance of a persistent store that other services use, but take steps to ensure that only your service can access that data. This option screams for a database service that provides the appropriate security, tenancy and permissions. Choosing which one is best involves many factors, such as SLA requirements, dataset size, storage capacity and scalability needs. For example, services with different scalability needs should not share the same database instance, so that you can scale each of them differently.
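
As a small illustration of the second option, here’s a hedged Java sketch where a service connects with its own database account, scoped to its own schema inside a shared instance. The URL, account and schema names are made up for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Hypothetical example: each service gets its own database account whose
// permissions are limited to that service's schema, even though the
// database instance itself is shared with other services.
public class CapacityPlanningDataSource {
    // Assumed URL and account; only this service's account can read or
    // write the CAPACITY_PLANNING schema.
    private static final String URL  = "jdbc:oracle:thin:@db-host:1521/PROD";
    private static final String USER = "capacity_planning_svc";

    public static Connection connect(String password) throws Exception {
        // Requires the Oracle JDBC driver on the classpath at runtime.
        return DriverManager.getConnection(URL, USER, password);
    }
}
```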

When defining your microservice, determine which data storage option is best.

Shared code among microservices should be put in a shared library.

Don’t have multiple services sharing the same code. Put shared code in a library that can be shared among services. When your service has library dependencies, use explicit library versions like 1.2.1, not “LATEST” or “SNAPSHOT.” That lets you control when you accept changes to the shared code, and it results in more reproducible and reliable builds.

Availability and location should be discoverable by other microservices.

A microservice architecture is not a static environment. Instances of a running service come and go. As a result, in order for your microservice’s API to be consumed, it needs to be available and discoverable. Apache ZooKeeper is a good technology for providing service discovery. It keeps track of service nodes as they come and go, removing that burden from service consumers. Another option is to use a PaaS product with service discovery built in, like Red Hat’s OpenShift.
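
Here’s a minimal sketch of ZooKeeper-based registration using Apache Curator’s service discovery recipe (one common way to do it; the service name, address and paths below are assumptions, not our actual setup):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.x.discovery.ServiceDiscovery;
import org.apache.curator.x.discovery.ServiceDiscoveryBuilder;
import org.apache.curator.x.discovery.ServiceInstance;

// Illustrative registration with Curator's curator-x-discovery module.
public class Registration {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Register this instance; ZooKeeper's ephemeral nodes drop it
        // automatically if the process dies, so consumers never see it.
        ServiceInstance<Void> instance = ServiceInstance.<Void>builder()
                .name("capacity-planning")   // hypothetical service name
                .address("10.0.1.17")
                .port(8080)
                .build();

        ServiceDiscovery<Void> discovery = ServiceDiscoveryBuilder.builder(Void.class)
                .client(client)
                .basePath("/services")
                .thisInstance(instance)
                .build();
        discovery.start();

        // Consumers look instances up instead of hard-coding hosts:
        // discovery.queryForInstances("capacity-planning");
    }
}
```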

Microservice data is portable.

Ensure that your microservice data can be easily moved from one place to another. A “place” could mean:

  • From one database instance to another in the same datacenter
  • From one database shard to another
  • From one datacenter to another
  • From two different database instances into one (merging data)

This requires that all your data have a universally unique identifier (UUID), not one that’s just local to a specific database instance. For example, when we started out in 2003 we chose to use Oracle sequences for all of our object IDs (OIDs.) That worked fine when we had just a single SaaS product, but as time went on we added an on-premises product and another datacenter. Our OID solution wasn’t portable. We couldn’t easily move customer data from one Oracle instance to another. We had to create re-OIDer tools to do so, and for large customers the execution time is painfully long.

So we started using universally unique identifiers (128-bit UUIDs) for our object identity. They’re portable and allow data to be moved from place to place with little or no chance of object identifiers colliding.
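
In Java, generating such an identifier is a one-liner; a minimal sketch:

```java
import java.util.UUID;

// Minimal sketch: assign a globally unique, portable identifier at creation
// time instead of a database-local sequence value.
public class Artifact {
    // 128-bit random UUID; collision probability is negligible, so the
    // identifier stays valid across instances, shards and datacenters.
    private final UUID id = UUID.randomUUID();

    public UUID getId() {
        return id;
    }
}
```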

It must have a data recovery strategy with minimal or no data loss.

If your microservice contains customer data, then not losing that data is critical to your business. There are three data loss cases for which you need to have a data recovery strategy:

  1. Data loss due to a database crash
  2. Database corruption
  3. A malicious event or a non-malicious user error that causes data loss

CA Agile Central’s main service, the ALM WSAPI service, uses an Oracle database. We have two production datacenters, with an Oracle cluster in each, where one datacenter is “hot” and the other is “warm.” The “warm” datacenter is part of our disaster recovery strategy. Within each datacenter we have a primary, a warm standby and three read-only Oracle instances. Across datacenters we use Oracle’s Active Data Guard, so all production data in the “hot” datacenter is synchronized to the “warm” datacenter within seconds. We use Oracle’s Flashback Database feature, which gives us a fast way to rewind data back in time to correct problems caused by logical data corruption or user errors within a designated time window. For data losses outside the flashback window, we have a data backup and restore strategy. All of these mechanisms allow us to have a recovery point objective that is under two minutes within a datacenter and five minutes across datacenters.

It’s horizontally scalable.

Monolithic applications tend to scale vertically: adding more memory, more CPU and more disk space to a single instance of the application. In contrast, a microservice should handle growth in demand by adding more running instances of the service, all running with the same memory and CPU profiles. This is referred to as horizontal scaling, and it allows microservices to be run on public cloud infrastructure (cheap commodity hardware.) Horizontal scaling is easier to achieve with stateless services, so keep state out of your service to ease scaling. For our large ALM WSAPI service we now scale up to 16 production service instances to handle the traffic. If demand grows, we just add more instances of the service to the production environment.
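
A quick sketch of why state matters here. The class and interface below are hypothetical; the point is that state held in one instance’s memory pins work to that instance, while state behind a shared store lets any instance serve any request:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// ANTI-PATTERN for horizontal scaling: session state trapped in this JVM's
// heap, so a user's requests must always land on this one instance.
class StatefulHandler {
    private final Map<String, String> sessions = new ConcurrentHashMap<>();
}

// Keeping state behind an interface backed by a shared store (a database,
// or something like Redis -- an assumption here) means any of the N
// identical instances behind the load balancer can serve any request.
interface SessionStore {
    void put(String sessionId, String value);
    String get(String sessionId);
}
```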

It has published SLAs for uptime and performance.

When defining your microservice, determine upfront what the Service Level Agreement (SLA) for your production service is going to be. The SLA will affect the design and implementation of your microservice, so settle on it before you start designing and implementing.

SLAs usually include uptime and response-time levels (if possible, specify P50, P75, P90 and P99). Not all services will have the same SLA levels: more critical services will have higher SLA levels than non-critical ones. For example, our critical authentication service (Zuul) has one of the highest SLA requirement levels, since almost every other service depends on it. It specifies a 20ms response time for internal network requests and 99.9% uptime, whereas some of our other services’ SLAs have a 99.5% uptime goal.
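
As an aside, percentile levels like P99 are easy to compute from a latency sample. This is an illustrative Java sketch (nearest-rank method, made-up sample data), not our metrics pipeline:

```java
import java.util.Arrays;

// Illustrative only: computing latency percentiles from a sample of
// response times, to report against an SLA.
public class Percentiles {
    // Nearest-rank percentile over a sorted sample.
    static long percentile(long[] sortedMillis, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedMillis.length) - 1;
        return sortedMillis[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        long[] latencies = {12, 15, 18, 20, 22, 25, 30, 45, 80, 200}; // sample data
        Arrays.sort(latencies);
        System.out.println("P50: " + percentile(latencies, 50) + " ms"); // 22 ms
        System.out.println("P99: " + percentile(latencies, 99) + " ms"); // 200 ms
    }
}
```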

Microservices by their nature are part of a larger product—so you may also want to make their SLAs internal-only, since the product itself most likely has an external SLA with its customers.

It has a defined failover strategy so it’s highly available.

In order to meet your uptime SLA you should have more than one instance of your service running in production. When one instance fails, a failover mechanism should route traffic to another instance of your service. For each service, we have a minimum of three instances running in production. Use a PaaS and/or load balancers (like HAProxy) to handle both load distribution and failover.

It provides an administrative API.

In addition to a consumer API, your microservice architecture should define a standard set of administration APIs—such as start, stop and health check. Standardizing on these will make automation and tool creation easier. The health check endpoint is a very important one: alerting systems like Nagios use it to notify our Operations team when a service is unhealthy, and HAProxy uses it to determine whether a node can stay in its load-balancing rotation.
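
For illustration, here’s a minimal health-check endpoint using the JDK’s built-in HTTP server. The path, port and health logic are assumptions; the point is that every service exposes the same endpoint so monitoring and load balancing can treat them uniformly:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Sketch of a standardized health-check endpoint on an admin port.
public class AdminEndpoints {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/health", exchange -> {
            boolean healthy = isHealthy();
            byte[] body = (healthy ? "OK" : "UNHEALTHY").getBytes();
            // 200 keeps the node in rotation; 503 tells HAProxy to pull it
            // and lets Nagios raise an alert.
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }

    static boolean isHealthy() {
        // Check real dependencies here: database connectivity, queue depth, etc.
        return true;
    }
}
```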

It has distributed traceability.

Any SaaS product needs excellent production metrics. We’ve found that distributed tracing is key to understanding what’s going on in our complex product in production. A microservice architecture is one where it takes many services to handle a single user request, and those services may be implemented using different programming languages and frameworks. As a result, you need to be able to trace, log and visualize metrics for a user’s request for all the microservices in that request’s call path.

We built our own distributed tracing framework, but there are several out there, like Twitter’s Zipkin and Google’s Dapper. Our implementation is based on Google’s Dapper framework and includes traceIDs and spanIDs. Each user request is given a traceID. Each microservice in the call path is considered a “span.” Every span is invoked with the traceID and its parent’s spanID. When logging metrics, a microservice writes out the traceID, the parent spanID and its own spanID. Each microservice sends the traceID and its own spanID to the next service in the call path. With this we can track a user request throughout the entire distributed system.
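
To show the mechanics, here’s an illustrative Java sketch of propagating trace context on an outgoing call. The header names and URL are assumptions, not our framework’s actual wire format:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative Dapper-style trace propagation on an outgoing HTTP call.
public class TracePropagation {
    // Called while this service handles a request that arrived with traceId;
    // mySpanId identifies this service's span within that trace.
    public static void callDownstream(String traceId, String mySpanId) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://next-service/api/work").openConnection(); // hypothetical service
        conn.setRequestProperty("X-Trace-Id", traceId);        // constant for the whole user request
        conn.setRequestProperty("X-Parent-Span-Id", mySpanId); // the next service's parent span
        // The next service generates its own spanId on receipt and logs
        // traceId + parent spanId + its own spanId with every metric.
        System.out.println("status=" + conn.getResponseCode()
                + " trace=" + traceId + " parentSpan=" + mySpanId);
    }
}
```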

It can handle a network partition (has partition tolerance.)

A network partition is a failure where one part of a distributed system cannot reach another part due to a failure in a network device, which causes the network to be split. A system with partition tolerance is one that can handle a network partition without the entire system failing.

In a distributed microservice architecture, a network partition results in some service nodes not being able to communicate with other service nodes in the system. You should design your service with partition tolerance in mind. This means that the system should continue to work as best as possible if some part of the system has a network partition.

If you’re a believer in the CAP theorem, you know that your distributed system can provide, at most, two of the following three guarantees:

  • Consistency (all data in the system is consistent at the same time)
  • Availability (a guarantee that every request receives a response)
  • Partition Tolerance (the system continues to operate despite a network partition)

Partition tolerance in a microservice architecture includes not having any single points of failure (SPoF.) Build redundancy into your microservice system! It also means designing your services not to block indefinitely when making synchronous HTTP calls to dependent services (efferent coupling; outgoing dependencies.) Better yet, eliminate synchronous efferent coupling, or keep it to a minimum.
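
For example, here’s a minimal Java sketch of bounding a synchronous call so a partition can’t hang the caller (the URL and timeout values are assumptions):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Bound every synchronous call to a dependency so that a partitioned or
// slow service can't block this one indefinitely.
public class BoundedClient {
    public static int fetchDependencyStatus() throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://user-service/api/status").openConnection(); // hypothetical service
        conn.setConnectTimeout(500); // fail fast if the service is unreachable
        conn.setReadTimeout(2000);   // never wait forever for a response
        try {
            return conn.getResponseCode();
        } catch (IOException e) {
            // Degrade gracefully (cached data, a default) instead of
            // failing the whole user request.
            return -1;
        }
    }
}
```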

The End?

Well, not really. Microservices are a fairly new concept. Their definition is evolving. New discoveries are made every day through experimentation. So this isn’t the end but the beginning of a new era of microservices.