In distributed computing systems, a service subscriber comprises a service/network application that wants to use another service, referred to as a service publisher, that exports some functionality desired by the service subscriber. Service discovery in large-scale distributed systems is generally directed towards matching a subscribing service to the correct version of a published service.
To facilitate discovery, the service publisher may publish relevant information, including its functionality and version, in a service repository (or registry). Alternatively, a subscriber may poll existing services or some well-known service for the targeted endpoint. The service repository generally comprises a dynamic, highly available, distributed associative database that stores metadata about healthy, available published services. Once a publisher is discovered by a subscriber, the services interact with each other's APIs via endpoints. Endpoint resolution refers to the process of mapping a service name to a compatible endpoint using the associative database in the repository, and is followed by binding to the endpoint and endpoint activation.
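By way of a non-limiting illustration, the publish and resolve steps described above may be sketched as follows. The class and method names (`Endpoint`, `ServiceRepository`, `publish`, `resolve`), the addresses, and the rule that an endpoint is "compatible" when its major version matches are all assumptions made for this sketch; the description above does not prescribe any particular data model or compatibility rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int
    version: tuple  # (major, minor); hypothetical versioning scheme


class ServiceRepository:
    """In-memory stand-in for the associative database (illustrative only)."""

    def __init__(self):
        # Maps a service name to the endpoints published under that name.
        self._entries = {}

    def publish(self, name, endpoint):
        """Record an endpoint and its metadata under a service name."""
        self._entries.setdefault(name, []).append(endpoint)

    def resolve(self, name, required_major):
        """Endpoint resolution: map a service name to a compatible endpoint.

        Compatibility is assumed here to mean an equal major version.
        Returns None when no compatible endpoint is registered.
        """
        for endpoint in self._entries.get(name, []):
            if endpoint.version[0] == required_major:
                return endpoint
        return None


repo = ServiceRepository()
repo.publish("billing", Endpoint("10.0.0.5", 8080, (1, 4)))
repo.publish("billing", Endpoint("10.0.0.6", 8080, (2, 0)))

# A subscriber that needs major version 2 of the "billing" service.
resolved = repo.resolve("billing", required_major=2)
```

In this sketch the subscriber would next bind to `resolved` and activate the endpoint; those steps are omitted because they depend on the transport in use.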
In most modern scenarios, the repository database is relatively static, comprising a write-once, read-many database that is batch loaded infrequently but read from heavily. In a large distributed system comprising on the order of hundreds of thousands of service instances, services tend to fail relatively often. Moreover, the set of available publisher service instances may be intentionally changed, for example by auto-scaling policies of the distributed infrastructure. The infrastructure may scale instances up or down in proportion to workload increases and decreases, respectively. As a result, services tend to be constantly deployed, redeployed, or rolled back, and both ephemeral and persistent network partitions are common.
In such an environment, static or near-static repositories tend to represent a past state of the services in the distributed system rather than the current one. Endpoint mappings in the repository may therefore be stale, leading to false positives during endpoint resolution, in which a service name resolves to an endpoint that is no longer available. Such false positives lead to connection failures, retries, and cascading failures, further increasing the entropy and network load in the distributed system.
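The staleness problem may be illustrated with a minimal sketch in which a repository snapshot loaded at an earlier time is consulted after the publisher has been redeployed to a different address. The function names, the service name, and the addresses are hypothetical and are used only to make the false positive concrete.

```python
def resolve(snapshot, name):
    """Resolve a service name against a previously loaded repository snapshot."""
    return snapshot.get(name)


def is_false_positive(endpoint, live_endpoints):
    """True when resolution returned an endpoint that is no longer live."""
    return endpoint is not None and endpoint not in live_endpoints


# Snapshot batch-loaded before the publisher was redeployed (hypothetical data).
snapshot = {"billing": "10.0.0.5:8080"}

# Current state of the system: the instance now runs at a different address.
live_endpoints = {"10.0.0.9:8080"}

endpoint = resolve(snapshot, "billing")
stale = is_false_positive(endpoint, live_endpoints)
```

Here resolution succeeds against the repository, yet any connection attempt to the returned endpoint would fail, which is the situation that triggers the retries described above.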