Ease Monitor is designed around the following principles:
Focus on the SLA. Monitor the APIs consumed by end users.
Metrics Aggregation. Connect infrastructure and middleware metrics with application metrics.
Quick Fault Location. Failures always happen; quickly locating a fault and recovering from it is the key.
In other words, Ease Monitor is designed for two major scenarios:
Capacity Management. By comparing data trends, it helps the engineering team decide whether to add more resources.
Performance Management. Manage the performance of the application stack and make sure every piece of the stack works well in production.
Locating the Failure. Once a failure or exception happens, it helps developers find the root location of the failure quickly.
Performance Analysis. It helps find software bottlenecks and hot spots so that developers can dive into the code.
The following is a common case: a slow SQL query or a Java Full GC can make the whole site run very slowly.
Ease Monitor is a kind of APM (Application Performance Management), but it is a bit different from traditional APM software.
There are two aspects that impact the design of the
Different Engineering Angles. A company has several engineering roles, and each looks at the whole system from a different angle. For example,
No reinventing the wheel. Developing a monitoring system looks like reinventing another wheel. So we do not want to reinvent everything, and need to make use of the
Ease Monitor is open and flexible enough to be compatible with the current mainstream monitoring technologies.
Ease Monitor has the following design principles:
Using Mainstream Technology. Most engineering teams in the world can operate and maintain it.
Every Component can be Replaced or Tailored. People have different requirements and businesses, so the design must give enough flexibility that anyone can modify it easily.
Tracing the Service Requests. The monitor must trace requests across the distributed system from end to end.
Guiding the Engineering. The monitor must guide engineers in at least two things: 1) easily addressing issues, 2) easily making engineering decisions.
Leverage the Automation. The monitor can connect with other control systems to perform automated operations, such as auto-scaling and auto-scheduling.
The Whole Stack Metrics Monitoring. We must monitor the three layers of software:
Customized Dashboard. The dashboard can be configured by anyone according to their own interests.
The whole architecture is based on open source technology.
The whole architecture can not only monitor big clusters; every component can also be flexibly replaced or tailored.
Currently, Ease Monitor only supports
The Overview dashboard shows the overall health and capacity.
The following diagram shows the daily SLA report, which can cover the whole site or an individual service.
The following Service Dashboard shows the service traffic, the upstream and downstream services, the top APIs, the top 5 slowest traced requests, and the related resources and metrics.
The real-time topology helps us understand the architecture of the services.
The tracing helps us understand the chains of service calls and their performance.
Top N lists show the operations or APIs that consume the most time.
Service Top API List
JDBC Top Operation List
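A Top N list of this kind can be sketched as grouping call records by name and sorting by the total time consumed. This is only an illustration: the record layout, the sample values and the function name `top_n_by_total_time` are assumptions, not Ease Monitor's actual data model or code.

```python
from collections import defaultdict


def top_n_by_total_time(records, n=5):
    """Return up to n (name, total_ms, call_count) tuples, largest total first.

    records is an iterable of (name, duration_ms) pairs, where name is an
    API or JDBC operation (hypothetical layout, for illustration only).
    """
    totals = defaultdict(lambda: [0.0, 0])  # name -> [total_ms, call_count]
    for name, duration_ms in records:
        totals[name][0] += duration_ms
        totals[name][1] += 1
    ranked = sorted(
        ((name, total, count) for name, (total, count) in totals.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:n]


# Hypothetical call records: (operation, duration in milliseconds).
SAMPLE_RECORDS = [
    ("GET /orders", 120.0),
    ("GET /orders", 80.0),
    ("POST /pay", 300.0),
    ("GET /health", 2.0),
]
```

Calling `top_n_by_total_time(SAMPLE_RECORDS, 2)` ranks `POST /pay` ahead of `GET /orders`, because a single slow call outweighs two faster ones.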
The customized dashboard.
Spring Boot 2.2.x:
RabbitMQ Client 5.x
Kafka Client 2.4.x
Spring Cloud Gateway
readiness check endpoint for
StatementMetrics, and related context information (such as URL, SQL statement, etc.)
Zipkin protocol to trace the distributed services, which includes:
Download easeagent.jar from the release, and simply add the following arguments when running the Java application:
Ease Monitor event handling deals with the following cases.
Metric - Duration - Threshold. A metric keeps exceeding the threshold for a certain duration (e.g. CPU > 80% lasting 2 minutes).
Metric - Duration - Percentile - Threshold. A percentile of a metric (e.g. P99) exceeds the threshold for a certain duration (e.g. response time P90 > 300 ms lasting 2 minutes).
Metric - Duration - Function - Threshold. Supports some simple functions (Sum/Average/Min/Max/Count) to trigger the event.
Logs - Duration - Keywords - Times. Monitors a certain keyword (regular expressions are supported); if the keyword is matched the configured number of times, the event is reported.
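The four rule shapes above can be sketched as small predicates over a time window. This is a minimal illustration under stated assumptions, not Ease Monitor's rule engine: samples are assumed to be a non-empty, time-ordered list of `(timestamp_seconds, value)` pairs, the percentile is assumed to be computed over the samples in the window with the nearest-rank method, log lines are assumed to be already restricted to the window, and all function names are hypothetical.

```python
import math
import re
import statistics

# Simple functions named in the rule description (keys are hypothetical).
FUNCTIONS = {
    "sum": sum,
    "average": statistics.fmean,
    "min": min,
    "max": max,
    "count": len,
}


def window_values(samples, duration_s):
    """Values whose timestamp lies within duration_s of the latest sample."""
    end = samples[-1][0]
    return [value for ts, value in samples if ts >= end - duration_s]


def percentile(values, p):
    """Nearest-rank percentile (0 < p <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def threshold_rule(samples, threshold, duration_s):
    """Metric - Duration - Threshold: every sample in the window is above
    the threshold, and the history is long enough to cover the window."""
    if not samples or samples[0][0] > samples[-1][0] - duration_s:
        return False
    return all(v > threshold for v in window_values(samples, duration_s))


def percentile_rule(samples, p, threshold, duration_s):
    """Metric - Duration - Percentile - Threshold."""
    return percentile(window_values(samples, duration_s), p) > threshold


def function_rule(samples, func, threshold, duration_s):
    """Metric - Duration - Function - Threshold."""
    return FUNCTIONS[func](window_values(samples, duration_s)) > threshold


def keyword_rule(log_lines, pattern, times):
    """Logs - Duration - Keywords - Times: at least `times` lines match."""
    regex = re.compile(pattern)
    return sum(1 for line in log_lines if regex.search(line)) >= times


# Hypothetical CPU samples, one every 30 seconds: (seconds, percent).
CPU = [(0, 70), (30, 85), (60, 90), (90, 88), (120, 92)]
```

With these samples, `threshold_rule(CPU, 80, 90)` fires because the last 90 seconds are all above 80%, while `threshold_rule(CPU, 80, 120)` does not, since the oldest sample in that window is 70%.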
The following data schema is used for Elasticsearch storage.
| Index mapping template | Index pattern | Description |
|---|---|---|
| ease-monitor-metrics-* | ease-monitor-metrics-YYYY.MM.DD | Saves time-series metrics of monitored objects from different categories. The metrics from each kind of monitored object are saved into a dedicated document type. |
| ease-monitor-aggregate-metrics-* | ease-monitor-aggregate-metrics-YYYY.MM.DD | Saves calculated performance statistics for the different dimensions that monitoring requires. The statistics from each dimension are saved into a dedicated document type. Because the statistics are calculated directly on the input metrics as a stream and the results are saved into this index in advance, they can be loaded and used without any further aggregation (e.g. grouping and computing). This helps the performance of ad-hoc queries on the fine-grained metrics stored in ES, especially with a large volume of metrics data. This index is designed to save only those statistics that can be calculated by simple (fast) and fixed (implementable at the product design stage instead of at runtime) functions. |
| ease-monitor-logs-* | ease-monitor-logs-YYYY.MM.DD | Saves the logs output by the OS, middleware and applications. Each kind of log is saved into a dedicated document type. |
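The daily index pattern in the table can be illustrated with a small helper that builds an index name for a given date and checks it against its mapping template. This is a sketch only: the helper names are hypothetical, and the source does not say whether the date is taken in UTC or local time, so the caller supplies it.

```python
from datetime import date
from fnmatch import fnmatch

# Index name prefixes taken from the table above.
INDEX_PREFIXES = (
    "ease-monitor-metrics",
    "ease-monitor-aggregate-metrics",
    "ease-monitor-logs",
)


def daily_index_name(prefix, day):
    """Build '<prefix>-YYYY.MM.DD' for the given date (hypothetical helper)."""
    if prefix not in INDEX_PREFIXES:
        raise ValueError(f"unknown index prefix: {prefix}")
    return f"{prefix}-{day:%Y.%m.%d}"


def matches_template(index_name, prefix):
    """True if index_name is covered by the '<prefix>-*' mapping template."""
    return fnmatch(index_name, f"{prefix}-*")


EXAMPLE_INDEX = daily_index_name("ease-monitor-metrics", date(2020, 3, 7))
```

Here `EXAMPLE_INDEX` is `ease-monitor-metrics-2020.03.07`, which the `ease-monitor-metrics-*` template covers.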
The document type schema includes the following:
Index mapping template
ease-monitor-metrics-* - for metrics data
ease-monitor-aggregate-metrics-* - for Java agent metrics data
ease-monitor-logs-* - for logs
application - for Java agent metrics data.
platform - for a number of middleware metrics, such as:
| Index mapping template | Category | Document type | Description |
|---|---|---|---|
| ease-monitor-metrics-* | application | http_request | Saves application HTTP request records, which contain the URL address and parameters, execution duration, response code and other useful fields. |
| ease-monitor-metrics-* | platform | jvm_memory | Saves JVM performance counters and statistics for the heap, non-heap and each memory space. |
| ease-monitor-metrics-* | platform | jvm_gc | Saves JVM performance counters and statistics for the garbage collector. |
| ease-monitor-metrics-* | platform | tomcat_global | Saves the performance counters and statistics of the global request processor and thread pool. |
| ease-monitor-metrics-* | platform | tomcat_cache | Saves the performance counters and statistics of each context cache. |
| ease-monitor-metrics-* | platform | tomcat_servlet | Saves the performance counters and statistics of each servlet. |
| ease-monitor-metrics-* | platform | nginx | Saves nginx performance counters and statistics. |
| ease-monitor-metrics-* | platform | mysql | Saves MySQL performance counters and statistics. |
| ease-monitor-metrics-* | platform | redis_server | Saves Redis server performance counters and statistics. |
| ease-monitor-metrics-* | platform | redis_keyspace | Saves Redis key space performance counters and statistics. |
| ease-monitor-metrics-* | infrastructure | cpu | Saves the percentage utilization of a specific logical core. |
| ease-monitor-metrics-* | infrastructure | memory | Saves the percentage utilization and the capacity in bytes. |
| ease-monitor-metrics-* | infrastructure | interface | Saves the performance counters and statistics for each interface separately (excluding the ‘lo’ loopback device), e.g. tx and rx bytes. |
| ease-monitor-metrics-* | infrastructure | disk | Saves the performance counters and statistics for each block device separately, e.g. IOPS, MB/s (a busy percentage indicator will be added in the future). |
| ease-monitor-metrics-* | infrastructure | df | Saves the utilization counters for each block device. |
| ease-monitor-aggregate-metrics-* | application | http_request | Saves the calculated rates of separate and total executions per second over 1, 5 and 15 minutes. The request count is saved as well. |
| ease-monitor-aggregate-metrics-* | application | jdbc_statement | Saves the calculated rates of separate and total executions per second over 1, 5 and 15 minutes. Also saves the minimum, mean, maximum and the 25%, 50%, 75%, 95%, 98%, 99%, 99.9% percentiles of the user’s execution duration. The execution count is saved as well. |
| ease-monitor-aggregate-metrics-* | application | jdbc_connection | Saves the calculated rates of database connection establishment per second over 1, 5 and 15 minutes. Also saves the minimum, mean, maximum and the 25%, 50%, 75%, 95%, 98%, 99%, 99.9% percentiles of the user’s connection establishment duration. The establishment count is saved as well. |
| ease-monitor-logs-* | application | | Saves log records collected from the application’s components. |
| ease-monitor-logs-* | platform | tomcat_exception | Saves the exception messages of the stack. |
| ease-monitor-logs-* | platform | nginx_access | Saves HTTP access records from the nginx access log. |
| ease-monitor-logs-* | platform | nginx_error | Saves error records from the nginx error log. |
| ease-monitor-logs-* | platform | mysql_slow_sql | Saves slow SQL records from the MySQL log. |
| ease-monitor-logs-* | infrastructure | os_syslog | Saves log records from the OS ‘syslog’ file. |
| ease-monitor-logs-* | infrastructure | os_dmesg | Saves log records from the OS ‘dmesg’ file. |
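The statistics kept in the aggregate metrics index (count, minimum, mean, maximum, the listed percentiles, and per-second rates over 1, 5 and 15 minutes) can be illustrated with a short sketch. The source does not say how Ease Monitor computes them, so everything here is an assumption: percentiles use the nearest-rank method, rates are plain averages over the trailing window, and all names and field keys are hypothetical.

```python
import math

# Percentiles listed for jdbc_statement and jdbc_connection.
PERCENTILES = (25, 50, 75, 95, 98, 99, 99.9)


def nearest_rank(ordered, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_durations(durations_ms):
    """Count, min, mean, max and percentiles of a non-empty duration list."""
    ordered = sorted(durations_ms)
    summary = {
        "count": len(ordered),
        "min": ordered[0],
        "mean": sum(ordered) / len(ordered),
        "max": ordered[-1],
    }
    for p in PERCENTILES:
        summary[f"p{p}"] = nearest_rank(ordered, p)
    return summary


def rates_per_second(timestamps_s, now_s, windows_min=(1, 5, 15)):
    """Average executions per second over each trailing window (in minutes)."""
    rates = {}
    for minutes in windows_min:
        start = now_s - minutes * 60
        hits = sum(1 for ts in timestamps_s if start < ts <= now_s)
        rates[minutes] = hits / (minutes * 60)
    return rates


# Hypothetical data: durations of 1..100 ms and a few execution timestamps.
SUMMARY = summarize_durations(range(1, 101))
RATES = rates_per_second([30, 59, 100, 890], now_s=900)
```

Because values like these are computed once as the metrics stream in, a dashboard query can read them directly instead of aggregating raw records at query time, which is the benefit the index description points to.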