Testing and Verification
Cogility recommends three stages of testing when modeling in Cogynt:
- Functional testing
- Use case testing
- Deployment configuration testing
Each of these stages is described in detail under the corresponding topic heading.
Functional Testing
A project usually consists of separate functional groups of patterns that interact with each other to create the desired outcome. A functional group may even be further broken down into smaller groups. When modeling, it is advisable to build one small functional unit at a time, and then test it before moving on to the next. A functional unit usually consists of anywhere from one to five interrelated patterns.
The idea is to isolate and fix any basic functional issues early in the development cycle before the project expands and makes them harder to find. Waiting until the entire project has been built before testing makes it more difficult and time-consuming to track down these issues.
Because a functional group can be directly dependent on the output of another group, it is prudent to start building and testing from the lowest level in the hierarchy and build upwards.
Prepare a test data set to test each functional unit. The data should be generated in a way that allows the models to be tested against a desired outcome. In some cases, if a pattern is particularly complex or plays a more significant role in the group, it may be necessary to test at the pattern level as well.
In most cases, data sets for functional testing can be randomly generated (for more information, see Generating Data). However, special edge cases should be manually included for optimal coverage.
Once the function of the low-level groups has been validated, their output can be used as the test set for the group downstream.
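For example, a small functional test set might be assembled with a script along these lines (a minimal sketch in Python; the field names, event types, and edge cases are hypothetical and would come from your own schemas and pattern logic):

```python
import json
import random
import uuid
from datetime import datetime, timedelta, timezone

def random_event(event_type):
    """Generate one synthetic input event. Field names are illustrative only."""
    return {
        "id": str(uuid.uuid4()),
        "type": event_type,
        "timestamp": (datetime.now(timezone.utc)
                      - timedelta(minutes=random.randint(0, 1440))).isoformat(),
        "amount": round(random.uniform(1, 10_000), 2),
    }

# The bulk of the set is randomly generated...
events = [random_event("transaction") for _ in range(500)]

# ...but edge cases the patterns must handle are added by hand.
events.append({**random_event("transaction"), "amount": 0.0})         # zero amount
events.append({**random_event("transaction"), "amount": 999_999.99})  # extreme value

with open("functional_test_set.json", "w") as f:
    json.dump(events, f, indent=2)
```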
Use Case Testing
Once the entire project has been built and each sub-component has been tested separately, it is time to test the project as a whole to confirm that the sub-components work together correctly to produce the desired outcome. This includes testing within the data layer and testing the visualization applications (such as Cogynt Analyst Workstation or any other custom dashboard).
As in functional testing, a dataset is required to validate the entire model. The test may use randomly generated data and may even reuse some of the test sets already generated for functional testing. However, it is highly recommended that a separate test set be specifically designed to mimic expected use cases on a small scale.
To test the accuracy of the entire model, simply run each test case through the patterns and verify that the final output is as expected.
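If the final output records can be exported (for example, from the output topic or through the visualization layer), this comparison can be automated with a small script such as the following sketch. The file names and the key field are placeholders; they depend on how your project identifies its results:

```python
import json

def verify_use_case(expected_path, actual_path, key_field="id"):
    """Compare expected and actual output records on a shared key field."""
    with open(expected_path) as f:
        expected = {rec[key_field]: rec for rec in json.load(f)}
    with open(actual_path) as f:
        actual = {rec[key_field]: rec for rec in json.load(f)}

    return {
        "missing": sorted(expected.keys() - actual.keys()),     # expected but never produced
        "unexpected": sorted(actual.keys() - expected.keys()),  # produced but not predicted
        "mismatched": sorted(k for k in expected.keys() & actual.keys()
                             if expected[k] != actual[k]),
    }

# Example: report = verify_use_case("expected_outcomes.json", "exported_output.json")
```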
Accuracy is not the only goal when testing at this stage. Consider how the results will be consumed and what they will ultimately be used for, then manually test every aspect to make sure the final data output satisfies those needs.
Suppose that you have a project with the following goals:
- The results will be consumed by Cogynt Analyst Workstation.
- The data will be used to create notifications to support a workflow that assists analysts in investigating and creating collections of cases.
In this case, it is essential to test the following:
- Confirm proper data ingestion.
- Verify that all the necessary information is provided in each notification for investigation.
- Confirm that link analysis and drilldown work as intended.
- Test additional functionalities as required.
Similarly, if the data is intended to be shown graphically on a dashboard, make sure the data is structured in a manner that is compatible with the visualization tool. Test to make sure the visualization reflects the expected results for the pre-generated use cases.
Deployment Configuration Testing
After the model has been thoroughly tested for accuracy, and every use case has been validated, it is time to scale.
Resource needs can vary drastically between patterns. The two main factors that affect resource needs are typically:
- Data flow (i.e., volume of data per unit time).
- Complexity (i.e., number of constraints per pattern and computational load).
Optimizing deployments is both a skill and an art. There is no single correct way to set up deployments. That said, configuring deployments involves several considerations:
Grouping Patterns
When grouping patterns into deployments, it is best to deploy patterns that serve similar functional roles together.
That said, it is important to limit each deployment to roughly 150 tasks. This number can be computed by dividing the number of tasks shown in the Job Overview section of the Flink dashboard by the parallelism configured for the deployment.
If there are too many tasks, break the deployed patterns into two or more deployments.
A larger number of tasks requires more memory and CPU assigned to the task manager.
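As a quick sanity check, the task count can be estimated from the dashboard figures as follows (the numbers are made up for illustration):

```python
# Figures read from the Flink dashboard and the deployment configuration
# (illustrative numbers only).
tasks_in_job_overview = 480   # total tasks shown in the Job Overview section
parallelism = 4               # parallelism configured for the deployment

tasks_per_deployment = tasks_in_job_overview / parallelism
print(tasks_per_deployment)   # 120.0 -- under the ~150 guideline, so no split is needed
```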
Configuring Resources
When running large patterns or patterns with large amounts of data, it is important to tune the deployment so that the patterns run efficiently.
Adjusting Job and Task Managers
Cogynt uses the Apache Flink streaming engine to run the patterns. All Flink deployments have two components:
- The job manager, which manages the deployment.
- One or more task managers that run the deployed patterns.
The amount of memory and CPU assigned to the job manager (jobManager.memory and jobManager.cpu) is proportional to the number of task managers. For small deployments with a few task managers, the default settings should suffice.
The amount of memory and CPU assigned to each task manager (taskManager.memory and taskManager.cpu) is proportional to the number of tasks running in each task manager and the number of slots (parallel instances of the job that run in each task manager). The larger the number of tasks or slots, the more memory and CPU must be assigned to the task manager.
Each task manager is a process that runs on a node in the Kubernetes cluster. Therefore, the maximum amount of CPU and memory that can be assigned to each task manager is limited by the CPU and memory available on the node.
If the configured CPU and memory are not available on any of the nodes in the cluster, the job will not deploy.
Typically, it is not recommended to assign more than 32GB of memory and 16 CPUs per task manager.
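The constraints above can be captured in a quick sizing check like the following sketch (the node capacity and requested values are placeholders, not recommendations for any particular workload):

```python
def check_task_manager_resources(cpu, memory_gb, node_cpu, node_memory_gb):
    """Return warnings for a proposed taskManager.cpu / taskManager.memory setting."""
    warnings = []
    if cpu > node_cpu or memory_gb > node_memory_gb:
        warnings.append("Exceeds node capacity: the job will not deploy.")
    if memory_gb > 32 or cpu > 16:
        warnings.append("Above the suggested 32 GB / 16 CPU per task manager.")
    return warnings

# Example with placeholder values: an 8 GB / 4 CPU task manager on a 64 GB / 16 CPU node.
print(check_task_manager_resources(cpu=4, memory_gb=8, node_cpu=16, node_memory_gb=64))  # []
```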
Adjusting Parallelism
Another way to increase the performance of patterns is to increase the parallelism. The maximum parallelism is determined by the number of task managers multiplied by the number of slots per task manager.
In general, it is recommended that the number of slots be limited to no more than four or five for small patterns, and no more than two for larger deployments.
The suggested way to increase parallelism is to increase the number of task managers. Increasing the number of task managers uses resources more efficiently by spreading the load across all the nodes in the cluster.
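The relationship is easy to verify (illustrative numbers only):

```python
# Maximum parallelism = task managers x slots per task manager (illustrative numbers).
task_managers = 6
slots_per_task_manager = 2    # keep slots low; scale out by adding task managers instead

max_parallelism = task_managers * slots_per_task_manager
print(max_parallelism)        # 12
```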
Adjusting Flink Performance
Flink has many additional knobs for tuning performance. (For more information, refer to the official Apache Flink documentation.) The deployment UI allows fine-tuning your deployment using these knobs. The most relevant ones are available in the deployment configuration dropdown list for convenience, along with additional information on how to configure them.
Monitoring Bottlenecks
Any complex pattern is bound to have some bottlenecks, especially when large volumes of data are run through it.
Identifying Bottlenecks
It is important to monitor running patterns to determine where bottlenecks exist and to optimize the pattern implementation to reduce them and improve performance. (In other words, increase throughput while trying to reduce resource utilization.)
To monitor bottlenecks, the Flink dashboard provides a task-level view of the running pattern that shows which tasks in the pattern are consuming high CPU, potentially resulting in back pressure that slows or blocks upstream tasks. A detailed explanation on how to monitor Flink job performance is provided in Monitoring Back Pressure in the official Flink documentation.
Slow tasks in the graph can cause the entire deployment to run more slowly. They can also cause checkpointing (saving the state so that the deployment can resume from where it left off on a restart) to run slowly, which may result in the deployment failing and repeatedly restarting.
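Where it is convenient to script this check, the same back pressure information can also be pulled from the Flink REST API, assuming it is reachable in your environment (a rough sketch; the host, port, and job ID are placeholders, and the endpoints are documented in the Flink REST API reference):

```python
import requests

FLINK_REST = "http://flink-jobmanager:8081"   # placeholder address of the Flink REST endpoint
JOB_ID = "<your-job-id>"                      # placeholder; running jobs are listed at GET /jobs

# Each vertex in the job graph corresponds to a task (or chained group of tasks).
vertices = requests.get(f"{FLINK_REST}/jobs/{JOB_ID}").json()["vertices"]

for vertex in vertices:
    bp = requests.get(
        f"{FLINK_REST}/jobs/{JOB_ID}/vertices/{vertex['id']}/backpressure"
    ).json()
    # The field name varies across Flink versions ("backpressureLevel" vs "backpressure-level").
    level = bp.get("backpressureLevel") or bp.get("backpressure-level", "unknown")
    if level != "ok":
        print(f"{vertex['name']}: back pressure level is {level}")
```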
Addressing Bottlenecks
Slow deployment performance can be addressed in a few ways, depending on the cause of the slowdown:
- Increasing CPU and memory resources assigned to the task manager.
- Increasing parallelism by increasing the number of task managers. (This is the preferred method of increasing parallelism.)
- Increasing parallelism by increasing the number of slots run in each task manager.
If the Flink dashboard shows that the bottleneck (red box) is in a Create event ... task, then the computation may be the cause of the bottleneck. If the bottleneck is seen in a join step, then there may be too much data on one side of the join, resulting in several updates being produced. In either of these cases, it is worth analyzing the patterns to see whether improvements can be made. If not, you can try assigning more resources to the task manager or increasing parallelism to alleviate the situation.