Version: 2.15.X

Importing Data

Importing data is the first step in the modeling process. Clean, properly formatted data sets the rest of the modeling process up for success.

This section outlines strategies for ensuring data is optimally useful before, during, and after importation into Cogynt.

Transforming Data

The performance of your models will only be as good as the data they are given. Therefore, the quality of your input data matters.

Refer to the Cogynt Authoring User Guide for details on the different datatypes Cogynt supports. Some datatypes have strict formatting requirements, and it is important that your data abide by them.

Example

Cogynt can perform datetime operations only on timestamps that are in Zulu format, e.g., 1972-06-22T04:55:30.000000Z.

Ideally, the input data has been cleaned with the necessary transformations before it is published to Kafka.
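If you control the pipeline that publishes to Kafka, a small preprocessing step can normalize timestamps into Zulu format before the data ever reaches Cogynt. The following is a minimal sketch, not a prescribed integration: it assumes the kafka-python client, a local broker address, a hypothetical bank_account topic, hypothetical field names, and a source timestamp format of MM/DD/YYYY HH:MM:SS. Adapt it to your actual pipeline and client library.

```python
# Minimal sketch: normalize timestamps to Zulu format before publishing to Kafka.
# The broker address, topic name, field names, and source timestamp format are
# assumptions for illustration; kafka-python is used here, but any client works.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def to_zulu(raw: str) -> str:
    """Convert a source timestamp (assumed 'MM/DD/YYYY HH:MM:SS') to Zulu format."""
    dt = datetime.strptime(raw, "%m/%d/%Y %H:%M:%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.%fZ")  # e.g., 1972-06-22T04:55:30.000000Z

record = {"account_id": "A-1001", "opened_at": "06/22/1972 04:55:30"}
record["opened_at"] = to_zulu(record["opened_at"])

producer.send("bank_account", record)
producer.flush()
```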

If data preprocessing is not an option, Cogynt patterns can also be used to perform several basic ETL (extract/transform/load) functions. Refer to the Cogynt Authoring User Guide for details on the computational operations and functions available in the computations toolbox. However, this is only recommended for simple ETL tasks.

Because running ETL in Cogynt means running extra patterns in Flink, it also means sharing computational resources and storage with other core analytic patterns. This may affect performance and create unexpected bottlenecks. Such issues can only be solved by adjusting and optimizing deployment configurations and resource allocation.

Using Schema Discovery

Once input data is stored in Kafka topics, Cogynt can read and extract a data schema from the topics through the Schema Discovery feature in Authoring.

We highly recommend using Schema Discovery to create user data schemas in Cogynt rather than creating them manually in Authoring. Not only does it save time and reduce the risk of human error, but it can also verify that the data in Kafka is in the correct format.

Example

A timestamp field might be "mistakenly" recognized as a string field because the data sampled by Schema Discovery contained datetime values in an incorrect format (for instance, 1972-06-22T04:55:30, which is missing the trailing decimal and Z character).

Checking your data in Schema Discovery can provide good indicators that the data needs to be revised. Refer to the Cogynt Authoring User Guide for more information on data formats.
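You can also spot-check records before Schema Discovery samples the topic. The following is a minimal sketch under stated assumptions: records are JSON, the timestamp field names are hypothetical, and the regular expression simply mirrors the Zulu example format shown above.

```python
# Minimal sketch: spot-check that timestamp fields match the Zulu format Cogynt
# expects (e.g., 1972-06-22T04:55:30.000000Z). Field names are assumptions.
import json
import re

ZULU = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}Z$")

def check_record(raw: bytes, timestamp_fields=("opened_at", "transaction_time")) -> list:
    """Return the fields whose values would not be recognized as timestamps."""
    record = json.loads(raw)
    return [
        field for field in timestamp_fields
        if field in record and not ZULU.match(str(record[field]))
    ]

# A value missing the trailing decimal and Z character is flagged:
bad = json.dumps({"opened_at": "1972-06-22T04:55:30"}).encode("utf-8")
print(check_record(bad))  # ['opened_at']
```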

Maintaining Topic Naming Conventions

Topic names for input events are typically defined outside of Cogynt. Topic names for all outputs of patterns are defined in Authoring, specifically in the Event Type configuration settings.

Regardless of where the topic name is defined, topics should all have unique, concise, and descriptive names to avoid confusion and reduce the room for human error. Sometimes it may even be appropriate to name event types and topics after the pattern that produces their output results.

Example

You have an Event Pattern named Bank Fraud that takes bank account data and bank transaction data as input, and identifies fraud as an output result. Instead of naming your input topics b_acct or b_trans, you should name them bank_account or bank_transaction.

In addition, the output topic may even be named after the pattern (i.e., bank_fraud) so that it is easy to map the output to the pattern while troubleshooting.
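If your team creates input topics programmatically, the same naming conventions can be applied there. Below is a minimal sketch using kafka-python's admin client for the Bank Fraud example; the broker address, partition count, and replication factor are placeholder assumptions, not recommended settings.

```python
# Minimal sketch: create descriptively named topics for the Bank Fraud example.
# Broker address, partition count, and replication factor are placeholders.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

topics = [
    NewTopic(name="bank_account", num_partitions=1, replication_factor=1),
    NewTopic(name="bank_transaction", num_partitions=1, replication_factor=1),
    NewTopic(name="bank_fraud", num_partitions=1, replication_factor=1),  # pattern output
]

admin.create_topics(new_topics=topics)
```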

Event types that support specific features in Authoring should also be named accordingly. Link analysis, for example, requires creating "Entity" and "Linkage" event types. Modelers are advised to include keywords such as entity or linkage in the topic name to clearly indicate the purpose of that topic.

Include keywords as prefixes in the names of topics that belong to the same functional group.

Example

Suppose a project contains one group of patterns that serve only as ETL (extract, transform, and load) patterns, another group that performs risk computations, and still another group that is responsible only for lexicon filters. You may want to deploy and test them separately.

In this case, name all ETL pattern output topics with an etl_ prefix, all lexicon pattern output topics with a lex_ prefix, and so on. When isolating and testing only ETL or lexicon patterns, you can conveniently reset topics between deployments by filtering your list of topics and only resetting those with an etl_ or lex_ prefix.
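How you reset topics depends on your deployment workflow, but the prefix convention also makes filtering straightforward outside of Cogynt. The following is a minimal sketch using kafka-python's admin client to find and delete all etl_-prefixed topics; the broker address is a placeholder, and topic deletion is destructive, so confirm it matches your reset procedure before using anything like it.

```python
# Minimal sketch: reset all ETL output topics by prefix. The broker address is a
# placeholder, and topic deletion is destructive -- use with care.
from kafka.admin import KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

etl_topics = [t for t in admin.list_topics() if t.startswith("etl_")]
if etl_topics:
    admin.delete_topics(etl_topics)
    print(f"Deleted: {etl_topics}")
```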