Ingesting data

Integrating your model data with SUPERWISE® is a simple process meant to improve your ML operations. Once connected, the data is systematically arranged into datasets for easy querying, allowing you to derive insights from all aspects of your operation.

How to connect Cloud Storage to SUPERWISE®

SUPERWISE® currently supports data integration through cloud file storage services like Google Cloud Storage (GCS) and Amazon Web Services (AWS) S3. When a new file is added to your connected file storage, our platform automatically fetches the data.

Here's how to set up this connection:

Step 1: Create a Source

A Source in SUPERWISE® is a secure connection to your cloud storage bucket (e.g., GCS or AWS S3). This step is a one-time configuration that grants SUPERWISE® permission to access your designated bucket.

Step 2: Create a Dataset

After setting up a Source, you'll create a Dataset within the SUPERWISE® platform. This Dataset will store your ingested data, allowing you to:

Query the data.
Build custom Dashboards.
Set up monitoring policies.

Step 3: Connect the Dataset to the Source

Finally, you'll establish a link between your Dataset and its Source. This involves defining a specific path within your cloud storage bucket for each Dataset. This unique path ensures that any new files added to that location in your storage are automatically fetched and integrated into the correct Dataset, keeping your data up-to-date.
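To make the path-to-Dataset relationship concrete, here is a minimal conceptual sketch in Python. It is not SUPERWISE® code and does not call any SUPERWISE® API: the prefixes, dataset names, and the `dataset_for_object` function are all hypothetical, and the longest-prefix rule is an assumption used only to keep the illustration unambiguous.

```python
from typing import Optional

# Hypothetical mapping of bucket path prefixes to dataset names (illustrative only).
DATASET_PATHS = {
    "predictions/": "predictions_dataset",
    "labels/": "labels_dataset",
}


def dataset_for_object(
    object_key: str, dataset_paths: dict[str, str] = DATASET_PATHS
) -> Optional[str]:
    """Return the dataset whose path prefix matches the object key, or None."""
    matches = [prefix for prefix in dataset_paths if object_key.startswith(prefix)]
    if not matches:
        return None  # The file sits outside every configured path.
    # Assumption: prefer the longest matching prefix so nested paths resolve to one dataset.
    return dataset_paths[max(matches, key=len)]
```

Under these assumptions, a file uploaded as `predictions/2024/batch.csv` maps to `predictions_dataset`, while a file outside every configured path maps to no dataset at all.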

Important Considerations

Pre-existing Files: Files already in your connected cloud storage bucket before you complete integration steps 1-3 are not pulled into your dataset automatically. Only new files, or existing files that are re-uploaded or modified after integration, trigger the platform to fetch the data.
Schema and Sample Data: Your dataset's schema defines the names, data types, and structure of your fields. It matters for machine learning observability because it determines how data is organized, interpreted, and monitored for changes. A well-defined schema keeps the data aligned with the expected formats and types, which supports accurate monitoring and troubleshooting. Both the schema and sample data are created only after the first data insertion into your dataset. Until then, the Dataset object within SUPERWISE® appears empty.
Loading Time: Uploading a file to your cloud storage immediately triggers the SUPERWISE® platform to fetch the data. You may still experience a brief delay (several seconds to minutes) before the dataset updates and the data appears in SUPERWISE®.
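As a rough illustration of the pre-existing files rule, the sketch below picks out files whose last-modified time predates the integration, since those are the ones that would need to be re-uploaded or modified to be ingested. The function name, file keys, and timestamps are hypothetical. In practice the listing would come from your storage provider's own tooling.

```python
from datetime import datetime, timezone


def files_needing_reupload(files, integrated_at):
    """Return keys of files last modified before the integration was completed.

    `files` is an iterable of (object_key, last_modified) pairs.
    """
    return [key for key, last_modified in files if last_modified < integrated_at]


# Hypothetical example values.
integrated_at = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
files = [
    ("data/old.csv", datetime(2024, 4, 30, 9, 0, tzinfo=timezone.utc)),
    ("data/new.csv", datetime(2024, 5, 2, 9, 0, tzinfo=timezone.utc)),
]
print(files_needing_reupload(files, integrated_at))  # prints ['data/old.csv']
```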

🚧
CSV File Format Requirements:
For successful data ingestion from CSV files, please adhere to the following requirements:

Accepted column name characters: Each column name must:

Include only letters (a-z, A-Z), numbers (0-9), or underscores (_).

Start with a letter or an underscore.

Maximum length: Column names cannot exceed 300 characters.

Restricted prefixes: Column names cannot begin with any of the following prefixes:

_TABLE_

_FILE_

_PARTITION

_ROW_TIMESTAMP

__ROOT__

_COLIDENTIFIER

Uniqueness: Duplicate column names are not allowed, even if their letter casing differs. For example, Column1 and column1 are considered identical and will result in an error.
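The requirements above can be checked locally before a file is uploaded. The following is a minimal sketch, not an official SUPERWISE® validator: the function names are hypothetical, and comparing restricted prefixes case-insensitively is a conservative assumption, since the requirements do not state how casing is treated for prefixes.

```python
import csv
import io
import re

MAX_LENGTH = 300
RESTRICTED_PREFIXES = (
    "_TABLE_", "_FILE_", "_PARTITION", "_ROW_TIMESTAMP", "__ROOT__", "_COLIDENTIFIER",
)
# Letters, digits, and underscores only; must start with a letter or underscore.
VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_column_names(names):
    """Return a list of problems found; an empty list means every name passes."""
    problems = []
    seen = {}  # lowercased name -> first spelling encountered
    for name in names:
        if not VALID_NAME.fullmatch(name):
            problems.append(f"{name!r}: invalid characters or invalid first character")
        if len(name) > MAX_LENGTH:
            problems.append(f"{name[:20]!r}...: longer than {MAX_LENGTH} characters")
        # Assumption: prefixes are compared case-insensitively (not stated in the docs).
        upper = name.upper()
        for prefix in RESTRICTED_PREFIXES:
            if upper.startswith(prefix):
                problems.append(f"{name!r}: starts with restricted prefix {prefix}")
                break
        key = name.lower()
        if key in seen:
            problems.append(f"{name!r}: duplicates {seen[key]!r} (casing is ignored)")
        else:
            seen[key] = name
    return problems


def validate_csv_header(csv_text):
    """Validate the first row of CSV text as the column header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return validate_column_names(header)
```

For example, `validate_csv_header("Column1,column1\n1,2\n")` reports a duplicate, while a header such as `id,_score,feature_1` produces no problems.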

Steps to Follow

Connect New Source - Understand and establish your Source in SUPERWISE®.
Create Dataset and Connect to a Source - Create your Dataset and ensure it's linked to your Source to enable seamless data updates.
Automated Alerts for Ingestion Failures (Optional) - Set up policies on "failed data ingestion" events to receive notifications when files fail to ingest.

With these steps, integrating your data with SUPERWISE® becomes an efficient process, ensuring continuous data flow and up-to-date insights for optimal ML operations.