
Create dataset and connect it to a source

Follow these two primary steps to create a dataset and connect it to a source.

Using the UI:

Step 1: Create a dataset

Start by giving your dataset a unique name. Once the dataset is named, the platform assigns it a dataset ID. The ID is not fixed; you can edit it as needed.

Step 2: Connect your dataset to a configured source

Once the dataset has been created, connect it to a source that you have already configured.

If you don't have a source yet

If you have not set up a source yet, follow the instructions for configuring a new source (link to documentation of connecting to a source).

Select the relevant source

From your list of configured sources, select the one that is relevant to the dataset you just created.

For S3 or GCS

Notification action selection:

Choose which action triggers a notification to the platform: the insertion of new data or an update to existing data.

📘

Pay attention

The initial data insertion into the dataset requires an "Insert" action.

Specify the file or folder path:

Finally, provide the path to the file or folder in your source that holds the data for your new dataset. The path already contains a prefix, so fill in only what comes after it. The prefix is one of:

  • s3://<bucket name>
  • gs://<bucket name>

🚧

CSV File Format Requirements:

  • Column Name Composition: Each column name may contain only letters (a-z, A-Z), numbers (0-9), and underscores (_), and must start with a letter or an underscore.
  • Maximum Length: Column names cannot exceed 300 characters.
  • Restricted Prefixes: Column names cannot begin with any of the following prefixes:
    • _TABLE_
    • _FILE_
    • _PARTITION
    • _ROW_TIMESTAMP
    • __ROOT__
    • _COLIDENTIFIER
  • Uniqueness: Duplicate column names are not allowed, regardless of letter casing. For instance, Column1 and column1 are considered identical.
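
The requirements above can be checked locally before a file is uploaded. The following is a minimal sketch, not part of the platform or its SDK: the function name `validate_column_names` is hypothetical, and matching the restricted prefixes case-insensitively is an assumption (the stricter reading of the rule).

```python
import re

# Rules transcribed from the CSV File Format Requirements above.
MAX_COLUMN_NAME_LENGTH = 300
RESTRICTED_PREFIXES = (
    "_TABLE_",
    "_FILE_",
    "_PARTITION",
    "_ROW_TIMESTAMP",
    "__ROOT__",
    "_COLIDENTIFIER",
)
# Letters, digits and underscores only; first character is a letter or underscore.
VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_column_names(names):
    """Return a list of problems found; an empty list means the header passes."""
    problems = []
    seen = {}  # lowercased name -> first spelling encountered
    for name in names:
        if not VALID_NAME.fullmatch(name):
            problems.append(f"{name!r}: invalid character or invalid first character")
        if len(name) > MAX_COLUMN_NAME_LENGTH:
            problems.append(f"{name[:20]!r}...: longer than {MAX_COLUMN_NAME_LENGTH} characters")
        # Assumption: prefixes are compared case-insensitively.
        if name.upper().startswith(RESTRICTED_PREFIXES):
            problems.append(f"{name!r}: starts with a restricted prefix")
        key = name.lower()
        if key in seen:
            problems.append(f"{name!r}: duplicates {seen[key]!r} (case-insensitive)")
        else:
            seen[key] = name
    return problems


if __name__ == "__main__":
    for problem in validate_column_names(["id", "Column1", "column1", "_FILE_name"]):
        print(problem)
```

Passing the header row of a CSV file (for example, the first row returned by `csv.reader`) to this function lists every violation at once instead of stopping at the first.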

Using the SDK:

Step 1: Create a dataset

The first stage in monitoring your data is to create a dataset. A dataset is where your data will be stored and analyzed.

# `sw` is an already-initialized SDK client.
dataset = sw.dataset.create(name="my dataset")

Step 2: Connect a source to a dataset

Once both the source and the dataset are ready, link them so that data from the source flows into the dataset.

from superwise_api.models.dataset_source.dataset_source import IngestType

# `source` is the previously configured source; `dataset` is the object created in Step 1.
dataset_source = sw.dataset_source.create(dataset_id=dataset.id, source_id=source.id, folder="folder name", ingest_type=IngestType.INSERT)

📘

Folder parameter

Provide the path to the file or folder in your source that holds the data for your new dataset. The path already contains a prefix, so fill in only what comes after it. The prefix is one of:

  • s3://<bucket name>
  • gs://<bucket name>
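
If you start from a full object URI, the part that comes after the prefix can be derived with plain string handling. The sketch below is illustrative only: `split_bucket_prefix` is a hypothetical helper, not an SDK function, and returning the remainder without a leading slash is an assumption about the expected format.

```python
def split_bucket_prefix(uri):
    """Split 's3://bucket/some/folder' into ('s3://bucket', 'some/folder')."""
    for scheme in ("s3://", "gs://"):
        if uri.startswith(scheme):
            bucket, _, remainder = uri[len(scheme):].partition("/")
            if not bucket:
                raise ValueError(f"no bucket name in {uri!r}")
            # Assumption: the value to fill in has no leading slash.
            return scheme + bucket, remainder
    raise ValueError(f"expected an s3:// or gs:// URI, got {uri!r}")


if __name__ == "__main__":
    prefix, folder = split_bucket_prefix("s3://my-bucket/exports/daily")
    print(prefix)  # the part the platform already provides
    print(folder)  # the part to pass as the folder parameter
```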

Additional Information on Datasets

Schema View

After the initial data insertion into a dataset, SWE automatically detects the schema of the data. You can then view or edit the schema fields as needed.

Sample Data View

SWE automatically generates a view of the dataset’s sample data that includes the first 100 records. This provides you with an immediate sense of the data you have just uploaded.
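
To get a comparable preview of a CSV file before uploading it, you can read its first 100 records locally. This is a rough local approximation using only the Python standard library, not how the platform builds its sample view; `preview_records` is a hypothetical helper name.

```python
import csv
from itertools import islice


def preview_records(csv_file, limit=100):
    """Return the header row and up to `limit` data rows from an open CSV file."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # empty list if the file has no rows at all
    rows = list(islice(reader, limit))
    return header, rows
```

A typical call opens the file with `open(path, newline="")`, as the `csv` module documentation recommends, and passes the file object to `preview_records`.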