Dataset

Building Machine Learning (ML) or data pipelines is closely tied to creating and ingesting datasets. In the following ML scenario, at least two datasets are generated during the pipeline: a Training dataset and an Inference dataset. You may also want to log Test and Validation datasets.

Datasets serve as a core entity in software engineering to describe and organize data with a specific schema for analysis and monitoring. The schema defines the expected structure of a dataset, including its fields (columns), their types (e.g., Numeric, String, Date), and default values for fields with no provided value.
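To make the idea of a schema concrete, here is a minimal Python sketch. It is not this product's API: the `Field` class, the `SCHEMA` list, the `apply_schema` helper, and the field names are all hypothetical and only illustrate how fields, types, and default values could fit together.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical description of one column in a dataset schema."""
    name: str
    py_types: tuple                 # Python types accepted for this field
    default: Optional[Any] = None   # used when a row has no value for the field


# Illustrative schema with one Numeric, one String, and one Date field.
SCHEMA = [
    Field("age", (int, float), default=0),
    Field("country", (str,), default="unknown"),
    Field("signup_date", (date,)),
]


def apply_schema(row: dict, schema: list) -> dict:
    """Fill in defaults for missing fields and check each value's type."""
    result = {}
    for field in schema:
        value = row.get(field.name, field.default)
        # None means "no value and no default"; it is passed through unchecked.
        if value is not None and not isinstance(value, field.py_types):
            raise TypeError(
                f"field {field.name!r} expected {field.py_types}, "
                f"got {type(value).__name__}"
            )
        result[field.name] = value
    return result


row = apply_schema({"age": 31, "signup_date": date(2024, 1, 5)}, SCHEMA)
print(row["country"])  # the missing field falls back to its default
```

In this sketch a missing field takes the schema's default, and a value of the wrong type raises a `TypeError` instead of being silently accepted. A real system might instead record such violations for monitoring.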