Connect new source

SWE provides seamless data integration by letting you link your data directly to the platform. Amazon S3 and Google Cloud Storage (GCP) buckets are currently supported as Sources.

📘

Supported file types

SWE supports CSV, Parquet, and JSONL file types. Ensure your files meet the required format for successful processing.

🚧

File name limitations

Please note that the file name should only include letters and numbers [A-Za-z0-9].
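
As a quick sanity check before uploading, you can verify that a file's base name matches this pattern. The snippet below is a minimal sketch, assuming the restriction applies to the base name (everything before the extension); the file name shown is hypothetical:

FILE="mydata2024.csv"
BASE="${FILE%.*}"   # strip the extension
if [[ "$BASE" =~ ^[A-Za-z0-9]+$ ]]; then
  echo "OK: $FILE"
else
  echo "Rename required: $FILE contains characters outside [A-Za-z0-9]"
fi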

❗️

1:1 Relationship between Bucket and Source

Each bucket should have a unique Source configuration. This means you can only designate the same bucket as a Source once. However, you can utilize different folders within the bucket for different datasets while still using the same Source.

To connect your data, please follow the instructions below:

Connect to an S3 source

To connect to an S3 source, you will need the following bucket information:

  • Bucket ARN - An S3 bucket ARN identifies the bucket by name; the region and AWS account ID fields are left empty. The structure of an S3 bucket ARN is as follows: arn:aws:s3:::bucket_name
  • SQS ARN - To ensure that the platform remains synchronized with the selected bucket, please set up an SQS queue. Once set up, our system will monitor the queue for any new file uploads. The format for an SQS ARN is as follows: arn:aws:sqs:region:account-id:queue-name
  • Access key ID - This is a unique identifier associated with an AWS IAM user or role.
  • Secret access key - This is a secret part of the credential pair, akin to a password.


Prerequisites

To create a source in our platform, there are a number of prerequisites that need to be met. These prerequisites are necessary to set up the communication and permission structure within AWS that allows our platform to access and interact with your S3 data.

Setting Variables:

You need to define a series of variables to be used in configuration:

REGION="fill in"
QUEUE_NAME="fill in"
BUCKET_NAME="fill in"
ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
SOURCE_NAME="superwise-s3-source-${BUCKET_NAME}-${QUEUE_NAME}"
USER_NAME=$SOURCE_NAME
export AWS_PAGER=""
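
For illustration, a filled-in set of variables might look like the following (the region, queue, and bucket names are hypothetical placeholders; replace them with your own):

REGION="us-east-1"              # region of your bucket and queue (example value)
QUEUE_NAME="superwise-source-queue"   # example queue name
BUCKET_NAME="my-data-bucket"          # example bucket name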

Create SQS Queue:

This step involves executing AWS CLI commands to create an SQS queue that will be linked to your S3 bucket to receive notifications.

QUEUE_URL=$(aws sqs create-queue --queue-name $QUEUE_NAME --region $REGION --output text --query 'QueueUrl')
QUEUE_ARN=$(aws sqs get-queue-attributes --queue-url $QUEUE_URL --attribute-names QueueArn --output text --query 'Attributes.QueueArn')

Create SQS Queue Access Policy:

This policy allows the S3 bucket to send messages to the SQS queue (triggered by events such as file uploads) and ensures that only your bucket can send these notifications to this particular queue.

  • The policy is configured and applied to the SQS queue, detailing the permissions around how the bucket can interact with the queue.
cat <<EOF > /tmp/${QUEUE_NAME}-policy.json
{
    "Version": "2012-10-17",
    "Id": "${QUEUE_NAME}",
    "Statement": [
        {
            "Sid": "${QUEUE_NAME}",
            "Effect": "Allow",
            "Principal": {
                "Service": "s3.amazonaws.com"
            },
            "Action": [
                "SQS:SendMessage"
            ],
            "Resource": "${QUEUE_ARN}",
            "Condition": {
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:s3:::${BUCKET_NAME}"
                },
                "StringEquals": {
                    "aws:SourceAccount": "${ACCOUNT_ID}"
                }
            }
        }
    ]
}
EOF
QUEUE_POLICY=$(</tmp/${QUEUE_NAME}-policy.json)
aws sqs set-queue-attributes --queue-url $QUEUE_URL --attributes Policy=\"${QUEUE_POLICY//\"/\\\"}\"
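
Optionally, you can confirm that the policy was applied by reading it back from the queue (this is a verification step only, not required for the setup):

# Print the policy currently attached to the queue
aws sqs get-queue-attributes --queue-url $QUEUE_URL --attribute-names Policy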

Create IAM Policy:

This IAM policy outlines the permissions the platform will need to effectively interact with the S3 bucket and the SQS queue. It includes permissions to:

  • Retrieve and update bucket notifications.
  • Retrieve messages from the SQS queue and delete them after processing.
  • Access objects in the S3 bucket.

This policy balances necessary access with security, granting only the least privilege required for operation.

cat <<EOF > /tmp/${SOURCE_NAME}-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3SourceSetBucketConfig",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketNotification",
								"s3:PutBucketNotification"
            ],
            "Resource": "arn:aws:s3:::${BUCKET_NAME}"
        },
        {
            "Sid": "S3SourceConsumeMessages",
            "Effect": "Allow",
            "Action": [
                "sqs:GetQueueUrl",
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage"
            ],
            "Resource": "${QUEUE_ARN}"
        },
        {
            "Sid": "S3SourceBucketAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::${BUCKET_NAME}/*"
        }
    ]
}
EOF
POLICY_ARN=$(aws iam create-policy --policy-name $SOURCE_NAME --policy-document file:///tmp/${SOURCE_NAME}-policy.json --output text --query 'Policy.Arn')
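
If you want to double-check the result, the new policy and its document can be inspected with the AWS CLI (optional verification only):

# Show the policy metadata
aws iam get-policy --policy-arn $POLICY_ARN
# Show the policy document (v1 is the initial version of a newly created policy)
aws iam get-policy-version --policy-arn $POLICY_ARN --version-id v1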

Create IAM User and Credentials:

Finally, you create an IAM user and generate security credentials (Access Key ID and Secret Access Key) which our platform uses to authenticate and act on behalf of this IAM user when accessing the S3 bucket and SQS queue.

  • The IAM user is tied to the permissions defined in the IAM policy.
  • The access keys are provided to our platform to securely manage and access the source data.
aws iam create-user --user-name $USER_NAME
aws iam create-access-key --user-name $USER_NAME --output table

📘

Pay attention!

Take note of the secret access key now; AWS displays it only once, at the time the access key is created.

Attach the IAM policy to the user

aws iam attach-user-policy --user-name $USER_NAME --policy-arn $POLICY_ARN
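
To verify that the user now carries the policy, you can list its attached policies (optional check):

# The policy created above should appear in the output
aws iam list-attached-user-policies --user-name $USER_NAME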

By carefully following these steps, you will establish a secure, private connection between your AWS resources and our platform, enabling seamless data integration.

For more information on how to set up an SQS queue, please refer to the AWS SQS documentation.

Using the UI

Before you begin, ensure you have completed all the prerequisites.

Here's how to create a new Source:

  1. Open the Source creation form: Click the "S3" button.
  2. Name your Source: Enter a descriptive name for easy identification later.
  3. Connect to Amazon S3: Provide the S3 bucket's ARN (Amazon Resource Name).
  4. Connect to Amazon SQS: Provide the SQS ARN.
  5. Authenticate with credentials:
    • Access Key ID: Enter your access key ID.
    • Secret Access Key: Enter your secret access key.
  6. Save your Source: Click the "Done" button to save and create the Source.

Using the SDK

To create a Source using the SDK, please ensure that you have completed all Prerequisites. Once that's done, proceed with the following code snippet (here sw is assumed to be an initialized SDK client):

source = sw.source.create_s3_source(
  name="my source", 
  bucket_arn="arn:aws:s3:::my-bucket",
  queue_arn="arn:aws:sqs:region:account-id:my-queue",
  aws_access_key_id="access_key",
  aws_secret_access_key="secret_key"
)

Connect to a GCP source

When integrating a GCP source, you'll require specific information pertaining to your storage bucket:

  • Bucket name - The unique identifier for your Google Cloud Storage bucket.
  • PubSub topic name - To keep the platform in sync with the designated bucket, configure a Google Cloud Pub/Sub topic. Our system will watch this topic to detect any new files that are uploaded.
  • Service account key - A file in JSON format that contains credentials for your GCP service account, necessary for authentication and authorization.

📘

On-Premises Configuration Flexibility

For those utilizing an on-premises setup, a service account may not be required; this depends on your specific infrastructure and configuration.

Prerequisites

To integrate a Google Cloud Storage (GCS) source with our platform, there are several prerequisites that should be addressed. These steps set up the correct permissions and ensure secure communication between our platform and your Google Cloud resources. The requirements and their purposes are explained below.

Variables Setup:

These are placeholders that you need to replace with actual values for your Google Cloud project and resources.

GCP_PROJECT_ID="fill in"
GCP_SERVICE_ACCOUNT_NAME="fill in"
GCP_TOPIC_NAME="fill in"
GCP_BUCKET_NAME="fill in"
GCP_PROJECT_ROLE="superwise.source.gcs.pubsub.subscriber"
GCP_TOPIC_ROLE="superwise.source.gcs.pubsub.topicSubscriber"
GCP_BUCKET_ROLE="superwise.source.gcs.objectViewer"

IAM Roles Creation:

Custom roles with limited permissions are created to minimize the risk of unauthorized access and ensure that Superwise has only the permissions that are strictly necessary.

  • Project-level role: This role will permit Superwise to subscribe to Pub/Sub topics, allowing for notification when new files are added to the bucket.
  • Topic-level role: Ensures Superwise can attach a subscription to the Pub/Sub topic and receive messages.
  • Bucket-level role: Grants Superwise the ability to view and list objects in the GCS bucket, necessary for reading the data files.
# Project level role
gcloud iam roles create ${GCP_PROJECT_ROLE} \
    --project ${GCP_PROJECT_ID} \
    --title "Superwise GCS Source Pub/Sub Subscriber" \
    --description "Allows Superwise GCS Source to subscribe to Pub/Sub topics (applied on the project level)" \
    --stage BETA \
    --permissions "pubsub.subscriptions.consume,pubsub.subscriptions.get,pubsub.subscriptions.create,pubsub.subscriptions.delete"

# Topic level role
gcloud iam roles create ${GCP_TOPIC_ROLE} \
    --project ${GCP_PROJECT_ID} \
    --title "Superwise GCS Source Pub/Sub Topic Subscriber" \
    --description "Allows Superwise GCS Source to subscribe to Pub/Sub topics (applied on the topic level)" \
    --stage BETA \
    --permissions "pubsub.topics.attachSubscription,pubsub.topics.get"

# Bucket level role
gcloud iam roles create ${GCP_BUCKET_ROLE} \
    --project ${GCP_PROJECT_ID} \
    --title "Superwise GCS Source Storage Object Viewer" \
    --description "Allows Superwise GCS Source to read and list objects in the GCS bucket (applied on the bucket level)" \
    --stage BETA \
    --permissions "storage.buckets.get,storage.objects.get,storage.objects.list"

Service Account Creation and Key Generation:

A service account is an identity that Superwise will use to interact with your Google Cloud resources securely.

  • Create a dedicated service account specifically for Superwise's use.
  • A key for this service account is then generated and is used by Superwise to authenticate with Google Cloud.
# Service account
gcloud iam service-accounts create ${GCP_SERVICE_ACCOUNT_NAME} --project ${GCP_PROJECT_ID}

# Service account key
gcloud iam service-accounts keys create \
  --project ${GCP_PROJECT_ID} \
  --iam-account ${GCP_SERVICE_ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com \
  /tmp/${GCP_SERVICE_ACCOUNT_NAME}.json
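
The generated file /tmp/${GCP_SERVICE_ACCOUNT_NAME}.json is the service account key you will upload later; keep it secure. If needed, you can list the keys that exist for the account (optional check):

# Lists key IDs and creation dates for the service account
gcloud iam service-accounts keys list \
  --project ${GCP_PROJECT_ID} \
  --iam-account ${GCP_SERVICE_ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com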

Project IAM Bindings:

Bind the project-level custom role to the service account so that its permissions apply across the project.

gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} \
    --member serviceAccount:${GCP_SERVICE_ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com \
    --role projects/${GCP_PROJECT_ID}/roles/${GCP_PROJECT_ROLE}

Notification Mechanism and Permission Setting:

  • Topic: A PubSub topic is created as a messaging channel for events such as file additions to the bucket.
  • Topic permissions: Set to allow the service account to subscribe to this topic.
# Topic
gcloud pubsub topics create ${GCP_TOPIC_NAME} --project ${GCP_PROJECT_ID}

# Topic permissions
gcloud pubsub topics add-iam-policy-binding ${GCP_TOPIC_NAME} \
  --project ${GCP_PROJECT_ID} \
  --member serviceAccount:${GCP_SERVICE_ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com \
  --role projects/${GCP_PROJECT_ID}/roles/${GCP_TOPIC_ROLE}
  • Bucket: Notifications for the GCS bucket are configured to publish to the PubSub topic for events like object creation (OBJECT_FINALIZE).
  • Bucket permissions: Set to allow the service account to access the data in the bucket under the assigned role.
# Notifications
gcloud storage buckets notifications create gs://${GCP_BUCKET_NAME} \
  --project ${GCP_PROJECT_ID} \
  --topic projects/${GCP_PROJECT_ID}/topics/${GCP_TOPIC_NAME} \
  --event-types OBJECT_FINALIZE

# Bucket permissions
gcloud storage buckets add-iam-policy-binding gs://${GCP_BUCKET_NAME} \
    --project ${GCP_PROJECT_ID} \
    --member serviceAccount:${GCP_SERVICE_ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com \
    --role projects/${GCP_PROJECT_ID}/roles/${GCP_BUCKET_ROLE}
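
As a final optional check, you can list the notification configurations on the bucket to confirm that OBJECT_FINALIZE events will be published to the topic:

# Shows the bucket's notification configurations and their target topics
gcloud storage buckets notifications list gs://${GCP_BUCKET_NAME}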

By following these steps, you grant Superwise the roles and access it needs to securely connect to GCS and Pub/Sub for data sourcing. The setup provides only the functionality required, without over-privileging, in line with the principle of least privilege.

For more information on how to configure a Pub/Sub topic, please refer to the Google Cloud Pub/Sub documentation.

Using the UI

Before you begin, ensure you have completed all the prerequisites.

Here's how to create a new Source:

  1. Open the Source creation form: Click the "GCP" button.
  2. Name your Source: Enter a descriptive name for easy identification later.
  3. Provide the bucket information: Enter the bucket name.
  4. Connect a Pub/Sub topic: Add the PubSub topic name.
  5. Authenticate with service account: Upload the service account JSON file.
  6. Save your Source: Click the "Done" button to save and create the Source.

Using the SDK

To create a Source using the SDK, please ensure that you have completed all Prerequisites. Once that's done, proceed with the following code snippet (here sw is assumed to be an initialized SDK client; service_account is left as an empty placeholder for your service account key JSON):

source = sw.source.create_gcs_source(
  name="my source", 
  bucket_name="my_bucket",
  pubsub_topic="projects/my_project/topics/my_topic",
  service_account={}
)