SageMaker Pipelines
ML-specific pipeline orchestration within SageMaker
ML-native CI/CD
Mental model
- Orchestrator specialized for ML steps (processing, training, evaluation, model registration).
- Best when you want tight integration with: Model Registry, Experiments, ML lineage, approvals.
Where it shows up in ML/GenAI
- Training pipeline: preprocess → train → evaluate → register → (manual approve) → deploy
- Offline eval pipelines (regression suites, dataset versioning gates)
- Feature processing + validation pipelines
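The preprocess → train → register flow above can be sketched as a fragment of the `Steps` array inside `pipeline_definition`. All names, URIs, and the `var.training_image` variable here are illustrative assumptions, not a complete definition:

```hcl
# Sketch: chaining steps with DependsOn. The "Preprocess" step is the one
# from the Terraform template below; names and paths are placeholders.
Steps = [
  {
    Name      = "Train",
    Type      = "Training",
    DependsOn = ["Preprocess"],
    Arguments = {
      AlgorithmSpecification = { TrainingImage = var.training_image, TrainingInputMode = "File" },
      OutputDataConfig       = { S3OutputPath = "s3://my-artifacts/models/" },
      ResourceConfig         = { InstanceType = "ml.m5.xlarge", InstanceCount = 1, VolumeSizeInGB = 50 },
      StoppingCondition      = { MaxRuntimeInSeconds = 3600 },
      RoleArn                = aws_iam_role.sm_role.arn
    }
  },
  {
    Name      = "RegisterModel",
    Type      = "RegisterModel",
    DependsOn = ["Train"],
    Arguments = {
      ModelPackageGroupName = "my-model-group",        # registry group (placeholder)
      ModelApprovalStatus   = "PendingManualApproval"  # governance gate before deploy
    }
  }
]
```

The `PendingManualApproval` status is what makes the "(manual approve) → deploy" gate real: downstream deployment automation only fires when the package is flipped to `Approved`.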
Key knobs
- Pipeline definition: keep steps small and cacheable
- Caching: enable step caching to avoid recompute on unchanged inputs
- Network isolation / VPC: for regulated workloads
- Model registry + approvals: enforce governance gates
- Artifact locations: S3 prefixes per env; enforce lifecycle policies
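The caching knob lives on each step inside the pipeline definition, not on the pipeline as a whole. A minimal sketch, assuming a 30-day expiry is acceptable (tune `ExpireAfter` per step):

```hcl
# Per-step caching inside pipeline_definition: a rerun with identical
# step arguments reuses the previous outputs until the cache expires.
{
  Name = "Preprocess",
  Type = "Processing",
  CacheConfig = {
    Enabled     = true,
    ExpireAfter = "P30D"  # ISO 8601 duration: cache hits valid for 30 days
  },
  Arguments = { /* same ProcessingResources / AppSpecification as in the template below */ }
}
```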
Pricing mental model
- Pipelines itself is “thin”: you mostly pay for the underlying jobs it launches (training, processing, batch transform, hosted endpoints, etc.).
- Senior heuristic: the biggest savings come from step caching, shutting down idle endpoints, and using Managed Spot Training.
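To make the Spot heuristic concrete, a Training step's `Arguments` can enable Managed Spot Training. The checkpoint path and wait times below are illustrative assumptions:

```hcl
# Managed Spot Training knobs on a Training step (CreateTrainingJob fields).
Arguments = {
  EnableManagedSpotTraining = true
  # MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds; the gap is how
  # long you are willing to wait for Spot capacity.
  StoppingCondition = {
    MaxRuntimeInSeconds  = 3600,
    MaxWaitTimeInSeconds = 7200
  }
  # Checkpointing lets an interrupted Spot job resume instead of restarting.
  CheckpointConfig = { S3Uri = "s3://my-artifacts/checkpoints/" }  # placeholder bucket
}
```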
Terraform template (SageMaker Pipeline resource)
data "aws_iam_policy_document" "sm_assume" {
  statement {
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["sagemaker.amazonaws.com"]
    }
    actions = ["sts:AssumeRole"]
  }
}

resource "aws_iam_role" "sm_role" {
  name               = "${var.name}-sm-pipeline-role"
  assume_role_policy = data.aws_iam_policy_document.sm_assume.json
}
resource "aws_iam_role_policy" "sm_policy" {
  role = aws_iam_role.sm_role.id

  # Wildcards kept for brevity; scope the S3 and SageMaker actions down in production.
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      { Effect = "Allow", Action = ["s3:*"], Resource = [var.artifacts_bucket_arn, "${var.artifacts_bucket_arn}/*"] },
      { Effect = "Allow", Action = ["sagemaker:*"], Resource = "*" }
    ]
  })
}
resource "aws_sagemaker_pipeline" "pipeline" {
  pipeline_name         = var.name
  pipeline_display_name = var.name # required by the AWS provider
  role_arn              = aws_iam_role.sm_role.arn

  pipeline_definition = jsonencode({
    Version    = "2020-12-01",
    Parameters = [],
    Steps = [
      {
        Name = "Preprocess",
        Type = "Processing",
        Arguments = {
          ProcessingResources = {
            ClusterConfig = { InstanceType = "ml.m5.xlarge", InstanceCount = 1, VolumeSizeInGB = 30 }
          },
          AppSpecification = { ImageUri = var.processing_image },
          RoleArn          = aws_iam_role.sm_role.arn
        }
      }
    ]
  })
}
variable "name" { type = string }
variable "artifacts_bucket_arn" { type = string }
variable "processing_image" { type = string }