SageMaker Processing
Data preprocessing and feature engineering jobs
Mental model
- Managed data processing jobs (preprocess, feature engineering, evaluation, dataset prep) on ephemeral compute.
- Similar in mechanics to training jobs, but intended to run general-purpose processing containers rather than training code (see the launch sketch below).
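A minimal launch sketch using the SageMaker Python SDK's ScriptProcessor; the image URI, role ARN, script name, and S3 paths are placeholders, not values from this page:

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# All identifiers below are hypothetical; substitute your own image, role, and buckets.
processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-processing:latest",
    command=["python3"],
    role="arn:aws:iam::<account>:role/MySageMakerExecutionRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    volume_size_in_gb=50,
)

processor.run(
    code="preprocess.py",  # your script; uploaded and executed inside the container
    inputs=[ProcessingInput(
        source="s3://my-bucket/raw/",
        destination="/opt/ml/processing/input",
    )],
    outputs=[ProcessingOutput(
        source="/opt/ml/processing/output",
        destination="s3://my-bucket/processed/",
    )],
)

The cluster spins up, runs the script, uploads whatever was written to the output path, and tears itself down; you pay only for the job's runtime.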
Where it fits
- Data prep for training/evals, batch feature generation, bias checks, model eval pipelines.
Knobs that matter
- Instance type/count and volume size: the main cost and throughput levers
- Network/VPC: attach to a VPC only when needed; NAT gateway data charges add up
- Caching: use pipeline step caching or reuse prior outputs in S3 (see the pipeline sketch after this list)
- Container image: keep it small; dependency bloat slows image pull and job startup
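A hedged sketch of step caching inside a SageMaker Pipeline; the processor, script, and S3 URIs are placeholder assumptions carried over from the sketch above:

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep, CacheConfig
from sagemaker.workflow.pipeline import Pipeline

# A step whose inputs and arguments are unchanged hits the cache and is skipped.
cache = CacheConfig(enable_caching=True, expire_after="P30D")  # ISO 8601 duration

step = ProcessingStep(
    name="preprocess",
    processor=processor,  # e.g. the ScriptProcessor defined earlier (placeholder)
    code="preprocess.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
    cache_config=cache,
)

pipeline = Pipeline(name="prep-pipeline", steps=[step])
pipeline.upsert(role_arn="arn:aws:iam::<account>:role/MySageMakerExecutionRole")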
Pricing mental model
- Cost ≈ #instances × job hours × per-instance hourly rate (same model as training jobs), so right-size aggressively.
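Back-of-the-envelope example; the hourly rate here is a made-up placeholder, so look up the current price for your instance type and region:

# Hypothetical: 2 x ml.m5.xlarge for a 30-minute job at an assumed $0.23/hr.
instances, hours, hourly_rate = 2, 0.5, 0.23  # rate is a placeholder, not a quoted price
print(f"~${instances * hours * hourly_rate:.2f} per run")  # ~$0.23 per run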
Terraform-style template (processing job skeleton)
Note: the Terraform AWS provider has no native processing-job resource (jobs are ephemeral and run once), so treat the block below as a shape reference for the CreateProcessingJob parameters; in practice, launch jobs with the SageMaker SDK, boto3, or the CLI, or as a ProcessingStep inside an aws_sagemaker_pipeline definition. A boto3 sketch follows the Terraform block.
resource "aws_sagemaker_processing_job" "proc" {
processing_job_name = "${var.name}-proc"
role_arn = var.sm_role_arn
app_specification {
image_uri = var.image_uri
}
processing_resources {
cluster_config {
instance_type = var.instance_type
instance_count = var.instance_count
volume_size_in_gb = 50
}
}
processing_inputs {
input_name = "input"
s3_input {
s3_uri = var.input_s3_uri
local_path = "/opt/ml/processing/input"
s3_data_type = "S3Prefix"
}
}
processing_outputs {
output_name = "output"
s3_output {
s3_uri = var.output_s3_uri
local_path = "/opt/ml/processing/output"
}
}
}
variable "name" { type = string }
variable "sm_role_arn" { type = string }
variable "image_uri" { type = string }
variable "instance_type" { type = string default = "ml.m5.xlarge" }
variable "instance_count" { type = number default = 1 }
variable "input_s3_uri" { type = string }
variable "output_s3_uri" { type = string }