Machine learning Well-Architected Framework pillars & best practices

Thangarajan Nagarethinam
4 min read · Oct 14, 2022


The AWS Well-Architected Framework's Machine Learning Lens provides architectural best practices for designing and operating machine learning workloads in the cloud. The framework consists of five pillars:

1️⃣ Operational Excellence
2️⃣ Security
3️⃣ Reliability
4️⃣ Performance Efficiency
5️⃣ Cost Optimization

The six phases of the ML lifecycle referenced in this lens are illustrated, in sequence, in the diagram below.

https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/ml-lifecycle-phases.png

The best practices for each framework pillar across the machine learning lifecycle are listed below.

📚Operational excellence pillar
►MLOE-01: Develop right skills with accountability and empowerment
►MLOE-02: Establish ML roles and responsibilities
►MLOE-03: Prepare an ML profile template
►MLOE-04: Establish model improvement strategies
►MLOE-05: Establish a lineage tracker system
►MLOE-06: Establish feedback loops across ML lifecycle phases
►MLOE-07: Profile data to improve quality
►MLOE-08: Document data processing
►MLOE-09: Automate operations through IaC and CaC
►MLOE-10: Establish scalable patterns to access approved public libraries
►MLOE-11: Establish deployment environment metrics
►MLOE-12: Enable monitoring health of model endpoint (see the sketch after this list)
►MLOE-13: Synchronize architecture and configuration, and check for skew across environments
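
For example, MLOE-12 (monitoring the health of a model endpoint) can be approached with a CloudWatch alarm on the endpoint's server-side errors. The sketch below is a minimal illustration using boto3; the endpoint name, threshold, and SNS topic ARN are placeholder assumptions, not values prescribed by the lens.

```python
# Minimal sketch for MLOE-12: alarm when a SageMaker endpoint returns 5XX errors.
# The endpoint name, threshold, and SNS topic below are illustrative assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="ml-endpoint-5xx-errors",              # hypothetical alarm name
    Namespace="AWS/SageMaker",                       # SageMaker endpoint metrics namespace
    MetricName="Invocation5XXErrors",                # server-side error count for the endpoint
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-ml-endpoint"},  # assumed endpoint name
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Sum",
    Period=300,                                      # evaluate over 5-minute windows
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ml-ops-alerts"],  # placeholder SNS topic
)
```

The same pattern extends to latency or invocation-count metrics, which also feed MLOE-11 (deployment environment metrics).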

📚Security pillar
►MLSEC-01: Validate ML software privacy and license terms
►MLSEC-02: Ensure least privilege access (see the sketch after this list)
►MLSEC-03: Secure data and modeling environment
►MLSEC-04: Protect sensitive data privacy
►MLSEC-05: Enforce data lineage
►MLSEC-06: Keep only relevant data
►MLSEC-07: Detect transfer learning risk
►MLSEC-08: Secure governed ML environment
►MLSEC-09: Secure inter-node cluster communications
►MLSEC-10: Protect against data poisoning threats
►MLSEC-11: Protect against adversarial and malicious activities
►MLSEC-12: Restrict access to intended legitimate consumers
►MLSEC-13: Monitor human interactions with data for anomalous activity
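
As an illustration of MLSEC-02, the following sketch uses boto3 to create a least-privilege IAM policy that only lets a training job read one data bucket and write model artifacts to a single prefix. The bucket names, prefix, and policy name are illustrative assumptions, not part of the lens itself.

```python
# Minimal sketch for MLSEC-02: least-privilege policy for an ML training job.
# Bucket names, the artifact prefix, and the policy name are illustrative assumptions.
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadTrainingData",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-training-data",      # assumed data bucket
                "arn:aws:s3:::example-training-data/*",
            ],
        },
        {
            "Sid": "WriteModelArtifacts",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::example-model-artifacts/project-a/*"],  # assumed prefix
        },
    ],
}

iam.create_policy(
    PolicyName="ml-training-least-privilege",        # hypothetical policy name
    PolicyDocument=json.dumps(policy_document),
)
```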

📚Reliability pillar
►MLREL-01: Discuss and agree on the level of model explainability
►MLREL-02: Use APIs to abstract change from model consuming applications
►MLREL-03: Adopt a machine learning microservice strategy
►MLREL-04: Use a data catalog
►MLREL-05: Use a data pipeline
►MLREL-06: Automate managing data changes
►MLREL-07: Enable CI/CD/CT automation with traceability
►MLREL-08: Ensure feature consistency across training and inference
►MLREL-09: Ensure model validation with relevant data
►MLREL-10: Automate endpoint changes through a pipeline
►MLREL-11: Use an appropriate deployment and testing strategy
►MLREL-12: Allow automatic scaling of the model endpoint (see the sketch after this list)
►MLREL-13: Ensure a recoverable endpoint with a managed version control strategy
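
To make MLREL-12 concrete, the sketch below registers a SageMaker endpoint variant with Application Auto Scaling and attaches a target-tracking policy on invocations per instance. The endpoint and variant names, capacity limits, and target value are assumptions chosen for illustration.

```python
# Minimal sketch for MLREL-12: target-tracking auto scaling for an endpoint variant.
# Endpoint/variant names, min/max capacity, and the target value are assumptions.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-ml-endpoint/variant/AllTraffic"   # assumed endpoint and variant

# Register the endpoint variant as a scalable target (1 to 4 instances).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on invocations per instance, targeting roughly 70 requests per instance per minute.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,   # scale in slowly to avoid flapping
        "ScaleOutCooldown": 60,   # scale out quickly under load
    },
)
```

Pairing this with MLREL-10 (endpoint changes through a pipeline) keeps scaling configuration versioned alongside the model itself.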

📚Performance efficiency pillar
►MLPER-01: Determine key performance indicators, including acceptable errors
►MLPER-02: Understand and manage the available services and resources
►MLPER-03: Review fairness and explainability
►MLPER-04: Define relevant evaluation metrics
►MLPER-05: Use a data lake house architecture
►MLPER-06: Optimize training and inference instance types
►MLPER-07: Explore alternatives for performance improvement
►MLPER-08: Establish a model performance evaluation pipeline
►MLPER-09: Establish feature statistics
►MLPER-10: Perform a performance trade-off analysis
►MLPER-11: Evaluate machine learning deployment option (cloud versus edge)
►MLPER-12: Evaluate model explainability
►MLPER-13: Evaluate data drift (see the sketch after this list)
►MLPER-14: Monitor, detect, and handle model performance degradation
►MLPER-15: Establish an automated re-training framework
►MLPER-16: Review for updated features
►MLPER-17: Include human-in-the-loop monitoring
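
MLPER-13 can be approximated with a simple statistical check; the sketch below compares a training feature distribution against a recent serving window using a two-sample Kolmogorov-Smirnov test. This is one generic approach rather than the lens's prescribed tooling, and the simulated data, sample sizes, and 0.05 significance level are illustrative assumptions.

```python
# Minimal sketch for MLPER-13: flag drift on one numeric feature with a KS test.
# The significance level and the simulated data below are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values: np.ndarray, serving_values: np.ndarray,
                    alpha: float = 0.05) -> bool:
    """Return True if the serving distribution differs significantly from training."""
    statistic, p_value = ks_2samp(train_values, serving_values)
    return p_value < alpha

# Example: a training baseline versus a shifted serving sample.
rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)   # distribution seen at training time
recent = rng.normal(loc=0.4, scale=1.0, size=1000)     # simulated shift in production traffic
print(feature_drifted(baseline, recent))                # True -> investigate or retrain (MLPER-15)
```

A drift flag like this is typically what feeds the automated re-training framework described in MLPER-15.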

📚Cost optimization pillar
►MLCOST-01: Define overall return on investment (ROI) and opportunity cost
►MLCOST-02: Use managed services to reduce total cost of ownership (TCO)
►MLCOST-03: Identify if machine learning is the right solution
►MLCOST-04: Enable data and compute proximity
►MLCOST-05: Select optimal algorithms
►MLCOST-06: Tradeoff analysis on custom versus pre-trained models
►MLCOST-07: Enable debugging and logging
►MLCOST-08: Use managed data labeling
►MLCOST-09: Use data wrangler tools for interactive analysis
►MLCOST-10: Enable feature reusability
►MLCOST-11: Establish data bias detection and mitigation
►MLCOST-12: Select optimal computing instance size
►MLCOST-13: Select local training for small scale experiments
►MLCOST-14: Select an optimal ML framework
►MLCOST-15: Use automated machine learning
►MLCOST-16: Use distributed training
►MLCOST-17: Stop resources when not in use (see the sketch after this list)
►MLCOST-18: Start training with small datasets
►MLCOST-19: Use warm-start and checkpointing hyperparameter tuning
►MLCOST-20: Use hyperparameter optimization technologies
►MLCOST-21: Use an inference pipeline
►MLCOST-22: Monitor usage and cost by ML activity
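
As one way to act on MLCOST-17, the sketch below uses boto3 to stop every SageMaker notebook instance that is still in service, for example from a scheduled off-hours job. The scheduling mechanism itself (such as an EventBridge rule) is assumed and not shown.

```python
# Minimal sketch for MLCOST-17: stop running SageMaker notebook instances off-hours.
# Assumes this runs from a scheduled job; the schedule itself is not shown.
import boto3

sagemaker = boto3.client("sagemaker")

# Page through notebook instances and stop any that are still running.
paginator = sagemaker.get_paginator("list_notebook_instances")
for page in paginator.paginate(StatusEquals="InService"):
    for notebook in page["NotebookInstances"]:
        name = notebook["NotebookInstanceName"]
        sagemaker.stop_notebook_instance(NotebookInstanceName=name)
        print(f"Stopping idle notebook instance: {name}")
```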

Conclusion

The Well-Architected ML design principles underpin this collection of best practices. These technology- and cloud-agnostic best practices, organized across the Well-Architected pillars, provide architectural guidance for each phase of the ML lifecycle.

Use this collection of best practices to ensure that your ML workloads are architected with operational excellence, security, reliability, performance efficiency, and cost optimization in mind. Plan early and make informed decisions when designing new workloads.

Reach out to me on social media for a more in-depth conversation or with any questions.
