Customer Overview
DigiSquad is a South African digital payments and value-added services provider founded in 2018. Through its DigiMali platform, DigiSquad aggregates digital financial services into a single interface, enabling users to purchase prepaid electricity, utilities, and other digital services conveniently and reliably.
DigiMali operates in a business-critical fintech environment where transaction integrity, uptime, partner integration reliability, auditability, and secure cloud operations are essential. The platform has established market credibility, including endorsement by Eskom as an official prepaid electricity vending solution.
DigiMali is a production AWS workload supporting live customer transactions.
Challenges
As DigiMali’s user base, transaction volume, and service portfolio expanded, DigiSquad required a stronger AWS operating model. The challenge was not only to modernise the application, but also to ensure that the production workload could be operated securely, consistently, and reliably at scale.
The previous operating model left several risks and operational gaps:
- Unpredictable traffic spikes during high-demand periods, especially month-end prepaid electricity purchases.
- Elevated risk of failed or delayed transactions during downstream partner outages or network interruptions.
- Tightly coupled partner integrations that slowed onboarding of new services.
- Limited fault isolation across user, payment, vending, and partner-integration workflows.
- Need for stronger operational visibility across transaction services.
- Need for improved auditability of AWS activity, system events, and operational changes.
- Need for controlled deployment and release management across environments.
- Need for resilience, backup, and recovery planning aligned to financial transaction workloads.
- Need for governance controls that could support production fintech operations.
Without a stronger Cloud Operations model, DigiSquad risked increased downtime, transaction failures, slower release cycles, weaker partner confidence, and difficulty scaling under high-volume prepaid vending demand.
Cloud Operations Capabilities Delivered
Tati Software delivered a customer-deployed AWS Cloud Operations solution to support DigiMali’s production digital payments and VAS workload. The solution was designed to improve centralized operations management, production reliability, deployment control, observability, auditability, transaction resilience, and governance.
The Cloud Operations capabilities delivered included:
- Centralized AWS account governance using AWS Organizations and AWS Control Tower to support controlled multi-environment management.
- Environment separation across production and non-production environments to support safe development, testing, validation, and release preparation.
- Centralized billing and account-level visibility through the AWS Organizations account structure.
- Preventive governance controls using Service Control Policies and AWS Control Tower guardrails where applicable.
- Detective governance controls using AWS Config rules and AWS Control Tower detective controls where applicable.
- Centralized compliance visibility using AWS Config aggregation and compliance dashboards where applicable.
- Centralized audit logging using AWS CloudTrail to capture API activity, role assumptions, user actions, and infrastructure changes.
- Infrastructure as Code deployment using AWS CloudFormation / AWS CDK patterns to support repeatable provisioning and controlled infrastructure changes.
- Containerised workload operations using Amazon ECS and AWS Fargate for independently scalable transaction, payment, user-management, and partner-integration services.
- Event-driven operational workflows using Amazon EventBridge, Amazon SQS, Amazon Kinesis Data Streams, and AWS Lambda to support durable processing, retry handling, and reduced failure impact during partner or downstream service interruptions.
- Managed transactional persistence using Amazon Aurora PostgreSQL to support resilient financial transaction storage.
- Low-latency caching using Amazon ElastiCache for Redis to reduce pressure on the primary database and improve user-facing responsiveness.
- Centralized monitoring and observability using Amazon CloudWatch and Datadog for logs, dashboards, metrics, and operational telemetry.
- Incident response and escalation workflows supported by operational logging, technical triage, Jira remediation tracking, and post-incident review.
- Controlled deployment and change management through source-controlled changes, staged releases, validation gates, and rollback-aware operational processes.
- Backup and recovery planning using protected data assets, infrastructure artefacts, deployment definitions, and controlled restoration procedures.
- Secure network isolation using Amazon VPC, segmented network design, and controlled internal service communication.
These capabilities enabled DigiMali to operate as a governed, auditable, and production-grade AWS workload. The implementation improved transaction reliability, supported national-scale prepaid vending demand, reduced transaction-related incidents, and gave DigiSquad a stronger operational foundation for future growth.
Centralized AWS Governance and Account Management
Tati Software implemented a centralized AWS operations governance model to support controlled, secure, and repeatable management of the DigiMali workload.
The environment is managed using AWS Organizations and AWS Control Tower, with workload separation across production and non-production environments. This structure enables centralized governance while maintaining isolation between environments used for live operations, validation, and release preparation.
The centralized governance model supports:
- Account structure and environment separation through AWS Organizations.
- Baseline account governance through AWS Control Tower.
- Preventive controls using Service Control Policies and AWS Control Tower guardrails where applicable.
- Detective controls using AWS Config rules and AWS Control Tower detective controls where applicable.
- Centralized configuration and compliance visibility through AWS Config aggregation.
- Centralized auditability through AWS CloudTrail.
- Centralized security posture review through AWS Security Hub where applicable.
- Centralized billing and account-level cost visibility through AWS Organizations.
- Controlled infrastructure provisioning through Infrastructure as Code.
- Operational monitoring through Amazon CloudWatch and Datadog.
- Structured incident response and remediation workflows.
This ensures that DigiMali is not operated as an isolated AWS deployment, but as a governed production workload with consistent operational controls for security, compliance, auditability, deployment, monitoring, and support.
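As an illustration of a preventive control of the kind listed above, the sketch below builds a Service Control Policy document that denies requests made outside a single approved region. It is not DigiMali's actual policy. The choice of a region restriction, the region code af-south-1 for AWS Africa (Cape Town), the statement ID, and the list of exempted global services are all assumptions made for this example.

```python
import json

# Assumption: the approved region is AWS Africa (Cape Town).
APPROVED_REGION = "af-south-1"

# Assumption: an illustrative set of global services that are commonly
# exempted because their endpoints are not hosted in the approved region.
EXEMPT_ACTIONS = [
    "iam:*",
    "organizations:*",
    "sts:*",
    "support:*",
    "route53:*",
    "cloudfront:*",
]


def build_region_restriction_scp(region: str, exempt_actions: list) -> dict:
    """Return an SCP document that denies non-exempt actions outside `region`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideApprovedRegion",  # hypothetical statement ID
                "Effect": "Deny",
                "NotAction": exempt_actions,
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": [region]}
                },
            }
        ],
    }


policy = build_region_restriction_scp(APPROVED_REGION, EXEMPT_ACTIONS)
print(json.dumps(policy, indent=2))
```

A policy like this would normally be attached through AWS Organizations to the organizational unit that holds the workload accounts, and tested in a non-production account first, because an overly broad deny can block legitimate global-service calls.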
Proposed Solution and Architecture
Tati Software re-architected DigiMali into a modern AWS-based platform using microservices, event-driven workflows, and managed AWS services.
The platform was separated into independently deployable service domains, including:
- User and account services.
- Payment processing services.
- Transaction and vending workflows.
- Partner-integration services.
- Reporting and audit workflows.
- Operational monitoring and support workflows.
The architecture uses AWS services including:
- Amazon ECS with AWS Fargate for containerised microservices.
- Amazon Aurora PostgreSQL for transactional financial data.
- Amazon ElastiCache for Redis for caching hot data and reducing database pressure.
- Amazon EventBridge for decoupled event routing.
- Amazon SQS for durable asynchronous processing, retries, and dead-letter queue handling.
- Amazon Kinesis Data Streams for high-volume transaction event and telemetry streams.
- AWS Lambda for lightweight event-driven tasks such as notifications, webhooks, and background processing.
- Amazon S3 for reports, archives, operational records, and audit artefacts.
- Amazon VPC for secure network isolation and controlled service communication.
- AWS CloudTrail for AWS API audit logging and operational traceability.
- AWS Config for resource configuration visibility and compliance tracking where applicable.
- Amazon CloudWatch and Datadog for monitoring, dashboards, logs, metrics, and operational investigation.
- AWS CloudFormation / AWS CDK for repeatable Infrastructure as Code deployment patterns.
- IAM and AWS Secrets Manager for access control and secure configuration.
This architecture supports transaction resilience, independent service scaling, operational visibility, controlled deployments, and faster partner onboarding while reducing the operational risk associated with tightly coupled service dependencies.
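The role of the cache in reducing database pressure can be sketched with a cache-aside lookup. The example is a simplified stand-in, not the platform's implementation: an in-memory dictionary takes the place of Amazon ElastiCache for Redis, and the class name, TTL value, and loader function are hypothetical. It also assumes that only read-mostly data is cached, since authoritative transaction state belongs in the transactional database.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """Cache-aside lookup with a time-to-live, using a dict in place of Redis."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (expiry, value)

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the database is not touched
        value = loader(key)  # cache miss or expired entry: read from the source
        self._store[key] = (now + self._ttl, value)
        return value


# Hypothetical loader standing in for a database query.
db_reads = []


def load_from_database(key: str) -> str:
    db_reads.append(key)
    return f"value-for-{key}"


cache = CacheAside(ttl_seconds=60)
cache.get_or_load("tariff:basic", load_from_database)
cache.get_or_load("tariff:basic", load_from_database)
print(len(db_reads))  # repeated reads within the TTL reach the database once
```

The TTL bounds how stale a cached value can become, which is the main trade-off to tune when a cache sits in front of a transactional store.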
Event-Driven Operations and Transaction Resilience
DigiMali’s transaction workflows require high reliability because user purchases must be recorded, processed, and fulfilled correctly, even when external partners or downstream services experience interruptions.
Tati Software introduced an event-driven operating model to reduce tight coupling and improve resilience. Services publish and consume events through a central event architecture, while asynchronous workflows use durable queues and retry logic.
This operating model supports:
- Reduced blast radius when external partners experience outages.
- Durable retry handling for asynchronous transaction workflows.
- Dead-letter queue patterns for failed or exceptional processing.
- Improved service isolation across payment, vending, and partner-integration components.
- More controlled onboarding of new partners and services.
- Improved operational visibility into transaction flow and failure points.
This directly improved the platform’s ability to remain reliable under high-volume and failure-prone conditions.
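The retry and dead-letter behaviour described above can be sketched in a few lines. This is a simplified in-memory model, not the production code: a deque stands in for an Amazon SQS queue, and the Message class, the drain function, and the max_receive_count parameter are hypothetical names that loosely mirror a queue redrive policy.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple


@dataclass
class Message:
    body: str
    receive_count: int = 0  # number of times a consumer has received the message


def drain(
    queue: Deque[Message],
    handler: Callable[[Message], None],
    max_receive_count: int = 3,
) -> Tuple[List[Message], List[Message]]:
    """Process messages until the queue is empty.

    A message whose handler raises is put back for another attempt. Once it
    has been received max_receive_count times without success, it is moved
    to the dead-letter list instead of being retried again.
    """
    processed: List[Message] = []
    dead_letter: List[Message] = []
    while queue:
        msg = queue.popleft()
        msg.receive_count += 1
        try:
            handler(msg)
            processed.append(msg)
        except Exception:
            if msg.receive_count >= max_receive_count:
                dead_letter.append(msg)  # isolate for manual investigation
            else:
                queue.append(msg)  # retry later without blocking other messages
    return processed, dead_letter


# Hypothetical partner call that always fails for one specific purchase.
def call_partner(msg: Message) -> None:
    if msg.body == "purchase-002":
        raise ConnectionError("partner unavailable")


pending = deque([Message("purchase-001"), Message("purchase-002")])
ok, failed = drain(pending, call_partner)
print([m.body for m in ok], [m.body for m in failed])
```

The point of the pattern is that a failing message does not block the messages behind it, and that a message which keeps failing ends up in a separate place where operators can inspect it. A real consumer would also need to be idempotent, because a retried message may already have been partly processed.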
Monitoring, Logging, and Incident Response
Tati Software implemented centralized monitoring, logging, and incident response processes to support the production DigiMali workload.
Monitoring and logging are supported through Amazon CloudWatch, Datadog, AWS CloudTrail, structured service logs, and operational dashboards. These tools provide visibility into system health, application behaviour, transaction workflows, infrastructure events, API activity, and operational issues.
When incidents occur, Tati Software follows a structured operational process:
- Intake and evidence capture.
- Initial triage by operations or support teams.
- Escalation to technical teams where required.
- Log-based investigation using application, transaction, infrastructure, and AWS telemetry.
- Remediation tracking through Jira for confirmed defects or engineering changes.
- Controlled release through development, staging, validation, and production deployment.
- Closure, documentation, and post-incident review.
Where manual investigation does not immediately identify the cause of an issue, Datadog-supported log analysis can be used to identify recurring patterns, unusual behaviour, and related service activity. Corrective actions remain subject to engineering review, operational procedures, and normal change controls.
This operational model supports faster investigation, clearer ownership, and continuous improvement after incidents.
Backup, Recovery, and Resilience
Tati Software implemented resilience and recovery planning aligned to DigiMali’s business and regulatory context.
For DigiMali, recovery is designed around a backup-and-restore strategy within AWS Africa (Cape Town), aligned to South African data residency requirements. The platform does not use cross-region failover for DigiMali because backup storage, restoration, rebuild, and recovery operations must remain within the approved regional boundary.
Recovery planning includes:
- Protected database backups and snapshots.
- S3-stored operational records and artefacts.
- Infrastructure as Code artefacts.
- Deployment definitions.
- Application source code.
- Required configuration and secrets.
- Controlled infrastructure rebuild procedures.
- Service restoration from protected data sources.
For DigiMali, the documented target recovery objectives are:
- RPO: less than one hour for protected data assets.
- RTO: approximately thirty minutes for restoration of critical services, provided AWS Africa (Cape Town) remains operational.
This approach provides a practical balance between resilience, data residency, simplicity, and cost.
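One way to reason about the RPO target is to look at the largest gap between consecutive recovery points, since that gap is the most data that could be lost if a failure occurred just before the next backup. The sketch below checks a list of backup timestamps against a one-hour objective. The function names and the sample timestamps are hypothetical, and the schedule shown is not DigiMali's actual backup schedule.

```python
from datetime import datetime, timedelta, timezone
from typing import List

RPO = timedelta(hours=1)  # documented target: less than one hour


def worst_case_data_loss(backup_times: List[datetime], now: datetime) -> timedelta:
    """Return the largest gap between consecutive backups, including the
    gap between the most recent backup and `now`."""
    if not backup_times:
        raise ValueError("at least one backup timestamp is required")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times: List[datetime], now: datetime, rpo: timedelta = RPO) -> bool:
    """True when no gap between recovery points reaches the RPO."""
    return worst_case_data_loss(backup_times, now) < rpo


# Hypothetical recovery points taken every 45 minutes.
start = datetime(2024, 1, 31, 10, 0, tzinfo=timezone.utc)
backups = [start + timedelta(minutes=45 * i) for i in range(3)]
now = start + timedelta(hours=2)
print(worst_case_data_loss(backups, now), meets_rpo(backups, now))
```

A check like this only covers the RPO side. The RTO target depends on how long a rebuild and restore actually takes, which is best confirmed by timed restoration exercises.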
Metrics for Success
KPI 1: Operational Support Efficiency
Baseline:
Before modernisation, support teams had limited centralized visibility into transaction flow, partner integrations, infrastructure health, and service failures, which made investigations more manual and time-consuming.
Target:
Reduce investigation effort through centralized monitoring, logs, dashboards, alerts, and transaction traceability.
Measured Result:
Routine investigation effort for standard operational incidents was reduced by approximately 50–60% after implementation, as teams could use centralized logs, dashboards, queue status, retry records, and infrastructure telemetry to isolate issues faster.
Measurement Method:
Comparison of pre-modernisation manual investigation processes against post-modernisation support records, incident notes, monitoring dashboards, and log-based investigation activity.
Business Impact:
Support teams could triage issues faster, reduce operational overhead, and respond more effectively to customer-impacting events.
KPI 2: Transaction Reliability / Incident Reduction
Baseline:
Before the modernisation, the platform had elevated risk of failed, delayed, duplicated, or unresolved transactions during partner outages, API timeouts, interrupted processing flows, and high-load periods. Transaction-related support incidents required manual investigation and created operational overhead for both business and technical teams.
Target:
Reduce transaction-related production incidents by improving event-driven processing, retry handling, queue-based recovery, monitoring, and operational visibility.
Measured Result:
After implementation, transaction-related incident tickets were reduced by an estimated 80%, based on comparison of pre-modernisation and post-modernisation support records. Platform availability was maintained above 99.9% during monitored production periods, including high-load windows.
Measurement Method:
Tati Software compared production support tickets, failed transaction counts, transaction exception records, retry outcomes, and uptime monitoring data before and after implementation. Incident reduction was measured by reviewing transaction-related support cases and operational logs across comparable production periods.
Business Impact:
DigiMali improved transaction integrity, reduced customer-impacting incidents, reduced operational investigation effort, strengthened partner confidence, and improved reliability for financial transaction workflows.
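The two headline figures in KPI 2 reduce to simple arithmetic, shown below as a sketch. The ticket counts are hypothetical placeholders chosen only to illustrate the calculation, since the underlying support records are not reproduced here. The function names and the 30-day window are also assumptions.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage reduction in incident tickets between two comparable periods."""
    if before <= 0:
        raise ValueError("the baseline count must be positive")
    return (before - after) / before * 100


def downtime_budget_minutes(availability_pct: float, window_days: int) -> float:
    """Maximum downtime, in minutes, that still meets an availability level."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


# Hypothetical ticket counts for two comparable production periods.
tickets_before, tickets_after = 50, 10
print(round(reduction_pct(tickets_before, tickets_after), 1))

# Downtime allowed by a 99.9% availability level over an assumed 30-day window.
print(round(downtime_budget_minutes(99.9, 30), 1))
```

Read this way, availability above 99.9% over a 30-day window means less than about 43 minutes of total downtime in that window.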
Outcomes
The DigiMali Cloud Operations implementation enabled DigiSquad to operate a more resilient, scalable, and auditable digital payments platform. The solution improved transaction reliability, supported national-scale demand, strengthened partner integration resilience, improved production visibility, and reduced the operational burden of managing a high-volume fintech workload.
Lessons Learned
The engagement reinforced the importance of designing for decoupling, observability, and retry handling from the beginning. Event-driven workflows, durable queues, managed data services, and controlled release pipelines reduced operational risk and improved service continuity. For future transaction-heavy workloads, Tati Software continues to prioritise infrastructure automation, failure-mode validation, centralized telemetry, and operational runbooks before production scale-up.