Available for opportunities

Aman Kumar Sahu

Building intelligent data platforms that power next-gen AI applications. MS Data Science @ UMD (3.73 GPA) · 2+ years of experience · $1.2M+ in cost savings · 3× Hackathon Winner

2+ Years Experience
3.73 MS GPA
3 Hackathons Won
$1.2M+ Cost Savings
50+ Pipelines Built
1B+ Records Processed

About Me

Passionate Data Engineer building scalable data pipelines and intelligent AI systems

Who I Am

I'm a passionate Data Engineer and AI/ML Specialist with a Master's degree in Data Science from the University of Maryland (3.73 GPA) and over 2 years of hands-on experience building scalable data pipelines and intelligent systems.

My expertise spans the entire data engineering lifecycle, from architecting robust ETL pipelines that process billions of records to implementing cutting-edge machine learning solutions. I've delivered measurable impact, including $1.2M+ in cost savings, and maintained 99.9% system uptime.

I'm driven by solving complex technical challenges and leveraging AI to create innovative solutions. Whether it's optimizing data infrastructure at scale or building next-generation ML applications, I thrive on pushing the boundaries of what's possible with data and artificial intelligence.

Professional Impact

Built 50+ production Hadoop data pipelines processing 10TB+ daily data with 99.9% uptime at Tata Consultancy Services (PNC Bank)

Education

MS in Data Science from University of Maryland (3.73 GPA). BTech in Mechatronics Engineering with Robotics specialization from SRM Institute (8.91 CGPA)

Achievements

3× Hackathon Winner ($2,500 First Place at AI & Food Insecurity Competition, featured on CBS News)

Experience & Education

My professional journey building enterprise-scale AI/ML systems

LLM/AI Engineer Intern

Connyct

May 2025 – Aug 2025 (Current)
College Park, MD

Architecting AWS-native event-driven pipelines and multi-agent RAG systems

  • Architected an AWS-native event-driven pipeline (EventBridge, SQS, Step Functions) with Lambda processing, achieving a 38% P95 latency reduction and a 99.5% uptime SLA
  • Designed a DynamoDB single-table schema with a GSI, eliminating 72% of duplicates, saving 40+ hrs/month, and reducing costs by 65%
  • Migrated scrapers to auto-scaling EC2 with CloudWatch monitoring, increasing the success rate from 78% to 96% with 3× throughput
  • Pioneered a multi-agent RAG chatbot using LangChain, Pinecone vector DB, and GPT-4, achieving a 25% accuracy boost and <2s response time
AWS · Lambda · DynamoDB · LangChain · Pinecone · GPT-4 · FastAPI

Data Engineer

Tata Consultancy Services (PNC Bank)

Aug 2022 – Sept 2024 (2 years)
Pittsburgh, PA

Built enterprise-scale data pipelines and MLOps infrastructure for banking systems

  • Built 50+ production Hadoop data pipelines using PySpark, processing 10TB+ of data daily with 99.9% uptime and full regulatory compliance
  • Automated 100+ ETL workflows using Airflow DAGs, reducing manual effort by 70% and speeding up incident response by 85%
  • Optimized PySpark jobs with broadcast joins and Jenkins CI/CD, achieving a 40% performance improvement and $50K+ in cost savings
  • Embedded end-to-end data lineage tracking with Apache Atlas, achieving 35% faster reporting cycles with a 100% audit trail
  • Mentored 4 junior engineers, achieving 50% faster onboarding
Spark · Hadoop · Hive · Kafka · Airflow · Python · Jenkins · Docker · PySpark
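
For readers unfamiliar with the broadcast-join optimization mentioned above, here is a minimal sketch of the idea in plain Python. It is an illustration only, not the production PySpark code: the function name, the partition layout, and the sample branch data are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def broadcast_hash_join(
    large_partitions: Iterable[List[Tuple[str, float]]],
    small_table: List[Tuple[str, str]],
) -> Iterator[Tuple[str, float, str]]:
    """Inner-join every partition of the large side against one in-memory copy of the small side."""
    # "Broadcast" step: build the lookup once and reuse it for every partition,
    # so the rows of the large side never have to be shuffled by key.
    lookup: Dict[str, str] = dict(small_table)
    for partition in large_partitions:
        for key, amount in partition:
            if key in lookup:  # inner join: drop rows without a match
                yield key, amount, lookup[key]


# Hypothetical sample data: two partitions of transactions and a small branch table.
transactions = [[("B1", 10.0), ("B2", 5.0)], [("B1", 7.5), ("B9", 1.0)]]
branches = [("B1", "Pittsburgh"), ("B2", "Cleveland")]

joined = list(broadcast_hash_join(transactions, branches))
```

In PySpark the same effect is requested by wrapping the small DataFrame in the `broadcast()` hint from `pyspark.sql.functions` before joining. It pays off only when the small side fits comfortably in executor memory.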

Master of Science in Data Science

University of Maryland, College Park

Aug 2024 – May 2026 (Expected)
College Park, MD

GPA: 3.73/4.0

  • Advanced Machine Learning & Deep Learning
  • Big Data Systems & Cloud Computing Architecture
  • Natural Language Processing & MLOps
  • Data Engineering at Scale & Model Deployment

Bachelor of Technology in Mechatronics Engineering (Robotics Specialization)

SRM Institute of Science and Technology

2018 – 2022
Kattankulathur, Chennai, India

CGPA: 8.91/10.0

  • Robotics & Control: Theory, Practice, Applied Robotics
  • AI for Robotics & Vision, Computer Vision & Advanced CV
  • Embedded Systems: Raspberry Pi, Microcontrollers, Digital Systems
  • Machine Learning, Numerical Methods, Programming for Problem Solving

Featured Projects

Production-ready AI/ML systems across Technology, Banking, and Retail domains

Technology

Enterprise Multi-Tenant AI Platform

Production-grade SaaS platform with Kubernetes orchestration, tenant isolation via PostgreSQL schemas, real-time Kafka event streaming (100K+ events/sec), ML-driven resource optimization ($100K annual savings), and GPT-4 powered support assistant (40% ticket deflection). Features auto-scaling microservices (FastAPI), Snowflake analytics, Apache Flink real-time processing, MLflow model tracking, and comprehensive observability with Grafana. Reduces tenant onboarding from 4 hours to 10 minutes through automated provisioning.

Impact: 100K+ events/sec, 4hrs → 10min onboarding, $100K savings
Kubernetes · Kafka · Snowflake · PostgreSQL · +9 more
Technology

Intelligent MLOps Platform

End-to-end ML infrastructure reducing model deployment from 3 weeks to 2 days (90% improvement) supporting 40+ production models. Built with Kubeflow Pipelines for orchestration, MLflow experiment tracking, Feast feature store for real-time serving, TensorFlow Serving for inference, and GPT-4 powered code generation (85% accuracy). Features automated CI/CD with A/B testing, drift detection, model versioning, and comprehensive monitoring via Prometheus/Grafana. Includes LLM-assisted debugging and training cost optimization saving $80K annually.

Impact: 3 weeks → 2 days deployment, 40+ models, $80K savings
Kubeflow · MLflow · Feast · SageMaker · +9 more
Technology

Real-Time Observability Intelligence Platform

High-throughput monitoring platform processing 500K+ events/minute with <1s latency, reducing infrastructure costs by 65% vs Datadog ($180K → $63K annually). Built on ClickHouse for hot data storage, Vector log aggregator for ingestion, Kafka buffering, and Apache Flink stream processing. Features LSTM-based anomaly detection (92% precision), GPT-4 powered root cause analysis (<30s), automated log summarization via RAG, and LangChain alert explanation. Includes pattern mining, forecasting engine, custom Grafana dashboards, and PagerDuty integration. Reduces MTTR by 40%.

Impact: 500K/min, <1s latency, 65% cost savings, 40% faster MTTR
ClickHouse · Vector · Kafka · Apache Flink · +8 more
Banking

Real-Time Fraud Detection System

Advanced fraud detection system combining graph neural networks with ensemble machine learning processing 10M+ daily transactions with <100ms latency. Built on Neo4j graph database modeling entity relationships (accounts, merchants, devices) with GNN detecting suspicious patterns and community structures. Ensemble architecture combines XGBoost gradient boosting, Random Forest, and LightGBM with GPT-4 analyzing transaction narratives for anomaly detection. Real-time Kafka streaming with Apache Flink CEP (Complex Event Processing) for pattern matching and rule engines. Redis-backed feature store serving 200+ engineered features with sub-10ms lookup. ML-powered alert prioritization reducing false positives by 78% and analyst workload by 60%. Adaptive models retrained daily on labeled fraud cases achieving 95% precision, 89% recall, preventing $2M+ annual fraud losses.

Impact: $2M+ fraud prevented, 95% precision
Kafka · Flink · XGBoost · Neo4j · +2 more
Banking

Responsible AI Credit Platform

EU AI Act compliant credit scoring platform serving 15K+ customers with full explainability and fairness guarantees achieving <3% default rate. Core LightGBM model trained on 80+ features (credit history, income, debt ratios, behavioral patterns) with SHAP TreeExplainer generating individual-level and global feature importance explanations for regulatory compliance. Fairlearn integration enforces demographic parity constraints across protected attributes (gender, ethnicity, age) ensuring equitable outcomes. GPT-4 powered natural language explanation system translating SHAP values into customer-friendly justifications. Automated fairness auditing pipeline detecting bias drift with Aequitas toolkit, generating compliance reports for EU AI Act Article 13-15 requirements. Aurora PostgreSQL storing audit trails with complete model lineage and decision provenance. Real-time scoring API with 250ms P95 latency, A/B testing framework for champion/challenger models, and automated retraining triggering on AUC degradation >2%. Reduces manual review time by 70% while maintaining strict fairness standards.

Impact: 15K+ customers, <3% default
LightGBM · SHAP · Fairlearn · GPT-4 · +1 more
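
To make the demographic parity constraint above concrete, here is a minimal sketch of the metric itself: the gap between the highest and lowest approval rate across groups. It is written in plain Python for illustration and is not the platform's Fairlearn pipeline; the function names, group labels, and sample decisions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence


def selection_rates(approved: Sequence[int], group: Sequence[str]) -> Dict[str, float]:
    """Share of positive (approved = 1) decisions within each group."""
    totals: Dict[str, int] = defaultdict(int)
    positives: Dict[str, int] = defaultdict(int)
    for decision, label in zip(approved, group):
        totals[label] += 1
        positives[label] += decision
    return {label: positives[label] / totals[label] for label in totals}


def demographic_parity_difference(approved: Sequence[int], group: Sequence[str]) -> float:
    """Largest gap in selection rate between any two groups (0.0 means perfect parity)."""
    rates = selection_rates(approved, group)
    return max(rates.values()) - min(rates.values())


# Hypothetical sample: group "a" is approved 3 times out of 4, group "b" once out of 4.
decisions = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap = demographic_parity_difference(decisions, groups)  # 0.75 - 0.25 = 0.5
```

A fairness audit would typically compare this gap against a tolerance chosen by the compliance team and flag the model when the gap drifts above it.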
Banking

ML Transaction Reconciliation

Intelligent transaction reconciliation system processing 50M+ daily transactions across multiple payment rails (ACH, wire, card networks) achieving 99.2% automated match rate. Three-tier matching architecture: Level 1 exact matching on transaction IDs and amounts (85% match), Level 2 fuzzy matching using Levenshtein distance and phonetic algorithms for name/reference variations (12% match), Level 3 ML-based matching with XGBoost trained on 150+ engineered features including temporal patterns, amount clustering, merchant fingerprints (2% match). Apache Flink streaming processes real-time transaction feeds from Kafka topics with stateful windowing aggregations and CEP for pattern detection. GPT-4 powered exception handler analyzes remaining 0.8% unmatched cases, reasoning about data quality issues, missing information, and potential fraud, generating natural language explanations for manual review. Aurora PostgreSQL storing transaction states with optimistic locking for concurrent reconciliation workflows. Automated break analysis identifying systematic issues (missing feeds, format changes, timing shifts) with proactive alerting. Reduces reconciliation cycle from T+3 to T+0 (same-day), eliminates 95% manual effort, prevents $5M+ annual revenue leakage from unreconciled transactions.

Impact: 50M+ txns/day, 99.2% auto-match
Flink · XGBoost · GPT-4 · Aurora · +1 more
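
The first two matching tiers described above can be sketched as follows. This is a simplified illustration in plain Python, not the production system: the field names (`txn_id`, `amount`, `reference`), the edit-distance threshold of 2, and the sample ledger are assumptions, amounts are integer cents, and the phonetic and ML tiers are left out.

```python
from typing import Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def match_transaction(record: Dict, ledger: List[Dict], max_edits: int = 2) -> Tuple[Optional[Dict], str]:
    """Return (matched ledger entry, tier name); unmatched records go on to later tiers."""
    # Tier 1: exact match on transaction ID and amount.
    for entry in ledger:
        if entry["txn_id"] == record["txn_id"] and entry["amount"] == record["amount"]:
            return entry, "exact"
    # Tier 2: same amount and a reference string within max_edits edits.
    for entry in ledger:
        same_amount = entry["amount"] == record["amount"]
        distance = levenshtein(entry["reference"].lower(), record["reference"].lower())
        if same_amount and distance <= max_edits:
            return entry, "fuzzy"
    return None, "unmatched"


# Hypothetical ledger; amounts are integer cents to avoid float comparison issues.
ledger = [
    {"txn_id": "T1", "amount": 1250, "reference": "ACME CORP"},
    {"txn_id": "T2", "amount": 9900, "reference": "Jon Smith"},
]
exact = match_transaction({"txn_id": "T1", "amount": 1250, "reference": "ACME CORP"}, ledger)
fuzzy = match_transaction({"txn_id": "X7", "amount": 9900, "reference": "John Smith"}, ledger)
missing = match_transaction({"txn_id": "X8", "amount": 500, "reference": "Nobody"}, ledger)
```

Running the cheap exact tier first keeps the more expensive string comparisons for the minority of records that need them.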
Retail

Intelligent Customer 360 Platform

Unified customer data platform consolidating 50M+ profiles from 15+ data sources (web, mobile, POS, call center, email) achieving 85% identity resolution accuracy with real-time personalization. Probabilistic entity resolution using XGBoost trained on fuzzy matching features (name similarity, email patterns, address proximity, phone variations, device fingerprints) linking fragmented customer records across channels. Kafka streaming ingests 5M+ daily events with Apache Flink enrichment pipeline joining behavioral data (browsing, purchases, support tickets) in real-time. Snowflake data warehouse storing complete customer journey history with type-2 slowly changing dimensions for temporal analysis. Redis-backed profile cache serving unified customer views with <50ms latency to downstream systems (CRM, marketing automation, personalization engines). GPT-4 powered behavioral insights generating natural language customer summaries. ML-driven segmentation using K-means and RFM analysis, propensity scoring for cross-sell/upsell opportunities, and next-best-action recommendations. Increases marketing ROI by 45%, reduces customer service handle time by 30%, and improves personalization relevance by 62%.

Impact: 50M+ profiles, 85% accuracy
Kafka · Flink · XGBoost · GPT-4 · +2 more
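
As a small illustration of the type-2 slowly changing dimension pattern mentioned above, the sketch below keeps every historical version of a profile attribute and closes the old row when a new value arrives. It is plain Python rather than Snowflake SQL, and the class, field names, and sample emails are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ProfileVersion:
    customer_id: str
    email: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[ProfileVersion], customer_id: str, new_email: str, effective: date) -> List[ProfileVersion]:
    """Close the current version (if the value changed) and append a new open-ended one."""
    current = next(
        (v for v in history if v.customer_id == customer_id and v.valid_to is None),
        None,
    )
    if current is not None:
        if current.email == new_email:
            return history  # no change, so no new version is needed
        current.valid_to = effective  # close the old row instead of overwriting it
    history.append(ProfileVersion(customer_id, new_email, effective))
    return history


# Hypothetical example: one customer changes email once, then a no-op update arrives.
history: List[ProfileVersion] = []
apply_change(history, "C1", "old@example.com", date(2024, 1, 1))
apply_change(history, "C1", "new@example.com", date(2024, 6, 1))
apply_change(history, "C1", "new@example.com", date(2024, 7, 1))
```

Because old rows are closed rather than updated, a query can reconstruct the profile as it looked on any past date by filtering on `valid_from` and `valid_to`.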
Retail

AI Supply Chain Optimization

End-to-end supply chain optimization platform combining demand forecasting with inventory planning reducing stockouts by 62% and saving $1.5M annually. Hybrid forecasting ensemble blending Facebook Prophet (capturing seasonality, holidays, trends) with LSTM neural networks (learning complex non-linear patterns) across 50K+ SKUs and 200+ store locations. Features engineered from historical sales, promotions, weather data, local events, and competitor pricing with external data enrichment via APIs. Google OR-Tools constraint optimization solving multi-echelon inventory allocation problem balancing service levels (98% target), working capital constraints ($50M limit), and warehouse capacity (500K units). Kafka streaming real-time sales data triggering dynamic reforecasting when actuals deviate >15% from predictions. GPT-4 powered root cause analysis explaining forecast errors with automated alert generation. Multi-objective optimization considering trade-offs: minimize stockouts vs holding costs vs expedited shipping. Simulation engine testing what-if scenarios for promotional events, supply disruptions, and demand shocks. Reduces excess inventory by 35%, improves forecast accuracy from MAPE 28% → 12%, and increases inventory turnover ratio by 40%.

Impact: 62% stockout reduction, $1.5M savings
Prophet · LSTM · OR-Tools · GPT-4 · +1 more
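
The two forecast checks quoted above, MAPE and the 15% deviation trigger, can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and measuring the deviation relative to the prediction (rather than the actual) is an assumption, since the description does not say which denominator is used.

```python
from typing import Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error in percent; periods with zero actuals are skipped."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def needs_reforecast(actual: float, predicted: float, threshold: float = 0.15) -> bool:
    """True when the actual deviates from the prediction by more than `threshold`."""
    if predicted == 0:
        return actual != 0
    # Assumption: deviation is measured relative to the predicted value.
    return abs(actual - predicted) / abs(predicted) > threshold


# Hypothetical sales figures for two periods: each forecast is off by 10%.
error = mape([100.0, 200.0], [90.0, 220.0])
trigger_high = needs_reforecast(actual=120.0, predicted=100.0)  # 20% off
trigger_low = needs_reforecast(actual=110.0, predicted=100.0)   # 10% off
```

MAPE is easy to explain to business stakeholders but is unstable for slow-moving SKUs with near-zero sales, which is why zero actuals are skipped here.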
Retail

Dynamic Pricing & Optimization

Reinforcement learning pricing engine processing 10K+ pricing decisions hourly across 25K+ products increasing revenue by 12% ($3M annually) while maintaining brand positioning. Q-Learning agent trained on 2+ years historical data learning optimal pricing strategies balancing revenue maximization, inventory clearance, and competitive positioning. State space captures 80+ features: demand elasticity, competitor prices, inventory levels, seasonality, customer segments, and margin constraints. XGBoost surrogate model predicting demand response curves for fast policy evaluation during online serving. Scrapy-based competitive intelligence platform monitoring 50+ competitor websites hourly, extracting prices, promotions, stock availability with GPT-4 NLP analyzing promotional language and value propositions. Real-time pricing API with Redis caching serving personalized prices based on customer segment, browsing history, and cart abandonment propensity. Multi-armed bandit testing for exploration-exploitation trade-off avoiding local optima. Constraint satisfaction ensuring prices respect MAP (Minimum Advertised Price) agreements, margin floors (20% minimum), and psychological pricing rules (ending in .99). A/B testing framework measuring causal impact with difference-in-differences methodology. Reduces manual pricing effort by 90%, improves price competitiveness index by 25%, and increases conversion rate by 8%.

Impact: +12% revenue ($3M), 10K+ decisions/hr
XGBoost · Q-Learning · GPT-4 · Scrapy · +1 more
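
The pricing guardrails listed above (MAP agreements, a 20% margin floor, and prices ending in .99) can be sketched as a post-processing step on whatever price the agent proposes. This is a simplified illustration, not the production engine: the function name is hypothetical, prices are integer cents, margin is assumed to mean (price - cost) / price, and rounding up to the next .99 is an assumed policy.

```python
def apply_price_constraints(
    proposed_cents: int,
    unit_cost_cents: int,
    map_cents: int = 0,
    min_margin_pct: int = 20,
) -> int:
    """Clamp a proposed price to the margin floor and MAP, then round up to a .99 ending."""
    # Margin floor: (price - cost) / price >= m  is equivalent to  price >= cost / (1 - m).
    # Integer ceiling division keeps the arithmetic exact.
    margin_floor = -(-unit_cost_cents * 100 // (100 - min_margin_pct))
    floor_cents = max(margin_floor, map_cents)
    price = max(proposed_cents, floor_cents)
    # Round up (never down) to the next price ending in 99 cents,
    # so the floor computed above is still respected.
    price += 99 - (price % 100)
    return price


# Hypothetical product with a unit cost of $10.00.
kept = apply_price_constraints(proposed_cents=1850, unit_cost_cents=1000)    # 1899
raised = apply_price_constraints(proposed_cents=900, unit_cost_cents=1000)   # 1299
at_map = apply_price_constraints(proposed_cents=1500, unit_cost_cents=1000, map_cents=1999)  # 1999
```

Keeping the constraints outside the learning agent means a badly trained policy can never publish a price that breaks a contractual or margin rule.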

Hackathon Wins

3× Winner with $2,500+ in prizes building impactful AI solutions

AI & Food Insecurity Case Competition

$2,500 First Place

April 2025
University of Maryland × Capital Area Food Bank
Team TERPSTER (3 members: Aman, Dhanush, Supriya)

Featured on CBS News & UMD AI Media Day

Voice-First Multilingual AI Platform

Built an AI assistant supporting multiple languages, democratizing tech access for 56%+ non-English speakers. Featured on CBS News and showcased at UMD AI Media Day.

Achievement

Breaking barriers for 56%+ underserved populations

Twilio · AWS Lambda · LLMs · Google Maps API · Kafka

ServiceNow Knowledge Gap Challenge

$700 + Sony WH-1000XM4 Headphones (Winner)

October 2025
Knight Hacks VIII
4 members (642 participants)

Winner among 642 participants

Synapse: Multi-Agent AI Collaboration System

Neural network of specialized AI agents working together to solve complex problems through intelligent collaboration and orchestration.

Achievement

Neural network of specialized AI agents

ServiceNow · Multi-Agent Systems · LLMs · AI Orchestration

T. Rowe Price Investor Education Challenge

Portable Monitor (Winner)

November 2025
Technica 2025
2 members (Aman + Supriya, 434 participants)

Winner among 434 participants

GenAI Financial Literacy Platform

AI-powered financial education platform with personalized learning paths. Addressed 64% financial illiteracy gap among young adults.

Achievement

Addressing 64% financial illiteracy gap

GenAI · Financial Education · Accessibility · LLMs

Technical Arsenal

Mastery across 100+ cutting-edge technologies powering enterprise AI and data platforms

Core Languages

Python · SQL · Scala · Java · R · Bash

AI & Machine Learning

TensorFlow · PyTorch · Scikit-learn · OpenAI · Hugging Face · XGBoost · LightGBM · LangChain · LlamaIndex · Pandas · NumPy · SHAP · Fairlearn · Prophet · LSTM

MLOps & Experimentation

MLflow · Weights & Biases · Feast · DVC · Feature Store · Kubeflow · SageMaker · Vertex AI

Big Data & Streaming

Apache Spark · Apache Kafka · Apache Flink · Hadoop · Hive · Dask · Databricks · PySpark

Workflow Orchestration

Airflow · Prefect · Dagster · dbt · Step Functions

Data Formats & Lakes

Parquet · Avro · Delta Lake · Iceberg

Cloud Platforms

GCP · Docker · Kubernetes · Terraform · AWS · S3 · Lambda · EMR · Glue · Redshift · Athena · Kinesis · BigQuery

Databases

PostgreSQL · MySQL · MongoDB · Redis · Neo4j · Snowflake · ClickHouse · Pinecone · Weaviate · Milvus

DevOps & CI/CD

Git · GitHub · GitHub Actions · Jenkins · GitLab · Grafana · Prometheus · Datadog · Elasticsearch

EXTENDED TOOLKIT

Additional frameworks, libraries, and tools in active use

MLflow · Weaviate · Datadog · gRPC · FastAPI · Flask · GraphQL · Twilio · Scrapy · Selenium · Celery · RabbitMQ · Superset · Tableau

Get In Touch

Open to opportunities in Data Engineering, MLOps, and AI/ML Platform roles

Let's Build Something Amazing

Whether you have a project in mind, want to discuss opportunities, or just want to connect, I'd love to hear from you. Feel free to reach out through any of the channels below.