
Jemael Nzihou – Data Science Portfolio

Research, analytics, and applied data science projects

πŸ“Š Data Science Projects

πŸ” AI-Powered Cyber Risk Scoring Engine

πŸ“Œ Overview

This project develops a machine learning-based cyber risk scoring engine that classifies enterprise assets into Low, Medium, High, and Critical risk categories.

It integrates cybersecurity, governance, and data science to move beyond traditional spreadsheet-based risk assessments toward predictive, data-driven decision-making.


🎯 Business Problem

Organizations often rely on manual and subjective methods to assess cyber risk, which are:

This project demonstrates how machine learning can improve:


🧠 Objectives


πŸ“Š Features Used


βš™οΈ Model

A Random Forest Classifier was used for multiclass classification.
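A minimal sketch of this setup is shown below. The feature names (`days_since_last_patch`, `open_vulnerabilities`, `control_coverage_score`) and the synthetic labels are illustrative stand-ins, not the project's actual dataset or columns:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([
    rng.integers(0, 365, n),   # days_since_last_patch (hypothetical feature)
    rng.integers(0, 50, n),    # open_vulnerabilities (hypothetical feature)
    rng.uniform(0, 1, n),      # control_coverage_score (hypothetical feature)
])
# Derive a 4-level label (0=Low .. 3=Critical) from a simple risk score,
# purely so the multiclass setup can be demonstrated end to end.
score = X[:, 0] / 365 + X[:, 1] / 50 + (1 - X[:, 2])
y = np.digitize(score, np.quantile(score, [0.25, 0.5, 0.75]))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"Holdout accuracy: {clf.score(X_te, y_te):.2f}")
```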


πŸ“ˆ Model Performance

Key Observations:


πŸ” Key Findings

The most influential drivers of cyber risk include:

These findings confirm that:

Delayed remediation, weak control posture, and unresolved vulnerabilities significantly increase cyber risk exposure.


βš™οΈ Practical Impact (GRC)

This model enables:


πŸ’Ό Executive Value

This solution supports:

It demonstrates how AI can transform cybersecurity from reactive to proactive and predictive.


⚠️ Limitations

Future Improvements:


πŸ› οΈ Tech Stack


πŸš€ Future Enhancements


πŸ”— Keywords

Cybersecurity, GRC, Risk Management, Machine Learning, Data Science, AI, Governance, Compliance, Cyber Risk, Predictive Analytics


📉 Telco Customer Churn — End-to-End Decision Intelligence System

Designing a production-grade churn system that converts ML signals into revenue-preserving decisions

This project demonstrates how modern data science is applied inside real companies: from ambiguous business problems to clear, defensible actions.

Unlike tutorial projects, this system emphasizes:


🎯 Problem Statement

Subscription businesses lose millions annually to churn. The challenge is not predicting churn, but deciding:

This project operationalizes churn management using a decision framework aligned with how teams at Google, Meta, Amazon, Netflix, and Microsoft work.


🧠 System Architecture (High Level)

Raw Customer Data
        ↓
Business EDA (Phase 1)
        ↓
Behavioral Segmentation (Phase 2)
        ↓
Churn Modeling (Phase 3)
        ↓
Decision Layer (Phase 4)
        ↓
Retention Actions (CRM / Ops Ready)

Core decision logic:

Segment × Churn Risk × Customer Value
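A minimal sketch of that decision logic, where thresholds and action names are illustrative choices rather than the project's actual rules:

```python
# Combine churn risk and customer value into a retention action.
# The cutoffs below are hypothetical, chosen only for illustration.
def retention_action(churn_prob: float, monthly_value: float) -> str:
    high_value = monthly_value >= 70.0   # hypothetical value cutoff ($/month)
    high_risk = churn_prob >= 0.5        # hypothetical churn-risk cutoff
    if high_value and high_risk:
        return "aggressive_retention"    # revenue at risk
    if high_value:
        return "loyalty_rewards"         # core revenue base
    if high_risk:
        return "minimal_spend"           # poor retention ROI
    return "maintain"                    # stable, low margin

print(retention_action(churn_prob=0.72, monthly_value=95.0))
# -> aggressive_retention
```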

πŸ” Phase 1 β€” Business EDA

Understand churn as an economic problem

Key Visuals

πŸ“Š Overall Churn Rate

churn_rate

~27% churn → material revenue risk requiring targeted intervention

πŸ“Š Churn by Contract Type

Churn by Contract

Month-to-month customers churn 3–4× more than long-term contracts

πŸ“Š Churn by Tenure Band

churn by tenure

Highest churn occurs in the first 6–12 months

πŸ“Š Churn by Internet Service

churn by service

Fiber optic users show elevated churn → expectation gap


👥 Phase 2 — Behavioral Segmentation

Move from "all customers" to decision-ready personas

πŸ“Š Elbow Method for K-Means

elbow_method

Four stable, interpretable segments selected
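The elbow method itself can be sketched in a few lines; the two synthetic features below stand in for the actual customer attributes (tenure, spend, usage):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with four well-separated groups, standing in for customers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 1.0, size=(100, 2))
               for loc in ([0, 0], [5, 0], [0, 5], [5, 5])])

# Fit K-Means for a range of k and record inertia (within-cluster SSE).
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases with k; the "elbow" is where the drop flattens.
for k, inertia in zip(range(1, 9), inertias):
    print(k, round(inertia, 1))
```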

👥 Customer Segmentation — K-Means (PCA Projection)

kmeans_pca_projection

Customers cluster into four distinct behavioral personas based on tenure, spend, and service usage.

🎯 Segment Centers — K-Means with Centroids

kmeans pca centroids

Centroids represent the behavioral "center" of each segment, enabling stable personas and consistent downstream decision-making.

πŸ“Š Segment Profiles

segment profiles

| Segment | Churn | Value | Business Meaning |
| --- | --- | --- | --- |
| High value + Low churn | Low | Very high | Core revenue base |
| High value + High churn | High | Moderate | Revenue at risk |
| Low value + High churn | High | Low | Poor ROI |
| Low value + Low churn | Low | Low | Stable, low margin |

🤖 Phase 3 — Churn Modeling

Predict churn with explainability

πŸ“Š Global Model Performance

global model metrics

πŸ“Š Top Churn Drivers

churn drivers

Increases churn

Reduces churn

📌 The model explains what to fix, not just who might leave.


🎯 Phase 4 — Decision Layer

From prediction → action

πŸ“Š Segment Strategy Matrix

segment strategy

| Segment Type | Action |
| --- | --- |
| High value + High churn | Aggressive retention |
| High value + Low churn | Loyalty rewards |
| Low value + High churn | Minimal spend |
| Low value + Low churn | Maintain |

πŸ“Š Customer-Level Retention Priority

top retention targets

Each customer receives:

This output is CRM-ready.


🧠 Impact

This system enables leadership to reduce churn while protecting margin, by acting only where ROI is positive.

What this project demonstrates:

🧠 Why This Matters

Most churn projects stop at "who might churn." This project answers "who should we act on, and why."

That distinction is what separates academic ML from production data science.

🏁 Final Note

This project mirrors how churn analytics is built and deployed in real organizations — combining modeling, segmentation, and business decision-making into one system.


🧠 Project 1 — Physics-Informed Neural Networks (PINNs)

Heat Transfer Modeling in Chemical Reactors


πŸ” Overview

This project implements a Physics-Informed Neural Network (PINN) to model transient heat diffusion in a chemical reactor using first-principles physics embedded into a neural network.

Rather than relying purely on data, the model enforces the 1D heat equation during training, enabling physically consistent predictions even with sparse or noisy measurements — a key requirement for engineering systems and digital twins.


🎯 Objectives


πŸ“ Governing Physics (Engineering Form)

Heat diffusion (1D, transient):

∂T/∂t = α · ∂²T/∂x²

Where T is temperature, t is time, x is spatial position, and α is the thermal diffusivity.

Boundary conditions:

T(0,t) = 0
T(1,t) = 0

Initial condition:

T(x,0) = sin(πx)

Analytical validation solution:

T(x,t) = exp(-α·π²·t) · sin(πx)
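The analytical solution above can be sanity-checked numerically: substituting it into the PDE with central finite differences should give a residual near zero. A quick validation sketch, independent of the PINN itself:

```python
import numpy as np

# The analytical solution T(x,t) = exp(-a*pi^2*t) * sin(pi*x) should satisfy
# dT/dt = a * d2T/dx2 up to finite-difference truncation/roundoff error.
alpha = 1.0

def T(x, t):
    return np.exp(-alpha * np.pi**2 * t) * np.sin(np.pi * x)

x, t, h = 0.3, 0.1, 1e-4
dT_dt = (T(x, t + h) - T(x, t - h)) / (2 * h)               # central diff in t
d2T_dx2 = (T(x + h, t) - 2 * T(x, t) + T(x - h, t)) / h**2  # central diff in x
residual = abs(dT_dt - alpha * d2T_dx2)
print(residual)  # near zero: the PDE is satisfied
```

The same function also satisfies the boundary conditions, T(0,t) = T(1,t) = 0, and reduces to the initial condition sin(πx) at t = 0.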

🧠 Methods & Model Design

πŸ“Š Visual Results

πŸ” Training Loss Convergence (Total, PDE, BC, IC)

Shows stable convergence of the physics-constrained loss components.

Training Loss Curves

🧠 Interpretation

The periodic spikes are normal in PINNs and occur because:

πŸ”¬ Physical meaning

The network is not just fitting data — it is learning a temperature field that obeys energy conservation throughout the domain.

βœ”οΈ This confirms successful physics enforcement, not just numerical curve fitting.

🌡️ Temperature Profiles — PINN vs Analytical

Comparison across multiple time slices validates physical accuracy.

PINN vs Analytical Solution

Agreement holds across all time slices:

🧠 Interpretation

The PINN accurately captures:

πŸ”¬ Physical meaning

The model has learned:

This shows the PINN has internalized the governing physics, not memorized discrete points.

βœ”οΈ This level of overlap is equivalent to a high-resolution numerical solver.

🔥 Absolute Error Heatmap (Space–Time)

Highlights regions of higher error and overall solution fidelity.

Absolute Error Heatmap

🧠 Interpretation

Higher error near t=0 is common because:

Boundary regions are more sensitive due to:

Importantly:

πŸ”¬ Physical meaning

The PINN provides a globally consistent thermal field, suitable for:

The smooth error structure indicates numerical stability, not overfitting.

πŸ“Š Quantitative Summary (From Metrics)

Interpretation

βœ”οΈ This validates PINNs as a credible alternative to traditional solvers.

🧠 Big-Picture Insight

This experiment shows that the PINN:

In other words:

πŸ“Œ Final Insight

A Physics-Informed Neural Network was trained to solve the 1D transient heat equation, achieving sub-percent error and near-perfect agreement with the analytical solution across the full space–time domain.


πŸ“ˆ Key Outputs

The project generates:


πŸ’Ό Applications


πŸ›  Tools & Technologies


πŸ“ Project Files


πŸš€ Future Extensions


📊 Project 2 — Dynamic Temperature & Velocity Analysis in Engineering Systems

πŸ” Overview

This project applies data science, exploratory data analysis (EDA), and predictive modeling to analyze thermal and fluid dynamic behavior in two critical engineering systems:

By combining physics-based simulation, statistical summaries, and predictive trend analysis, the project demonstrates how data science can be used to monitor system stability, detect deviations, and support operational decision-making in industrial environments.


🎯 Objectives


πŸ§ͺ System 1 β€” Heat Exchanger Temperature Distribution

This visualization shows how temperature evolves over time and distance inside a heat exchanger.

Temperature Distribution in Heat Exchanger

Key Insights


πŸ”₯ System 2 β€” Dynamic Temperature Control in a Chemical Reactor

A spatio-temporal view of temperature regulation inside a reactor.

Dynamic Temperature Control

Key Insights


πŸ“ˆ Exploratory Data Analysis (EDA) Across Systems

Summary statistics across three subsystems:

EDA Summary

Metrics Analyzed

Why It Matters


🚰 Pipe Flow Analysis — Velocity vs Radius

Observed vs predicted velocity distribution along pipe radius.

Velocity Profile
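If the predicted curve follows the classic laminar (Hagen–Poiseuille) profile, it can be sketched as below; the pipe radius and centerline velocity are assumed values chosen for illustration, not the notebook's actual parameters:

```python
import numpy as np

# Laminar pipe-flow profile: v(r) = v_max * (1 - (r/R)^2),
# maximum at the centerline (r=0), zero at the wall (r=R, no-slip).
R, v_max = 0.05, 2.0                  # radius [m], centerline velocity [m/s]
r = np.linspace(0.0, R, 11)
v = v_max * (1.0 - (r / R) ** 2)

print(v[0], v[-1])                    # centerline velocity, wall velocity
# For this profile the cross-section mean velocity is exactly v_max / 2:
#   v_mean = (1 / (pi R^2)) * integral of v(r) * 2 pi r dr = v_max / 2
```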

Key Insights

πŸ€– Predictive Modeling β€” Reactor Mid-Position Temperature

Comparison between observed and predicted reactor temperature over time.

Reactor Temperature Prediction

Key Insights


🧠 Data Science Techniques Used


πŸ›  Tools & Technologies

πŸ“‚ Notebook:

Heat Exchangers vs. Reactors – The Role of Dynamic Temperature & Fluid Velocity Profiles.ipynb


πŸš€ Applications


⭐ Why This Project Matters

This project demonstrates how data science bridges theory and real-world engineering systems, enabling:


πŸ“Š Project 3: Photolithography Yield Risk Prediction

AI-Driven Pass/Fail Modeling for Semiconductor Manufacturing

Project Type: Industrial Data Science Β· Manufacturing AI Β· Explainable ML
Dataset: SECOM Semiconductor Manufacturing Dataset (UCI ML Repository)


🧰 Tools & Technologies


πŸ”¬ Project Overview

Modern semiconductor manufacturing—especially photolithography—operates under extremely tight process windows. Small deviations in exposure, focus, thermal stability, or tool health can lead to critical dimension (CD) or overlay excursions, resulting in yield loss.

This project develops an AI-driven pass/fail risk prediction system using real semiconductor process sensor data.
The objective is to identify yield risk before downstream metrology, enabling:


🎯 Business & Engineering Objective

Problem Statement

Can we predict whether a manufacturing run will PASS or FAIL specification using high-dimensional process sensor data—before final inspection?

Why This Matters


🧠 Dataset Description

Source: UCI Machine Learning Repository – SECOM Dataset

| Attribute | Value |
| --- | --- |
| Samples | 1,567 manufacturing runs |
| Sensors | 590 process variables |
| Target | Pass / Fail |

Key Characteristics

✅ This makes the dataset highly realistic for semiconductor manufacturing analytics.


πŸ”„ Data Science Lifecycle (Photolithography Context)

1️⃣ Problem Definition

2️⃣ Data Collection

Process telemetry representing:

3️⃣ Data Understanding

4️⃣ Data Cleaning & Wrangling

5️⃣ Exploratory Data Analysis (EDA)

class_distribution

missing_fraction_histogram

sensor_correlation_heatmap

EDA highlights:


6️⃣ Feature Engineering


7️⃣ Modeling


8️⃣ Model Evaluation

roc_curves_logreg_vs_rf

| Model | ROC-AUC |
| --- | --- |
| Logistic Regression | ~0.64 |
| Random Forest | ~0.78 |
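The comparison can be reproduced in spirit on a synthetic high-dimensional, imbalanced dataset; this is an illustrative stand-in for the SECOM pipeline, so the exact scores will differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: many sensors, few informative, ~7% FAIL rate.
X, y = make_classification(n_samples=1500, n_features=100, n_informative=10,
                           weights=[0.93, 0.07], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scaler = StandardScaler().fit(X_tr)
lr = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
rf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                            random_state=0).fit(X_tr, y_tr)

auc_lr = roc_auc_score(y_te, lr.predict_proba(scaler.transform(X_te))[:, 1])
auc_rf = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
print(f"LogReg AUC: {auc_lr:.3f}  RF AUC: {auc_rf:.3f}")
```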

Interpretation


9️⃣ Operational Insight: Confusion Matrix

rf_confusion_matrix

At default thresholds, the Random Forest behaves conservatively, flagging nearly all runs as FAIL.

Prediction ≠ Decision

Threshold tuning is essential to balance yield protection vs throughput.
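A threshold sweep like the following makes the trade-off concrete; the scores here are synthetic, not the model's actual outputs:

```python
import numpy as np

# Sweep thresholds and trade off FAIL recall (yield protection) against
# the fraction of runs flagged for review (throughput impact).
rng = np.random.default_rng(1)
y_true = rng.random(1000) < 0.07                       # ~7% true FAILs
scores = np.clip(0.3 + 0.4 * y_true + rng.normal(0, 0.15, 1000), 0, 1)

results = {}
for thr in (0.3, 0.5, 0.7):
    flagged = scores >= thr
    recall = (flagged & y_true).sum() / y_true.sum()   # FAILs caught
    flag_rate = flagged.mean()                         # runs held for review
    results[thr] = (recall, flag_rate)
    print(f"thr={thr:.1f}  recall={recall:.2f}  flag_rate={flag_rate:.2f}")
```

Lowering the threshold catches more FAILs but holds more good runs; the operating point depends on the relative cost of a missed excursion versus a false alarm.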


πŸ” Explainability & Root-Cause Insight

top_sensor_importances

Key observations:


πŸ“‘ Deployment & Drift Monitoring

psi_drift_monitoring

This mirrors how fabs monitor equipment health in production.
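A common drift metric behind plots like this is the Population Stability Index (PSI). A self-contained sketch, where the binning scheme and thresholds are conventional choices rather than values taken from the notebook:

```python
import numpy as np

# PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%),
# with bins defined by quantiles of the baseline (expected) distribution.
def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                 # catch out-of-range
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None) # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5000)                         # training-time data
print(round(psi(baseline, rng.normal(0.0, 1, 5000)), 3))  # no drift: small
print(round(psi(baseline, rng.normal(0.8, 1, 5000)), 3))  # shifted: large
```

A common rule of thumb reads PSI below 0.1 as stable, 0.1 to 0.25 as moderate shift, and above 0.25 as significant drift warranting investigation.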


πŸš€ Why This Project Stands Out

✔ Real semiconductor manufacturing data
✔ High-dimensional, imbalanced industrial ML problem
✔ Strong focus on explainability and deployment readiness
✔ Direct relevance to AI chip production and advanced nodes


πŸ“Œ Future Enhancements


πŸ“ Project Files

πŸ““ Photolithography_Project.ipynb


Project 4: Reliability Analysis & Survival Modeling

Kaplan–Meier, Hazard Functions, and Batch Comparison


πŸ“Œ Project Overview

This project focuses on reliability engineering and time-to-event analysis using survival analysis techniques. The objective is to model failure behavior over time, quantify survival probabilities, and compare reliability performance across manufacturing batches.

The analysis applies industry-standard statistical methods widely used in manufacturing, aerospace, defense, and semiconductor reliability studies.


🎯 Objectives


🧠 Methods & Techniques


πŸ“Š Key Visualizations & Insights

1️⃣ Kaplan–Meier Survival Curve

This plot estimates the probability that a unit survives beyond a given time.

Steep early decline indicates early-life failures

Gradual tail suggests wear-out behavior

Confidence bands show estimation uncertainty over time

Kaplan–Meier Survival Curve
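As a minimal illustration of how the estimator works, hand-rolled on made-up failure times rather than the project's data:

```python
import numpy as np

# Kaplan-Meier: S(t) is the product over event times of (1 - d_i / n_i),
# where d_i = failures at time t_i and n_i = units still at risk just before
# t_i. Censored units leave the risk set without counting as failures.
durations = np.array([5, 8, 8, 12, 15, 20, 20, 25, 30, 30])
observed = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 0])  # 1=failure, 0=censored

times = np.unique(durations[observed == 1])
surv, S = [], 1.0
for t in times:
    n_at_risk = (durations >= t).sum()
    d = ((durations == t) & (observed == 1)).sum()
    S *= 1.0 - d / n_at_risk
    surv.append((t, S))

for t, s in surv:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

The discrete hazard at each event time is d_i / n_i, the per-interval failure rate among survivors, which is the quantity behind the hazard-function plot later in this section.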


2️⃣ Time-to-Event Distribution (Failure vs. Censored)

This visualization contrasts observed failures against censored observations.

Failures dominate early time periods

Censored observations increase at later times

Confirms the need for survival modeling vs. simple averages

Time to Event Distribution


3️⃣ Survival Probability Comparison (Batch A vs. Batch B)

This comparison highlights reliability differences between manufacturing batches.

Batch A demonstrates consistently higher survival probability

Batch B experiences earlier degradation

Confidence intervals reflect statistical uncertainty

Survival Curve by Batch


4️⃣ Smoothed Survival Curve Comparison (Batch A vs. Batch B)

Smoothed curves help reveal underlying reliability trends.

Batch A shows delayed failure onset

Batch B exhibits faster reliability decay

Useful for management-level interpretation

Survival Curve by Batch


5️⃣ Hazard Function Analysis

The hazard function represents the instantaneous failure rate.

Increasing hazard rate indicates aging or wear-out failure mode

Critical for maintenance planning and lifecycle decisions

Hazard Function


πŸ“ˆ Business & Engineering Impact


πŸ›  Tools & Technologies


πŸ“ Notebook: Reliability Analysis Using Weibull Modeling.ipynb


πŸ” Key Takeaway

Survival analysis provides a statistically robust framework to evaluate reliability, account for censored data, and compare manufacturing performance across batches—far beyond traditional MTBF metrics.


📊 Project 5: Oath–Outcome Alignment Analysis

From Constitutional Promises to Measurable Outcomes


πŸ“Œ Project Overview

This project applies data science, statistical modeling, and natural language processing (NLP) to evaluate whether real-world institutional outcomes align with the constitutional obligations defined in official government oaths.

Public institutions in the United States—military, law enforcement, judiciary, and civil government—derive their authority from oaths sworn to the U.S. Constitution. While these oaths establish clear legal and ethical obligations, there is limited quantitative research measuring how closely institutional behavior aligns with those commitments.

This project addresses that gap by converting normative legal principles into measurable signals and comparing them against observed institutional outcomes.


🎯 Research Question

Do institutional outcomes align with the constitutional obligations defined in official oaths?


🧠 Why This Matters

Relevant to:


πŸ“Š Key Visualizations

πŸ”΅ Oath vs Outcome Radar Chart (Law Enforcement Example)


Oath vs Outcome Radar Chart

Interpretation


πŸ“‰ Distribution of OOAS Across Agencies

OOAS Distribution

Insight


πŸ”₯ OOAS Heatmap by Agency & State

OOAS Heatmap

Insight


πŸ”¬ Methodology

Text Analysis (NLP)

Feature Engineering

Statistical Modeling

Visualization

πŸ“Š Project 6: Data Center Insights with Data Science & Engineering

Operational Intelligence, Reliability, and Performance Optimization


πŸ“Œ Project Overview

This project applies data science methods grounded in engineering principles to analyze and interpret data center operational behavior, focusing on thermal stability, energy consumption, and communication efficiency.

Modern data centers function as tightly coupled cyber-physical systems. Small deviations in temperature, power usage, or communication latency can propagate into equipment stress, efficiency loss, or reliability risk. This project demonstrates how engineering-aware analytics can support proactive monitoring and decision-making.


🎯 Analytical Objectives


πŸ“ˆ Key Visual Analyses

🌑️ Reactor / Equipment Temperature Monitoring (24-Hour Cycle)

Simulated Reactor Temperature Over 24 Hours

Insight


⚑ Power Consumption Patterns in a Data Center

Power Consumption Over a Day

Insight


πŸš€ Communication Latency: Optical vs Electrical Transmission

Latency Comparison: Light vs Electrical Communication

Insight


🧠 Engineering + Data Science Integration

This project explicitly connects:

Rather than treating data as abstract, each variable is interpreted within its physical and operational context.


πŸ§ͺ Deliverables


πŸš€ Future Extensions


πŸ“ Project Files

πŸ““ The Data Center insights with Data science and engineering (1).ipynb


πŸ“‘ Project 7: Wi-Fi Optimization & Communication Performance Analysis

Signal Quality, Reliability, and Network Efficiency


πŸ“Œ Project Overview

This project applies data science, signal processing concepts, and network engineering principles to analyze wireless communication performance, with a focus on signal reliability, coverage quality, and user-level optimization.

Wireless networks are fundamental to modern digital infrastructure, yet their performance is constrained by noise, interference, distance, and infrastructure placement. This project demonstrates how engineering-informed analytics can be used to evaluate and optimize Wi-Fi performance using quantitative signal metrics.


🎯 Analytical Objectives


πŸ“ˆ Key Visual Analyses

πŸ“‰ Bit Error Rate vs Signal-to-Noise Ratio (QPSK)

QPSK BER vs SNR
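The theoretical curve behind this plot is the standard QPSK-over-AWGN result, BER = ½ · erfc(√(Eb/N0)), with Eb/N0 in linear scale. A small sketch:

```python
import math

# Theoretical QPSK bit-error rate over an AWGN channel (standard result).
def qpsk_ber(ebn0_db: float) -> float:
    ebn0 = 10 ** (ebn0_db / 10)          # dB -> linear
    return 0.5 * math.erfc(math.sqrt(ebn0))

for db in (0, 4, 8, 10):
    print(f"Eb/N0 = {db:>2} dB  ->  BER = {qpsk_ber(db):.2e}")
```

The steep waterfall shape of the curve is why a few dB of extra SNR can cut error rates by orders of magnitude.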

Insight


πŸ—ΊοΈ Wi-Fi Coverage Map: Average SNR by Region

Wi-Fi Coverage Map

Insight


πŸš€ User Optimization: SNR vs Throughput

Wi-Fi Optimization: SNR vs Throughput
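One way to frame the SNR-versus-throughput relationship is the Shannon capacity bound, C = B · log2(1 + SNR). The sketch below assumes a 20 MHz channel, an illustrative choice rather than a value from the notebook; real Wi-Fi throughput sits below this bound and saturates at the highest supported modulation and coding rate:

```python
import math

# Idealized upper bound on throughput for a given SNR and bandwidth.
def shannon_capacity_mbps(bandwidth_hz: float, snr_db: float) -> float:
    snr = 10 ** (snr_db / 10)            # dB -> linear
    return bandwidth_hz * math.log2(1 + snr) / 1e6

for snr_db in (5, 15, 25, 35):
    mbps = shannon_capacity_mbps(20e6, snr_db)
    print(f"SNR = {snr_db} dB  ->  capacity ~ {mbps:.0f} Mbps (20 MHz channel)")
```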

Insight


🧠 Engineering & Data Science Integration

This project integrates:

Each result is interpreted in the context of physical signal behavior and network performance limits.


πŸ§ͺ Deliverables


πŸš€ Future Enhancements


πŸ“ Project Files

πŸ““ Wifi optimization (1).ipynb


πŸ‘€ Author

Jemael Nzihou — PhD Student in Data Science | Chemical Engineer | Business Analytics | Quality Champion Certified
πŸ”— Portfolio: https://jemaelnzihou.github.io/Jemael-Nzihou-Portfolio/
πŸ”— LinkedIn: https://www.linkedin.com/in/jemaelnzihou

Quality Champion Credential

IBM Data Science Professional Certificate

Business Intelligence Professional Certificate

Google Advanced Data Analytics


πŸ“œ License

This project is released for research and educational use. Please cite appropriately if used in academic or policy work.