Data Exploration & Preparation

Session 1

Introduction to data science workflows and preparation techniques for ML projects

The Data Preparation Challenge

Why data preparation is critical for ML success

Experimental Process

Data science requires iterative experimentation with different transformations and approaches

Performance Impact

Quality of data preparation has massive impact on model performance

Format Knowledge

Understanding data formats (JSON, XML, CSV, APIs) and their constraints is essential

Organization

Without structure, data pipelines become messy and hard to maintain

Reality check: Data preparation often takes 60-80% of a data scientist's time in real-world projects.

The Data Science Workflow

From raw data to production models

1. Collect: Raw data
2. Clean: Handle missing
3. Engineer: Features
4. Train: Build model
5. Evaluate: Test & tune
6. Deploy: Production

Data Preparation (Steps 1-3)

  • Acquire data from various sources
  • Handle missing values, outliers, duplicates
  • Transform features for ML algorithms
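The cleaning steps above can be sketched in pandas. This is a minimal illustration with hypothetical column names and toy values, not a prescription for any particular dataset:

```python
import pandas as pd

# Hypothetical customer data with common quality issues:
# missing values, a duplicate row, and an implausible outlier
df = pd.DataFrame({
    "age": [25, None, 45, 45, 130],
    "salary": [50000, 65000, None, None, 78000],
})

df = df.drop_duplicates()                                # drop duplicate rows
df["age"] = df["age"].fillna(df["age"].median())         # impute missing ages
df["salary"] = df["salary"].fillna(df["salary"].mean())  # impute missing salaries
df = df[df["age"] < 120]                                 # remove an implausible outlier
```

Note that each choice here (median vs. mean imputation, dropping vs. capping outliers) is exactly one of the transformation options discussed later.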

Model Development (Steps 4-6)

  • Train models with prepared data
  • Iterate: evaluate → improve features → retrain
  • Deploy validated model to production
Non-linear process: The workflow is iterative—you'll loop back to feature engineering based on evaluation results.

The Combinatorial Explosion

Why data preparation gets complex fast

Problem: With multiple transformation options for each feature, the number of possible pipelines explodes exponentially.

Example: Single Numerical Feature

Scaling Options
StandardScaler, MinMaxScaler, RobustScaler, Normalizer
Missing Value Handling
Mean, Median, Mode, Drop, Forward Fill, Interpolate
Outlier Treatment
Remove, Cap, Transform (log/sqrt), Keep

Reality

  • Even a single feature has dozens of transformation paths
  • Multiply this by all features in your dataset
  • Execution is not linear—you try, evaluate, backtrack
  • Reproducibility becomes critical
Solution: Use systematic workflows and tools (like Orange or sklearn pipelines) to manage complexity.
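With sklearn, one path through these choices can be pinned down as a named, reproducible pipeline. The sketch below picks one arbitrary combination (median imputation, then standardization) purely to show the mechanism:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# One explicit path through the combinatorial space:
# missing values -> median, scaling -> StandardScaler
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_prepared = pipe.fit_transform(X)  # same steps, same order, every run
```

Because the pipeline names each step, swapping `MinMaxScaler` for `StandardScaler` is a one-line change rather than a rewrite, which keeps the try-evaluate-backtrack loop reproducible.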

ML-Ready Data Types

What algorithms actually need

Numerical Data

  • Integer: Counts, IDs, discrete values
  • Float: Continuous measurements, prices, coordinates

Examples: Age (int), Temperature (float), Salary (float)

Categorical Data

  • Nominal: No order (color, country)
  • Ordinal: Ordered categories (rating, size)
  • Boolean: True/False, Yes/No

Examples: Gender (nominal), Education Level (ordinal), Is_Customer (boolean)

Key insight: ML algorithms only accept these types. Everything else (text, images, dates, URLs) must be transformed into numerical or categorical features.
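In pandas these type distinctions are explicit. The sketch below, using made-up example columns, shows how each ML-ready type is represented, including declaring an ordinal ordering:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 34, 45],                 # integer (numerical)
    "temperature": [36.6, 37.1, 36.9],   # float (numerical)
    "color": ["red", "blue", "red"],     # nominal: no order
    "size": ["S", "M", "L"],             # ordinal: ordered categories
    "is_customer": [True, False, True],  # boolean
})

# Declare the ordering so "size" is treated as ordinal, not just nominal
df["size"] = pd.Categorical(df["size"], categories=["S", "M", "L"], ordered=True)
```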

The DataFrame: ML's Data Format

Algorithms want "rectangle" data

DataFrame: A 2D table where rows are observations and columns are features, with one column as the target variable.

Rows = Observations

Each row represents a single data point or instance (e.g., one customer, one transaction)

Columns = Features

Each column represents a variable or attribute (e.g., age, price, category)

Target Column

One special column contains the label or value you want to predict

Feature_1   Feature_2   Feature_3   Target
25          Male        50000       Purchased
34          Female      65000       Not
45          Male        78000       Purchased
...         ...         ...         ...

Why This Course?

Real-world data comes in many forms

The Reality

Data preparation requires software engineering skills, not just statistics

Data scientists and AI engineers have different focuses, but both need to handle diverse data sources

Data Sources We'll Cover

  • JSON & XML files
  • CSV & ARFF formats
  • SQL Databases
  • REST APIs
  • Web scraping

1 Extract

Get data from various sources (APIs, databases, files)

2 Transform

Convert to ML-ready format (numerical/categorical)

3 Load

Create clean DataFrame for model training
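The three ETL steps above can be sketched as three small functions. This is an illustrative skeleton, assuming CSV files for simplicity (the extract step could just as well hit a REST API or SQL database):

```python
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Extract: read raw data from a source (here, a CSV file)."""
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: convert to ML-ready numerical/categorical columns."""
    df = df.dropna()  # simplest missing-value strategy, for illustration
    text_cols = [c for c in df.columns if df[c].dtype == object]
    return pd.get_dummies(df, columns=text_cols)  # one-hot encode text columns

def load(df: pd.DataFrame, path: str) -> None:
    """Load: persist the clean DataFrame for model training."""
    df.to_csv(path, index=False)
```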

Tools & Methodology

Choosing the right approach for your workflow

Jupyter Notebooks

Pros: Interactive, visual feedback, great for exploration

Cons: Cell execution order issues, hard to version control properly

Python Scripts

Pros: Reproducible, easy to version, can be automated

Cons: Less interactive, slower feedback loop

Orange (Visual)

Pros: No coding required, drag-and-drop workflow, great for learning

Cons: Limited customization, not production-ready

Best practice: Start with Orange or Jupyter for exploration, then transition to Python scripts for production pipelines.

Orange: Visual Data Mining

Our tool for hands-on learning

What is Orange?

  • Open-source visual programming tool for data analysis
  • Drag-and-drop widgets for loading, transforming, and modeling
  • Perfect for understanding ML workflows without code
  • Outputs Python code you can learn from

Installation

pip install orange3

Download

orangedatamining.com

Available for Windows, macOS, and Linux

Pro tip: Orange is excellent for rapid prototyping. Build your workflow visually, then export to Python for production use.

Practical Exercise: Titanic Dataset

Apply data exploration concepts hands-on

Your Mission

Analyze the famous Titanic dataset to understand survival patterns using Orange

1 Load Data

Import titanic_train.csv into Orange

2 Explore Structure

How many features? How many rows? What's the target?

3 Visualize

Use Distributions widget to explore survival by class and gender

4 Find Patterns

Which features correlate with survival? Age? Class? Gender?

Hint: The Distributions widget shows how survival rate varies by passenger class or gender. Women and first-class passengers had higher survival rates.
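If you later redo this exercise in code, the pandas equivalent of the Distributions widget is a groupby. The sketch below uses a tiny made-up sample with the same columns as titanic_train.csv (in practice you would `pd.read_csv("titanic_train.csv")`):

```python
import pandas as pd

# Tiny stand-in sample; real analysis would load titanic_train.csv
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Pclass":   [1, 3, 1, 3, 2, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
})

# Survival rate by passenger class and by gender
by_class = df.groupby("Pclass")["Survived"].mean()
by_sex = df.groupby("Sex")["Survived"].mean()
```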
