Data Exploration & Preparation

Session 1

Introduction to data science workflows and preparation techniques for ML projects

The Data Preparation Challenge

Why data preparation is critical for ML success

Experimental Process

Data science requires iterative experimentation with different transformations and approaches

Performance Impact

Quality of data preparation has massive impact on model performance

Format Knowledge

Understanding data formats (JSON, XML, CSV, APIs) and their constraints is essential

Organization

Without structure, data pipelines become messy and hard to maintain

Reality check: Data preparation often takes 60-80% of a data scientist's time in real-world projects.

The Data Science Workflow

From raw data to production models

1. Collect: Raw data
2. Clean: Handle missing
3. Engineer: Features
4. Train: Build model
5. Evaluate: Test & tune
6. Deploy: Production

Data Preparation (Steps 1-3)

  • Acquire data from various sources
  • Handle missing values, outliers, duplicates
  • Transform features for ML algorithms
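The cleaning steps above can be sketched in pandas. This is a minimal illustration with hypothetical column names and toy values, not a prescription for any particular dataset:

```python
import pandas as pd

# Hypothetical customer data with common quality issues:
# missing values, a duplicate row, and an implausible outlier
df = pd.DataFrame({
    "age": [25, None, 45, 45, 130],
    "salary": [50000, 65000, None, None, 78000],
})

df = df.drop_duplicates()                                # drop duplicate rows
df["age"] = df["age"].fillna(df["age"].median())         # impute missing ages
df["salary"] = df["salary"].fillna(df["salary"].mean())  # impute missing salaries
df = df[df["age"] < 120]                                 # remove an implausible outlier
```

Note that each choice here (median vs. mean imputation, dropping vs. capping outliers) is exactly one of the transformation options discussed later.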

Model Development (Steps 4-6)

  • Train models with prepared data
  • Iterate: evaluate → improve features → retrain
  • Deploy validated model to production
Non-linear process: The workflow is iterative—you'll loop back to feature engineering based on evaluation results.

The Combinatorial Explosion

Why data preparation gets complex fast

Problem: With multiple transformation options for each feature, the number of possible pipelines explodes exponentially.

Example: Single Numerical Feature

Scaling Options
StandardScaler, MinMaxScaler, RobustScaler, Normalizer
Missing Value Handling
Mean, Median, Mode, Drop, Forward Fill, Interpolate
Outlier Treatment
Remove, Cap, Transform (log/sqrt), Keep

Reality

  • Even a single feature has dozens of transformation paths
  • Multiply this by all features in your dataset
  • Execution is not linear—you try, evaluate, backtrack
  • Reproducibility becomes critical
Solution: Use systematic workflows and tools (like Orange or sklearn pipelines) to manage complexity.
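With sklearn, one path through these choices can be pinned down as a named, reproducible pipeline. The sketch below picks one arbitrary combination (median imputation, then standardization) purely to show the mechanism:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# One explicit path through the combinatorial space:
# missing values -> median, scaling -> StandardScaler
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_prepared = pipe.fit_transform(X)  # same steps, same order, every run
```

Because the pipeline names each step, swapping `MinMaxScaler` for `StandardScaler` is a one-line change rather than a rewrite, which keeps the try-evaluate-backtrack loop reproducible.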

ML-Ready Data Types

What algorithms actually need

Numerical Data

  • Integer: Counts, IDs, discrete values
  • Float: Continuous measurements, prices, coordinates

Examples: Age (int), Temperature (float), Salary (float)

Categorical Data

  • Nominal: No order (color, country)
  • Ordinal: Ordered categories (rating, size)
  • Boolean: True/False, Yes/No

Examples: Gender (nominal), Education Level (ordinal), Is_Customer (boolean)

Key insight: ML algorithms only accept these types. Everything else (text, images, dates, URLs) must be transformed into numerical or categorical features.
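In pandas these type distinctions are explicit. The sketch below, using made-up example columns, shows how each ML-ready type is represented, including declaring an ordinal ordering:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 34, 45],                 # integer (numerical)
    "temperature": [36.6, 37.1, 36.9],   # float (numerical)
    "color": ["red", "blue", "red"],     # nominal: no order
    "size": ["S", "M", "L"],             # ordinal: ordered categories
    "is_customer": [True, False, True],  # boolean
})

# Declare the ordering so "size" is treated as ordinal, not just nominal
df["size"] = pd.Categorical(df["size"], categories=["S", "M", "L"], ordered=True)
```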

The DataFrame: ML's Data Format

Algorithms want "rectangle" data

DataFrame: A 2D table where rows are observations and columns are features, with one column as the target variable.

Rows = Observations

Each row represents a single data point or instance (e.g., one customer, one transaction)

Columns = Features

Each column represents a variable or attribute (e.g., age, price, category)

Target Column

One special column contains the label or value you want to predict

Feature_1   Feature_2   Feature_3   Target
25          Male        50000       Purchased
34          Female      65000       Not
45          Male        78000       Purchased
...         ...         ...         ...

Why This Course?

Real-world data comes in many forms

The Reality

Data preparation requires software engineering skills, not just statistics

Data scientists and AI engineers have different focuses, but both need to handle diverse data sources

Data Sources We'll Cover

  • JSON & XML files
  • CSV & ARFF formats
  • SQL Databases
  • REST APIs
  • Web scraping

1 Extract

Get data from various sources (APIs, databases, files)

2 Transform

Convert to ML-ready format (numerical/categorical)

3 Load

Create clean DataFrame for model training
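The three ETL steps above can be sketched as three small functions. This is an illustrative skeleton, assuming CSV files for simplicity (the extract step could just as well hit a REST API or SQL database):

```python
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Extract: read raw data from a source (here, a CSV file)."""
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: convert to ML-ready numerical/categorical columns."""
    df = df.dropna()  # simplest missing-value strategy, for illustration
    text_cols = [c for c in df.columns if df[c].dtype == object]
    return pd.get_dummies(df, columns=text_cols)  # one-hot encode text columns

def load(df: pd.DataFrame, path: str) -> None:
    """Load: persist the clean DataFrame for model training."""
    df.to_csv(path, index=False)
```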

Tools & Methodology

Choosing the right approach for your workflow

Jupyter Notebooks

Pros: Interactive, visual feedback, great for exploration

Cons: Cell execution order issues, hard to version control properly

Python Scripts

Pros: Reproducible, easy to version, can be automated

Cons: Less interactive, slower feedback loop

Orange (Visual)

Pros: No coding required, drag-and-drop workflow, great for learning

Cons: Limited customization, not production-ready

Best practice: Start with Orange or Jupyter for exploration, then transition to Python scripts for production pipelines.

Orange: Visual Data Mining

Our tool for hands-on learning

What is Orange?

  • Open-source visual programming tool for data analysis
  • Drag-and-drop widgets for loading, transforming, and modeling
  • Perfect for understanding ML workflows without code
  • Outputs Python code you can learn from

Installation

pip install orange3

Download

orangedatamining.com

Available for Windows, macOS, and Linux

Pro tip: Orange is excellent for rapid prototyping. Build your workflow visually, then export to Python for production use.

Practical Exercise: Titanic Dataset

Apply data exploration concepts hands-on

Your Mission

Analyze the famous Titanic dataset to understand survival patterns using Orange

1 Load Data

Import titanic_train.csv into Orange

2 Explore Structure

How many features? How many rows? What's the target?

3 Visualize

Use Distributions widget to explore survival by class and gender

4 Find Patterns

Which features correlate with survival? Age? Class? Gender?

Hint: The Distributions widget shows how survival rate varies by passenger class or gender. Women and first-class passengers had higher survival rates.
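If you later redo this exercise in code, the pandas equivalent of the Distributions widget is a groupby. The sketch below uses a tiny made-up sample with the same columns as titanic_train.csv (in practice you would `pd.read_csv("titanic_train.csv")`):

```python
import pandas as pd

# Tiny stand-in sample; real analysis would load titanic_train.csv
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Pclass":   [1, 3, 1, 3, 2, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
})

# Survival rate by passenger class and by gender
by_class = df.groupby("Pclass")["Survived"].mean()
by_sex = df.groupby("Sex")["Survived"].mean()
```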
