& Preparation
Session 1
Introduction to data science workflows and preparation techniques for ML projects
2026 WayUp
Why data preparation is critical for ML success
Data science requires iterative experimentation with different transformations and approaches
The quality of data preparation has a major impact on model performance
Understanding data formats (JSON, XML, CSV) and sources such as APIs, along with their constraints, is essential
Without structure, data pipelines become messy and hard to maintain
From raw data to production models
Why data preparation gets complex fast
Problem: With multiple transformation options for each feature, the number of possible pipelines explodes exponentially.
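To make the growth concrete, here is a minimal sketch that enumerates every combination of per-feature transformations. The feature names and transformation options are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical transformation choices per feature (illustrative only).
options = {
    "age": ["keep", "standardize", "bin"],
    "salary": ["keep", "log", "standardize"],
    "education": ["ordinal", "one-hot"],
}

# Every pipeline picks exactly one option per feature.
pipelines = list(product(*options.values()))
n_pipelines = len(pipelines)  # 3 * 3 * 2 = 18


def count_pipelines(n_features: int, options_per_feature: int) -> int:
    """Number of pipelines when every feature has the same number of options."""
    return options_per_feature ** n_features


# With 10 features and 3 options each there are already 3**10 = 59049 pipelines.
n_large = count_pipelines(10, 3)
```

Three features give 18 pipelines; ten features with three options each give 59,049, which is why trying every combination by hand quickly becomes impractical.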
What algorithms actually need
Numerical examples: Age (int), Temperature (float), Salary (float)
Categorical examples: Gender (nominal), Education Level (ordinal), Is_Customer (boolean)
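As a rough sketch of how these feature types could be represented in pandas (the slides do not prescribe a library, so pandas is an assumption here, and all values and the education ordering are made up):

```python
import pandas as pd

# Made-up rows that mirror the example feature names above.
df = pd.DataFrame({
    "Age": [25, 34, 45],                      # numerical (int)
    "Temperature": [36.6, 37.1, 38.0],        # numerical (float)
    "Gender": ["Male", "Female", "Male"],     # categorical, nominal
    "Education_Level": ["Bachelor", "High School", "Master"],  # categorical, ordinal
    "Is_Customer": [True, False, True],       # boolean
})

# Nominal: categories have no inherent order.
df["Gender"] = df["Gender"].astype("category")

# Ordinal: the order is declared explicitly (this ordering is an assumption).
edu_order = ["High School", "Bachelor", "Master"]
df["Education_Level"] = pd.Categorical(
    df["Education_Level"], categories=edu_order, ordered=True
)

# Integer codes follow the declared order: High School=0, Bachelor=1, Master=2.
edu_codes = df["Education_Level"].cat.codes.tolist()
```

Declaring the order explicitly matters for ordinal features: without it, the integer codes would follow alphabetical order rather than the real ranking.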
Algorithms want "rectangular" data
DataFrame: A 2D table where rows are observations and columns are features, with one column as the target variable.
Each row represents a single data point or instance (e.g., one customer, one transaction)
Each column represents a variable or attribute (e.g., age, price, category)
One special column contains the label or value you want to predict
| Feature_1 | Feature_2 | Feature_3 | Target |
|---|---|---|---|
| 25 | Male | 50000 | Purchased |
| 34 | Female | 65000 | Not |
| 45 | Male | 78000 | Purchased |
| ... | ... | ... | ... |
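The table above can be sketched as a pandas DataFrame and split into features and target. pandas is an assumption here; the column names and values are copied from the table.

```python
import pandas as pd

# The three visible rows of the table above.
df = pd.DataFrame({
    "Feature_1": [25, 34, 45],
    "Feature_2": ["Male", "Female", "Male"],
    "Feature_3": [50000, 65000, 78000],
    "Target": ["Purchased", "Not", "Purchased"],
})

# Rows are observations; all columns except the target are features.
X = df.drop(columns="Target")
y = df["Target"]

n_observations, n_features = X.shape
```

Most ML libraries expect exactly this split: a 2D feature table `X` and a 1D target `y` with one entry per row.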
Real-world data comes in many forms
Data preparation requires software engineering skills, not just statistics
Data Scientists and AI Engineers have different focuses, but both need to handle diverse data sources
Get data from various sources (APIs, databases, files)
Convert to ML-ready format (numerical/categorical)
Create clean DataFrame for model training
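The three steps above can be sketched end to end with pandas. This is a minimal illustration under stated assumptions: the CSV text and JSON payload are hypothetical stand-ins for a real file and a real API response, and one-hot encoding is just one possible way to convert a categorical column.

```python
import io
import json

import pandas as pd

# Hypothetical stand-ins for a CSV file and a JSON API response.
csv_text = "customer_id,age,gender\n1,25,Male\n2,34,Female\n3,45,Male\n"
api_payload = (
    '[{"customer_id": 1, "salary": 50000},'
    ' {"customer_id": 2, "salary": 65000},'
    ' {"customer_id": 3, "salary": 78000}]'
)

# Step 1: get data from the different sources.
from_file = pd.read_csv(io.StringIO(csv_text))
from_api = pd.DataFrame(json.loads(api_payload))

# Step 2: combine the sources and convert the categorical column to numbers.
raw = from_file.merge(from_api, on="customer_id")
encoded = pd.get_dummies(raw, columns=["gender"], dtype=int)

# Step 3: keep only model inputs in a clean, fully numeric DataFrame.
clean = encoded.drop(columns="customer_id")
```

The identifier column is dropped at the end because it distinguishes rows but carries no predictive signal.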
Choosing the right approach for your workflow
Notebooks
Pros: Interactive, visual feedback, great for exploration
Cons: Cell execution order issues, hard to version control properly
Scripts
Pros: Reproducible, easy to version, can be automated
Cons: Less interactive, slower feedback loop
Visual tools
Pros: No coding required, drag-and-drop workflow, great for learning
Cons: Limited customization, not production-ready
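The "reproducible, easy to version, can be automated" pros read as describing a script-based workflow. A minimal sketch of that style follows. The function names `prepare` and `main` and the command-line arguments are hypothetical, and dropping incomplete rows is only one possible cleaning policy.

```python
import argparse

import pandas as pd


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values, then one-hot encode non-numeric columns."""
    cleaned = df.dropna().reset_index(drop=True)
    text_cols = [
        col for col in cleaned.columns
        if not pd.api.types.is_numeric_dtype(cleaned[col])
    ]
    return pd.get_dummies(cleaned, columns=text_cols, dtype=int)


def main(argv=None) -> None:
    """Hypothetical command-line entry point: read a CSV, write the prepared CSV."""
    parser = argparse.ArgumentParser(description="Prepare a CSV for model training")
    parser.add_argument("input_csv")
    parser.add_argument("output_csv")
    args = parser.parse_args(argv)
    prepare(pd.read_csv(args.input_csv)).to_csv(args.output_csv, index=False)
```

Because `prepare` is a plain function with no hidden state, the same input always yields the same output, regardless of execution order. This is the property that notebook cells run out of order can lose.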
Our tool for hands-on learning
pip install orange3
Apply data exploration concepts hands-on
Analyze the famous Titanic dataset to understand survival patterns using Orange
Import titanic_train.csv into Orange
How many features? How many rows? What's the target?
Use Distributions widget to explore survival by class and gender
Which features correlate with survival? Age? Class? Gender?
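For comparison with the Orange widgets, the same questions could be asked in pandas. This is only a sketch: the column names (`Survived`, `Pclass`, `Sex`, `Age`) are an assumption about how titanic_train.csv is laid out, and the six rows below are made-up stand-ins, not real Titanic records, so the numbers they produce say nothing about actual survival patterns.

```python
import pandas as pd

# Made-up stand-in rows; in the exercise you would instead load the real file,
# e.g. pd.read_csv("titanic_train.csv"), assuming it uses these column names.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
})

# How many rows? How many columns? (One column is the target, the rest are features.)
n_rows, n_cols = sample.shape
target = "Survived"
n_features = n_cols - 1

# Survival rate by gender and by class, analogous to the Distributions widget.
survival_by_sex = sample.groupby("Sex")[target].mean()
survival_by_class = sample.groupby("Pclass")[target].mean()

# Linear correlation between age and survival (rows with missing age are ignored).
age_corr = sample["Age"].corr(sample[target])
```

Group-wise means work for categorical features such as class and gender, while a correlation coefficient is only meaningful for numerical features such as age.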