Dataset Preparation

By Example

Session 2

Hands-on data exploration and feature engineering using California fire incidents

Interactive Data Exploration

Learning by doing with real-world datasets

This Session is Interactive

A main exploration scenario is prepared, but you can propose your own ideas and experiments along the way

Dataset

California Fire Incidents

Real data about fire incidents in California, including location, resources deployed, and outcomes

Learning Goals

  • Explore and understand features
  • Handle missing values and duplicates
  • Engineer new features from dates
  • Deal with imbalanced data
  • Merge external datasets

California Fire Incidents Schema

Understanding the key features

Target & Location

  • MajorIncident: Target variable (boolean)
  • Latitude: GPS coordinate
  • Longitude: GPS coordinate
  • Counties: Geographic region
  • AcresBurned: Fire size metric

Resources & Timeline

  • Started: Fire start date/time
  • Extinguished: Fire end date/time
  • PersonnelInvolved: Count of firefighters
  • Engines: Number deployed
  • Helicopters: Number deployed
  • StructuresDestroyed: Damage count
First task: Load this dataset in Orange and explore its structure.

Dataset Discovery

Understanding composition and quality

1 Understand Structure

How many rows? How many columns? What data types? (See the code sketch after this list.)

2 Identify Questions

What questions could this dataset answer?

3 Assess Quality

Are there missing values? Duplicates? Outliers?

4 Feature Analysis

Which features relate to potential target variables?
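
A minimal Python Script sketch for steps 1 and 3, assuming the dataset is connected to the widget's in_data input:

from Orange.data.pandas_compat import table_to_frame

# Convert the Orange table to pandas for quick inspection
df = table_to_frame(in_data)

print(df.shape)               # rows and columns
print(df.dtypes)              # data type of each column
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # exact duplicate rows
print(df.describe())          # summary statistics for numeric features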

Exercise: Take time to explore the dataset visually using Orange's Data Table and Distributions widgets before proceeding.

Our Prediction Question

Defining the machine learning task

Can we predict if a fire will become a major incident?

The target is defined by the MajorIncident feature (boolean)

Re-explore the Data

Now that you know the target, revisit the dataset with this specific question in mind

Visualize Relationships

Use scatterplots, distributions, and geomaps to understand patterns

First Filtering

Remove duplicates and obviously irrelevant features

Orange tip: Install the Geo add-on to visualize fire locations on a map.

Removing Duplicate Records

Using pandas in Orange's Python Script widget

Newer Orange releases include a Unique widget for dropping duplicates, but a Python Script with pandas works in any version and makes the logic explicit

from Orange.data.pandas_compat import table_from_frame, table_to_frame

# Convert Orange table to pandas DataFrame
df = table_to_frame(in_data)

# Remove duplicate rows
df = df.drop_duplicates()
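# To deduplicate on selected columns only, pass a subset, e.g.:
# df = df.drop_duplicates(subset=['Started', 'Latitude', 'Longitude'])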

# Convert back to Orange table
out_data = table_from_frame(df)
Result: This removes exact duplicate rows while preserving the rest of your data.

Target Variable Balance

Assessing class distribution

The Numbers

  • Non-major incidents (False): 66%
  • Major incidents (True): 34%

Imbalanced Dataset

This is imbalanced, but not critically so

Ratio: approximately 2:1 (non-major to major)

The Problem

A model that always predicts "False" would achieve 66% accuracy

Many algorithms optimize overall accuracy, so this baseline can pull them toward always predicting the majority class

Impact: Imbalanced datasets can bias probability estimates and reduce model performance on the minority class.

Handling Imbalanced Classes

Strategies to balance your dataset

Oversampling

Duplicate minority class samples to match majority class size

Pro: No data loss

Con: Risk of overfitting

Undersampling

Randomly remove majority class samples to match minority class

Pro: Fast, balanced dataset

Con: Data loss

Class Weights

Give a higher penalty to misclassifying the minority class (a scikit-learn sketch follows this list)

Pro: No data modification

Con: Not all algorithms support it

Collect More Data

Find additional minority class examples

Pro: Best solution

Con: Often not feasible

For this dataset: We'll use undersampling to create a balanced training set.
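
For reference, the class-weights alternative needs no resampling at all. A minimal scikit-learn sketch (illustrative; this is not an Orange widget):

from sklearn.ensemble import RandomForestClassifier

# 'balanced' reweights classes inversely to their frequency, so
# misclassifying the rarer "major incident" class costs more
model = RandomForestClassifier(class_weight='balanced', random_state=42)
# model.fit(X, y)  # fit as usual; no rows are duplicated or deleted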

Implementing Undersampling

Python Script widget in Orange

Use Orange's Python Script widget with this code to balance the classes

import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame

# Convert to pandas
df = table_to_frame(in_data)

# Separate by class
major = df[df['MajorIncident'] == True]
non_major = df[df['MajorIncident'] == False]
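# Note: depending on how Orange exports the class, the values may be
# the strings 'True'/'False' rather than booleans; adjust the
# comparisons above if these filters come back empty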

# Undersample majority class to match minority
non_major_sampled = non_major.sample(n=len(major), random_state=42)

# Combine both classes
balanced_df = pd.concat([major, non_major_sampled])

# Shuffle the dataset
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Convert back to Orange table
out_data = table_from_frame(balanced_df)
Result: Both classes now have equal representation in the training data.

First Model Evaluation

Testing with Random Forest

1 Load: raw data
2 Clean: remove duplicates
3 Balance: undersample
4 Train: Random Forest
5 Evaluate: metrics

Are you happy with the results?

Look at accuracy, precision, recall, and F1-score

Check Feature Importance

Which features contribute most to predictions? (A scikit-learn sketch follows these questions.)

Should features be selected?

Can we remove irrelevant or redundant features?

What extra data could help?

Weather? Terrain? Vegetation data?
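
To dig into feature importance beyond Orange's Rank widget, a minimal scikit-learn sketch; it assumes the prepared table still carries the MajorIncident column and uses only its numeric features:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from Orange.data.pandas_compat import table_to_frame

df = table_to_frame(in_data)
y = df['MajorIncident']
X = df.drop(columns=['MajorIncident']).select_dtypes('number')
X = X.fillna(X.median())  # guard against NaNs left after cleaning

# Impurity-based importances from a quick forest fit
model = RandomForestClassifier(random_state=42).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))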

Dealing with Missing Values

Three main strategies when you cannot recover data

1 Remove

Drop rows or columns with missing values

Use when: Small % missing, plenty of data remains

2 Impute

Fill with statistical values (mean, median, mode) or use regression models

Use when: Missing values are random, not systematic

3 Flag

Add a boolean column indicating "value was missing"

Use when: Missingness itself is informative

Orange Implementation

Use the Impute widget for GUI-based imputation, or Python Script for custom logic

# Python Script alternative for custom imputation
from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)

# Fill numeric columns with median
df['AcresBurned'] = df['AcresBurned'].fillna(df['AcresBurned'].median())

# Fill categorical with mode or 'Unknown'
df['Counties'] = df['Counties'].fillna('Unknown')
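# If Counties is a pandas categorical here, register the new value first:
# df['Counties'] = df['Counties'].cat.add_categories('Unknown').fillna('Unknown')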

out_data = table_from_frame(df)
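
Strategy 3 (flag) combines naturally with imputation. A minimal sketch, assuming missingness in AcresBurned might itself carry signal:

from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)

# Record which rows were missing BEFORE filling, so "was missing"
# survives as its own feature
df['AcresBurned_missing'] = df['AcresBurned'].isna()
df['AcresBurned'] = df['AcresBurned'].fillna(df['AcresBurned'].median())

out_data = table_from_frame(df)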

Dealing with Dates and Time

Extracting useful temporal features

Warning: Remove the year from event data. A model can only memorize which particular years were bad fire years, and that pattern cannot transfer to years it has never seen, so keeping the year invites overfitting.

Extract Useful Date Features Instead

Convert timestamps into meaningful categorical or cyclical features

import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)

# Convert to datetime
df['Started'] = pd.to_datetime(df['Started'])

# Extract useful features
df['Month'] = df['Started'].dt.month
df['DayOfWeek'] = df['Started'].dt.dayofweek
df['Hour'] = df['Started'].dt.hour

# Create season feature (1=Winter, 2=Spring, 3=Summer, 4=Fall)
df['Season'] = df['Month'].map({
    12:1, 1:1, 2:1,   # Winter
    3:2, 4:2, 5:2,    # Spring
    6:3, 7:3, 8:3,    # Summer
    9:4, 10:4, 11:4   # Fall
})

# Drop original date column
df = df.drop(columns=['Started'])

out_data = table_from_frame(df)
Why this works: Seasonal and time-of-day patterns are predictive, but specific years are not.
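
The script above leaves Month as an ordinary number, where December (12) and January (1) look maximally far apart. If a model should treat them as neighbors, a hedged sketch of cyclical encoding (the Month_sin/Month_cos names are illustrative):

import numpy as np
import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)
df['Month'] = pd.to_datetime(df['Started']).dt.month

# Map the month onto a circle so 12 and 1 end up adjacent
df['Month_sin'] = np.sin(2 * np.pi * df['Month'] / 12)
df['Month_cos'] = np.cos(2 * np.pi * df['Month'] / 12)

out_data = table_from_frame(df)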

Adding External Data

Enriching with weather information

Exercise: Merge Weather Data

Download and merge: Daily weather in California (1998-2020)

Orange Implementation

Use the Merge Data widget to join datasets, or script the join with pandas (see the sketch after this list)

  • Match on date column
  • Consider location matching
  • Handle missing weather data
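
If you script the join instead, a minimal pandas sketch; the weather filename and its 'date' column name are assumptions about the downloaded file:

import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame

fires = table_to_frame(in_data)
weather = pd.read_csv('california_weather.csv')  # hypothetical filename

# Normalize both sides to calendar dates before joining
fires['Date'] = pd.to_datetime(fires['Started']).dt.normalize()
weather['date'] = pd.to_datetime(weather['date']).dt.normalize()

# Left join: keep every fire, attach weather where a date matches
merged = fires.merge(weather, left_on='Date', right_on='date', how='left')

out_data = table_from_frame(merged)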

Why External Data?

  • Weather conditions affect fire spread
  • Temperature, humidity, and wind are predictive
  • Creates a richer feature space
  • Often dramatically improves model performance
Caution: Ensure weather data aligns temporally with fire start dates, not end dates.

Putting It All Together

Complete preparation pipeline

1 Clean: remove duplicates
2 Impute: missing values
3 Engineer: date features
4 Merge: weather data
5 Balance: classes
6 Train: model

Final Model Performance

Compare metrics before and after each enhancement

Feature Importance

Which features are most predictive?

Lessons Learned

What worked? What didn't? Why?

Key Takeaways

1 Understand your target first. Re-explore the data with your specific prediction question in mind.

2 Clean systematically. Duplicates, missing values, and outliers all affect model quality.

3 Feature engineering is critical. Extract useful information from dates, text, and complex fields.

4 External data adds value. Weather, terrain, and contextual data dramatically improve predictions.

Next steps: Apply these techniques to your own datasets and experiment with different feature engineering approaches.
