Dataset Preparation

By Example

Session 2

Hands-on data exploration and feature engineering using California fire incidents

Interactive Data Exploration

Learning by doing with real-world datasets

This Session is Interactive

A main exploration scenario is prepared, but you can propose your own ideas and experiments along the way

Dataset

California Fire Incidents

Real data about fire incidents in California, including location, resources deployed, and outcomes

Learning Goals

  • Explore and understand features
  • Handle missing values and duplicates
  • Engineer new features from dates
  • Deal with imbalanced data
  • Merge external datasets

California Fire Incidents Schema

Understanding the key features

Target & Location

  • MajorIncident: Target variable (boolean)
  • Latitude: GPS coordinate
  • Longitude: GPS coordinate
  • Counties: Geographic region
  • AcresBurned: Fire size metric

Resources & Timeline

  • Started: Fire start date/time
  • Extinguished: Fire end date/time
  • PersonnelInvolved: Count of firefighters
  • Engines: Number deployed
  • Helicopters: Number deployed
  • StructuresDestroyed: Damage count
First task: Load this dataset in Orange and explore its structure.

Dataset Discovery

Understanding composition and quality

1 Understand Structure

How many rows? How many columns? What data types? (See the code sketch after this list.)

2 Identify Questions

What questions could this dataset answer?

3 Assess Quality

Are there missing values? Duplicates? Outliers?

4 Feature Analysis

Which features relate to potential target variables?
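
A minimal Python Script sketch for steps 1 and 3, assuming the dataset is connected to the widget's in_data input:

from Orange.data.pandas_compat import table_to_frame

# Convert the Orange table to pandas for quick inspection
df = table_to_frame(in_data)

print(df.shape)               # rows and columns
print(df.dtypes)              # data type of each column
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # exact duplicate rows
print(df.describe())          # summary statistics for numeric features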

Exercise: Take time to explore the dataset visually using Orange's Data Table and Distributions widgets before proceeding.

Our Prediction Question

Defining the machine learning task

Can we predict if a fire will become a major incident?

The target is defined by the MajorIncident feature (boolean)

Re-explore the Data

Now that you know the target, revisit the dataset with this specific question in mind

Visualize Relationships

Use scatterplots, distributions, and geomaps to understand patterns

First Filtering

Remove duplicates and obviously irrelevant features

Orange tip: Install the Geo add-on to visualize fire locations on a map.

Removing Duplicate Records

Using pandas in Orange's Python Script widget

Newer Orange releases include a Unique widget for dropping duplicates, but a Python Script with pandas works in any version and makes the logic explicit

from Orange.data.pandas_compat import table_from_frame, table_to_frame

# Convert Orange table to pandas DataFrame
df = table_to_frame(in_data)

# Remove duplicate rows
df = df.drop_duplicates()
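# To deduplicate on selected columns only, pass a subset, e.g.:
# df = df.drop_duplicates(subset=['Started', 'Latitude', 'Longitude'])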

# Convert back to Orange table
out_data = table_from_frame(df)
Result: This removes exact duplicate rows while preserving the rest of your data.

Target Variable Balance

Assessing class distribution

The Numbers

  • Non-major incidents (False): 66%
  • Major incidents (True): 34%

Imbalanced Dataset

This is imbalanced, but not critically so

Ratio: approximately 2:1 (non-major to major)

The Problem

A model that always predicts "False" would achieve 66% accuracy

Many algorithms optimize overall accuracy, so this baseline can pull them toward always predicting the majority class

Impact: Imbalanced datasets can bias probability estimates and reduce model performance on the minority class.

Handling Imbalanced Classes

Strategies to balance your dataset

Oversampling

Duplicate minority class samples to match majority class size

Pro: No data loss

Con: Risk of overfitting

Undersampling

Randomly remove majority class samples to match minority class

Pro: Fast, balanced dataset

Con: Data loss

Class Weights

Give a higher penalty to misclassifying the minority class (a scikit-learn sketch follows this list)

Pro: No data modification

Con: Not all algorithms support it

Collect More Data

Find additional minority class examples

Pro: Best solution

Con: Often not feasible

For this dataset: We'll use undersampling to create a balanced training set.
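
For reference, the class-weights alternative needs no resampling at all. A minimal scikit-learn sketch (illustrative; this is not an Orange widget):

from sklearn.ensemble import RandomForestClassifier

# 'balanced' reweights classes inversely to their frequency, so
# misclassifying the rarer "major incident" class costs more
model = RandomForestClassifier(class_weight='balanced', random_state=42)
# model.fit(X, y)  # fit as usual; no rows are duplicated or deleted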

Implementing Undersampling

Python Script widget in Orange

Use Orange's Python Script widget with this code to balance the classes

import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame

# Convert to pandas
df = table_to_frame(in_data)

# Separate by class
major = df[df['MajorIncident'] == True]
non_major = df[df['MajorIncident'] == False]
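# Note: depending on how Orange exports the class, the values may be
# the strings 'True'/'False' rather than booleans; adjust the
# comparisons above if these filters come back empty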

# Undersample majority class to match minority
non_major_sampled = non_major.sample(n=len(major), random_state=42)

# Combine both classes
balanced_df = pd.concat([major, non_major_sampled])

# Shuffle the dataset
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Convert back to Orange table
out_data = table_from_frame(balanced_df)
Result: Both classes now have equal representation in the training data.

First Model Evaluation

Testing with Random Forest

1 Load: raw data
2 Clean: remove duplicates
3 Balance: undersample
4 Train: Random Forest
5 Evaluate: metrics

Are you happy with the results?

Look at accuracy, precision, recall, and F1-score

Check Feature Importance

Which features contribute most to predictions? (A scikit-learn sketch follows these questions.)

Should features be selected?

Can we remove irrelevant or redundant features?

What extra data could help?

Weather? Terrain? Vegetation data?
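
To dig into feature importance beyond Orange's Rank widget, a minimal scikit-learn sketch; it assumes the prepared table still carries the MajorIncident column and uses only its numeric features:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from Orange.data.pandas_compat import table_to_frame

df = table_to_frame(in_data)
y = df['MajorIncident']
X = df.drop(columns=['MajorIncident']).select_dtypes('number')
X = X.fillna(X.median())  # guard against NaNs left after cleaning

# Impurity-based importances from a quick forest fit
model = RandomForestClassifier(random_state=42).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))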

Dealing with Missing Values

Three main strategies when you cannot recover data

1 Remove

Drop rows or columns with missing values

Use when: Small % missing, plenty of data remains

2 Impute

Fill with statistical values (mean, median, mode) or use regression models

Use when: Missing values are random, not systematic

3 Flag

Add a boolean column indicating "value was missing"

Use when: Missingness itself is informative

Orange Implementation

Use the Impute widget for GUI-based imputation, or Python Script for custom logic

# Python Script alternative for custom imputation
from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)

# Fill numeric columns with median
df['AcresBurned'] = df['AcresBurned'].fillna(df['AcresBurned'].median())

# Fill categorical with mode or 'Unknown'
df['Counties'] = df['Counties'].fillna('Unknown')
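# If Counties is a pandas categorical here, register the new value first:
# df['Counties'] = df['Counties'].cat.add_categories('Unknown').fillna('Unknown')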

out_data = table_from_frame(df)
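
Strategy 3 (flag) combines naturally with imputation. A minimal sketch, assuming missingness in AcresBurned might itself carry signal:

from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)

# Record which rows were missing BEFORE filling, so "was missing"
# survives as its own feature
df['AcresBurned_missing'] = df['AcresBurned'].isna()
df['AcresBurned'] = df['AcresBurned'].fillna(df['AcresBurned'].median())

out_data = table_from_frame(df)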

Dealing with Dates and Time

Extracting useful temporal features

Warning: Remove the year from event data. A model can only memorize which particular years were bad fire years, and that pattern cannot transfer to years it has never seen, so keeping the year invites overfitting.

Extract Useful Date Features Instead

Convert timestamps into meaningful categorical or cyclical features

import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)

# Convert to datetime
df['Started'] = pd.to_datetime(df['Started'])

# Extract useful features
df['Month'] = df['Started'].dt.month
df['DayOfWeek'] = df['Started'].dt.dayofweek
df['Hour'] = df['Started'].dt.hour

# Create season feature (1=Winter, 2=Spring, 3=Summer, 4=Fall)
df['Season'] = df['Month'].map({
    12:1, 1:1, 2:1,   # Winter
    3:2, 4:2, 5:2,    # Spring
    6:3, 7:3, 8:3,    # Summer
    9:4, 10:4, 11:4   # Fall
})

# Drop original date column
df = df.drop(columns=['Started'])

out_data = table_from_frame(df)
Why this works: Seasonal and time-of-day patterns are predictive, but specific years are not.
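
The script above leaves Month as an ordinary number, where December (12) and January (1) look maximally far apart. If a model should treat them as neighbors, a hedged sketch of cyclical encoding (the Month_sin/Month_cos names are illustrative):

import numpy as np
import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)
df['Month'] = pd.to_datetime(df['Started']).dt.month

# Map the month onto a circle so 12 and 1 end up adjacent
df['Month_sin'] = np.sin(2 * np.pi * df['Month'] / 12)
df['Month_cos'] = np.cos(2 * np.pi * df['Month'] / 12)

out_data = table_from_frame(df)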

Adding External Data

Enriching with weather information

Exercise: Merge Weather Data

Download and merge: Daily weather in California (1998-2020)

Orange Implementation

Use the Merge Data widget to join datasets, or script the join with pandas (see the sketch after this list)

  • Match on date column
  • Consider location matching
  • Handle missing weather data
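
If you script the join instead, a minimal pandas sketch; the weather filename and its 'date' column name are assumptions about the downloaded file:

import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame

fires = table_to_frame(in_data)
weather = pd.read_csv('california_weather.csv')  # hypothetical filename

# Normalize both sides to calendar dates before joining
fires['Date'] = pd.to_datetime(fires['Started']).dt.normalize()
weather['date'] = pd.to_datetime(weather['date']).dt.normalize()

# Left join: keep every fire, attach weather where a date matches
merged = fires.merge(weather, left_on='Date', right_on='date', how='left')

out_data = table_from_frame(merged)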

Why External Data?

  • Weather conditions affect fire spread
  • Temperature, humidity, and wind are predictive
  • Creates a richer feature space
  • Often dramatically improves model performance
Caution: Ensure weather data aligns temporally with fire start dates, not end dates.

Putting It All Together

Complete preparation pipeline

1 Clean: remove duplicates
2 Impute: missing values
3 Engineer: date features
4 Merge: weather data
5 Balance: classes
6 Train: model

Final Model Performance

Compare metrics before and after each enhancement

Feature Importance

Which features are most predictive?

Lessons Learned

What worked? What didn't? Why?

Key Takeaways

1 Understand your target first. Re-explore the data with your specific prediction question in mind.

2 Clean systematically. Duplicates, missing values, and outliers all affect model quality.

3 Feature engineering is critical. Extract useful information from dates, text, and complex fields.

4 External data adds value. Weather, terrain, and contextual data dramatically improve predictions.

Next steps: Apply these techniques to your own datasets and experiment with different feature engineering approaches.
