Practical Work 1

Data Exploration and Preparation

Hands-on exercises with the California Fire Incidents dataset using Orange and Python

Duration: 3-4 hours
Difficulty: Intermediate
Session: Data Preparation

Objectives

By the end of this practical work, you will be able to:

  • Load and explore a real-world dataset in Orange
  • Identify and handle missing values
  • Deal with imbalanced classes using resampling techniques
  • Extract useful features from date/time columns
  • Build and evaluate a basic classification model

Prerequisites

  • Orange Data Mining installed (download it from orangedatamining.com)
  • Basic understanding of Python (optional, for advanced exercises)
  • The California Fire Incidents dataset

Install required Python packages (if using Python Script widget):

pip install pandas numpy scikit-learn

Instructions

Step 1: Load the Dataset

Download and load the California Fire Incidents dataset into Orange:

  1. Download the dataset: California_Fire_Incidents.csv
  2. In Orange, add a File widget to the canvas
  3. Double-click the widget and select the downloaded CSV file
  4. Connect a Data Table widget to view the data

Note: Take a moment to explore the dataset. How many rows and columns does it have? What types of features are present?
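If you prefer working in Python, the same first look can be sketched with pandas. The rows below are made up purely for illustration; with the real file, replace the StringIO buffer with the path to California_Fire_Incidents.csv.

```python
import io
import pandas as pd

# Toy rows standing in for the real CSV -- in practice, pass the path
# to California_Fire_Incidents.csv to read_csv instead of this buffer.
sample_csv = io.StringIO(
    "AcresBurned,Counties,MajorIncident,Started\n"
    "1500,Butte,True,2018-07-23\n"
    ",Shasta,False,2018-08-01\n"
    "12,Napa,False,2019-06-15\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)        # (rows, columns)
print(df.dtypes)       # feature types
print(df.isna().sum()) # missing values per column
```

The same three calls (`shape`, `dtypes`, `isna().sum()`) answer the questions in the note above for the full dataset.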

Step 2: Explore the Data

Add visualization widgets to understand the data:

  1. Connect a Distributions widget to see feature distributions
  2. Connect a Scatter Plot widget to explore relationships
  3. If you have the Geo add-on installed, use the Geo Map widget to visualize fire locations

Questions to answer:

  • What is the distribution of MajorIncident (True vs False)?
  • Which counties have the most fires?
  • Is there a correlation between AcresBurned and PersonnelInvolved?
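These questions can also be answered with a few pandas calls. The small frame below only stands in for the real dataset (the column names are taken from the exercise); run the same lines on your loaded data.

```python
import pandas as pd

# Hypothetical sample rows; substitute the DataFrame loaded in Step 1
df = pd.DataFrame({
    "MajorIncident":     [True, False, False, True, False, False],
    "Counties":          ["Butte", "Shasta", "Butte", "Napa", "Butte", "Napa"],
    "AcresBurned":       [1000.0, 50.0, 800.0, 1200.0, 30.0, 20.0],
    "PersonnelInvolved": [400.0, 20.0, 350.0, 500.0, 15.0, 10.0],
})

print(df["MajorIncident"].value_counts(normalize=True))  # class balance
print(df["Counties"].value_counts().head())              # fires per county
print(df["AcresBurned"].corr(df["PersonnelInvolved"]))   # Pearson correlation
```

`value_counts(normalize=True)` gives the True/False proportions directly, and `.corr()` quantifies the acres-vs-personnel relationship you see in the Scatter Plot widget.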

Step 3: Handle Missing Values

Use the Impute widget to handle missing values:

  1. Connect an Impute widget after your File widget
  2. Choose an imputation strategy: Average/Most frequent (mean for numeric features, most frequent value for categorical ones)
  3. Compare the data before and after imputation

Alternatively, use the Python Script widget for custom imputation:

from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)
# Fill numeric with median
df['AcresBurned'] = df['AcresBurned'].fillna(df['AcresBurned'].median())
# Fill categorical with 'Unknown'
df['Counties'] = df['Counties'].fillna('Unknown')
out_data = table_from_frame(df)

Step 4: Remove Duplicates

Use a Python Script widget to remove duplicate rows:

from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)
original_count = len(df)
df = df.drop_duplicates()
new_count = len(df)
print(f"Removed {original_count - new_count} duplicates")
out_data = table_from_frame(df)

Step 5: Handle Imbalanced Classes

The MajorIncident target is imbalanced (~66% False, ~34% True). Use undersampling:

import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)
# Separate by class
major = df[df['MajorIncident'] == True]
non_major = df[df['MajorIncident'] == False]
# Undersample majority class
non_major_sampled = non_major.sample(n=len(major), random_state=42)
# Combine
balanced_df = pd.concat([major, non_major_sampled])
print(f"Balanced dataset: {len(balanced_df)} rows")
out_data = table_from_frame(balanced_df)

Step 6: Feature Engineering - Date Features

Extract useful features from the Started date column:

import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)
# Convert to datetime
df['Started'] = pd.to_datetime(df['Started'], errors='coerce')
# Extract features
df['Month'] = df['Started'].dt.month
df['DayOfWeek'] = df['Started'].dt.dayofweek
df['Season'] = df['Month'].map({
    12:1, 1:1, 2:1,   # Winter
    3:2, 4:2, 5:2,    # Spring
    6:3, 7:3, 8:3,    # Summer
    9:4, 10:4, 11:4   # Fall
})
# Drop the raw date columns (Extinguished in particular would leak
# post-incident information into the model)
df = df.drop(columns=['Started', 'Extinguished'], errors='ignore')
out_data = table_from_frame(df)

Step 7: Build a Classification Model

Now build and evaluate a Random Forest model:

  1. Add a Select Columns widget to set MajorIncident as target
  2. Add a Random Forest widget (or other classifier)
  3. Add a Test and Score widget to evaluate
  4. Connect: Data → Select Columns → Test and Score ← Random Forest

Checkpoint: You should see accuracy metrics in Test and Score. With the balanced data, aim for >70% accuracy.
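The same evaluation can be sketched outside Orange with scikit-learn. The features below are synthetic stand-ins; with the real data, build X from your engineered columns and y from MajorIncident.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data -- replace with the cleaned, balanced dataset
rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Mirrors the Random Forest + Test and Score widgets (5-fold CV)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

`cross_val_score` plays the role of the Test and Score widget; swapping `RandomForestClassifier` for another estimator reproduces the "or other classifier" option above.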

Expected Output

After completing this practical work, you should have:

  • A cleaned dataset with no missing values or duplicates
  • A balanced dataset with equal class distribution
  • New date-based features (Month, DayOfWeek, Season)
  • A trained Random Forest model with evaluation metrics

Deliverables

  • Orange Workflow: Save your Orange workflow (.ows file)
  • Cleaned Dataset: Export the processed data as CSV
  • Model Evaluation: Screenshot of Test and Score results
  • Report: Brief summary of your findings (1-2 paragraphs)

Bonus Challenges

  • Challenge 1: Merge with weather data (weather-sf.csv) to add climate features
  • Challenge 2: Try different classifiers (SVM, Neural Network) and compare results
  • Challenge 3: Use feature importance to identify the most predictive features
  • Challenge 4: Create visualizations showing fire patterns by season or location
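For Challenge 3, a minimal sketch of reading feature importances from a fitted Random Forest. The feature matrix here is hypothetical (the label is built directly from AcresBurned so that column should dominate); substitute your engineered columns and real target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical engineered features; use your processed dataset instead
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "AcresBurned":       rng.exponential(500, n),
    "Month":             rng.integers(1, 13, n),
    "Season":            rng.integers(1, 5, n),
    "PersonnelInvolved": rng.exponential(100, n),
})
# Toy label driven entirely by AcresBurned, so it should rank first
y = (X["AcresBurned"] > 500).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

With the real data, the ranking tells you which engineered features (e.g. Season vs raw Month) actually help predict MajorIncident.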

Resources