Data Handling API Reference

API documentation for data processing and loading.

Dataset Classes

CancerDataset

mlops_project.data.CancerDataset

Bases: Dataset

HAM10000 skin lesion dataset.

__init__(data_path='../../data/raw/HAM10000', metadata_file='HAM10000_metadata.csv', transform=None, split_indices=None)

Initialize the HAM10000 dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data_path` | `str` | Path to the data folder containing `images/` and `metadata/` | `'../../data/raw/HAM10000'` |
| `metadata_file` | `str` | Name of the metadata CSV file | `'HAM10000_metadata.csv'` |
| `transform` | `Callable \| None` | Optional transform to apply to images | `None` |
| `split_indices` | `list[int] \| None` | Optional list of indices for the train/val/test split | `None` |

__len__()

Return the length of the dataset.

__getitem__(index)

Return the sample at the given index.

preprocess(output_folder, target_size=224)

Preprocess the raw data and save it to the output folder.

Data Functions

get_transforms

mlops_project.data.get_transforms(image_size=224, augment=True)

Get image transforms for training or validation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `image_size` | `int` | Target size for images | `224` |
| `augment` | `bool` | Whether to apply data augmentation (for training) | `True` |

Returns:

| Type | Description |
| --- | --- |
| `Compose` | Composed transforms |
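The returned `Compose` is presumably torchvision's `transforms.Compose`. As an illustration of the pattern only, here is a minimal pure-Python stand-in; the class, the string "transforms", and the lambda steps below are invented for this sketch and are not part of `mlops_project`:

```python
# Minimal stand-in for a Compose of transforms: callables applied in order.
# Everything here is illustrative; the real get_transforms presumably builds
# a torchvision Compose of resize/augmentation steps.

class Compose:
    """Apply each transform in sequence to the input."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, x):
        for transform in self.transforms:
            x = transform(x)
        return x


def get_transforms(image_size=224, augment=True):
    """Illustrative: always 'resize'; augment=True adds an extra step."""
    steps = [lambda img: f"{img}@{image_size}x{image_size}"]
    if augment:
        steps.append(lambda img: f"flip({img})")
    return Compose(steps)


print(get_transforms(augment=True)("img"))  # flip(img@224x224)
```

The key property to note is ordering: each step receives the output of the previous one, which is why augmentation steps are appended after resizing.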

preprocess

mlops_project.data.preprocess(data_path, output_folder, target_size=224)

Preprocess raw HAM10000 data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data_path` | `str` | Path to raw data folder | required |
| `output_folder` | `Path` | Path to save preprocessed data | required |
| `target_size` | `int` | Target image size | `224` |

DataLoader Functions

create_dataloaders

mlops_project.dataloader.create_dataloaders(data_path, image_size=224, batch_size=32, num_workers=4, train_ratio=0.525, val_ratio=0.175, test_ratio=0.3, random_seed=42)

Create train, validation, and test dataloaders.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data_path` | `str` | Path to the data folder (e.g., `'data/raw/ham10000'`) | required |
| `image_size` | `int` | Target size for image resizing | `224` |
| `batch_size` | `int` | Batch size for dataloaders | `32` |
| `num_workers` | `int` | Number of workers for parallel data loading | `4` |
| `train_ratio` | `float` | Fraction of data for training (default 52.5%) | `0.525` |
| `val_ratio` | `float` | Fraction of data for validation (default 17.5%) | `0.175` |
| `test_ratio` | `float` | Fraction of data for testing (default 30%) | `0.3` |
| `random_seed` | `int` | Random seed for reproducibility | `42` |

Returns:

| Type | Description |
| --- | --- |
| `tuple[DataLoader, DataLoader, DataLoader]` | Tuple of `(train_loader, val_loader, test_loader)` |

split_dataset_indices

mlops_project.dataloader.split_dataset_indices(metadata_path, train_ratio=0.525, val_ratio=0.175, test_ratio=0.3, random_seed=42)

Split dataset indices into train/val/test by lesion_id to avoid data leakage.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `metadata_path` | `str` | Path to the `HAM10000_metadata.csv` file | required |
| `train_ratio` | `float` | Fraction of data for training (default 52.5%) | `0.525` |
| `val_ratio` | `float` | Fraction of data for validation (default 17.5%) | `0.175` |
| `test_ratio` | `float` | Fraction of data for testing (default 30%) | `0.3` |
| `random_seed` | `int` | Random seed for reproducibility | `42` |

Returns:

| Type | Description |
| --- | --- |
| `tuple[list[int], list[int], list[int]]` | Tuple of `(train_indices, val_indices, test_indices)` |

Note

Splits by lesion_id rather than image_id to prevent the same lesion from appearing in multiple splits (data leakage).
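The lesion-level split described in the note can be sketched as follows. The function name and signature here are hypothetical (the real `split_dataset_indices` reads the metadata CSV); the point is the grouping logic: shuffle lesions, not images, so all images of one lesion land in the same split.

```python
# Illustrative leakage-safe split: shuffle lesion_ids, then assign every
# image index of a lesion to the same split. split_by_lesion is an invented
# name for this sketch, not the actual mlops_project API.
import random
from collections import defaultdict


def split_by_lesion(lesion_ids, train_ratio=0.525, val_ratio=0.175,
                    test_ratio=0.3, random_seed=42):
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-9
    groups = defaultdict(list)                 # lesion_id -> image indices
    for idx, lesion in enumerate(lesion_ids):
        groups[lesion].append(idx)
    lesions = sorted(groups)
    random.Random(random_seed).shuffle(lesions)
    n_train = int(len(lesions) * train_ratio)
    n_val = int(len(lesions) * val_ratio)
    parts = (lesions[:n_train],
             lesions[n_train:n_train + n_val],
             lesions[n_train + n_val:])
    return tuple([i for lesion in part for i in groups[lesion]]
                 for part in parts)


# Images 0 and 1 share lesion "a", so they always land in the same split.
train_idx, val_idx, test_idx = split_by_lesion(["a", "a", "b", "c", "d", "e"])
```

Splitting by image index instead would let two photographs of the same lesion appear in both train and test, inflating evaluation metrics; grouping by `lesion_id` closes that leak.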

subsample_dataloader

mlops_project.dataloader.subsample_dataloader(data_path, subsample_result, image_size=224, batch_size=32, num_workers=4, train_ratio=0.525, val_ratio=0.175, test_ratio=0.3, random_seed=42)

Usage Examples

```python
from mlops_project.data import CancerDataset, get_transforms
from mlops_project.dataloader import create_dataloaders

# Create a dataset with training transforms
transform = get_transforms(image_size=224, augment=True)
dataset = CancerDataset(data_path="data/raw/HAM10000", transform=transform)

# Create train/validation/test dataloaders
train_loader, val_loader, test_loader = create_dataloaders(
    data_path="data/raw/HAM10000",
    batch_size=32,
    image_size=224,
)
```