# Data Handling API Reference

API documentation for data processing and loading.

## Dataset Classes
### CancerDataset

`mlops_project.data.CancerDataset`

Bases: `Dataset`

HAM10000 skin lesion dataset.

`__init__(data_path='../../data/raw/HAM10000', metadata_file='HAM10000_metadata.csv', transform=None, split_indices=None)`

Initialize the HAM10000 dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_path` | `str` | Path to the data folder containing `images/` and `metadata/` | `'../../data/raw/HAM10000'` |
| `metadata_file` | `str` | Name of the metadata CSV file | `'HAM10000_metadata.csv'` |
| `transform` | `Callable \| None` | Optional transform to apply to images | `None` |
| `split_indices` | `list[int] \| None` | Optional list of indices for the train/val/test split | `None` |
`__len__()`

Return the length of the dataset.

`__getitem__(index)`

Return a given sample from the dataset.

`preprocess(output_folder, target_size=224)`

Preprocess the raw data and save it to the output folder.
## Data Functions

### get_transforms

`mlops_project.data.get_transforms(image_size=224, augment=True)`

Get image transforms for training or validation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `image_size` | `int` | Target size for images | `224` |
| `augment` | `bool` | Whether to apply data augmentation (for training) | `True` |

Returns:

| Type | Description |
|---|---|
| `Compose` | Composed transforms |
### preprocess

`mlops_project.data.preprocess(data_path, output_folder, target_size=224)`

Preprocess raw HAM10000 data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_path` | `str` | Path to raw data folder | *required* |
| `output_folder` | `Path` | Path to save preprocessed data | *required* |
| `target_size` | `int` | Target image size | `224` |
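The internals of `preprocess` are not documented here; as a rough, hypothetical sketch of the resize step implied by `target_size` (assuming Pillow, which may differ from the actual implementation):

```python
from io import BytesIO

from PIL import Image


def resize_image(image_bytes: bytes, target_size: int = 224) -> Image.Image:
    """Resize an image to target_size x target_size.

    Illustrative helper only -- mirrors the target_size parameter of
    preprocess(), not the library's actual code.
    """
    img = Image.open(BytesIO(image_bytes)).convert("RGB")
    return img.resize((target_size, target_size))


# Round-trip a synthetic image through the helper
# (HAM10000 images are 600x450 pixels):
buf = BytesIO()
Image.new("RGB", (600, 450), "gray").save(buf, format="PNG")
resized = resize_image(buf.getvalue())
```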
## DataLoader Functions

### create_dataloaders

`mlops_project.dataloader.create_dataloaders(data_path, image_size=224, batch_size=32, num_workers=4, train_ratio=0.525, val_ratio=0.175, test_ratio=0.3, random_seed=42)`

Create train, validation, and test dataloaders.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_path` | `str` | Path to the data folder (e.g., `'data/raw/ham10000'`) | *required* |
| `image_size` | `int` | Target size for image resizing | `224` |
| `batch_size` | `int` | Batch size for dataloaders | `32` |
| `num_workers` | `int` | Number of workers for parallel data loading | `4` |
| `train_ratio` | `float` | Fraction of data for training (default: 52.5%) | `0.525` |
| `val_ratio` | `float` | Fraction of data for validation (default: 17.5%) | `0.175` |
| `test_ratio` | `float` | Fraction of data for testing (default: 30%) | `0.3` |
| `random_seed` | `int` | Random seed for reproducibility | `42` |

Returns:

| Type | Description |
|---|---|
| `tuple[DataLoader, DataLoader, DataLoader]` | Tuple of `(train_loader, val_loader, test_loader)` |
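As a reference for how the default ratios partition a dataset, here is a pure-Python sketch of a seeded 52.5/17.5/30 index split. This is illustrative arithmetic, not the library's implementation (which splits by `lesion_id`, not raw index):

```python
import random


def split_sizes_demo(n, train_ratio=0.525, val_ratio=0.175, seed=42):
    """Shuffle indices 0..n-1 with a fixed seed and cut into
    train/val/test chunks by ratio. Hypothetical helper for illustration."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed -> reproducible split
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder covers test_ratio
    return train, val, test


# HAM10000 contains 10015 images:
train, val, test = split_sizes_demo(10015)
```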
### split_dataset_indices

`mlops_project.dataloader.split_dataset_indices(metadata_path, train_ratio=0.525, val_ratio=0.175, test_ratio=0.3, random_seed=42)`

Split dataset indices into train/val/test by `lesion_id` to avoid data leakage.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metadata_path` | `str` | Path to the HAM10000_metadata.csv file | *required* |
| `train_ratio` | `float` | Fraction of data for training (default: 52.5%) | `0.525` |
| `val_ratio` | `float` | Fraction of data for validation (default: 17.5%) | `0.175` |
| `test_ratio` | `float` | Fraction of data for testing (default: 30%) | `0.3` |
| `random_seed` | `int` | Random seed for reproducibility | `42` |

Returns:

| Type | Description |
|---|---|
| `tuple[list[int], list[int], list[int]]` | Tuple of `(train_indices, val_indices, test_indices)` |

**Note:** Splits by `lesion_id` rather than `image_id` to prevent the same lesion from appearing in multiple splits (data leakage).
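The note above can be illustrated with a short sketch. This is not the library's code; it only assumes each metadata row carries a `lesion_id` (as in HAM10000_metadata.csv), and shows why assigning whole lesions to splits prevents two images of the same lesion from landing in different splits:

```python
import random
from collections import defaultdict


def split_by_lesion(lesion_ids, train_ratio=0.525, val_ratio=0.175, seed=42):
    """Assign whole lesions (not individual images) to splits.

    lesion_ids: per-image lesion identifiers. Returns per-image index
    lists. Hypothetical helper for illustration.
    """
    # Group image indices by the lesion they belong to.
    groups = defaultdict(list)
    for idx, lesion in enumerate(lesion_ids):
        groups[lesion].append(idx)

    # Shuffle and split at the lesion level.
    lesions = sorted(groups)
    random.Random(seed).shuffle(lesions)
    n = len(lesions)
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)

    def flatten(subset):
        return [i for lesion in subset for i in groups[lesion]]

    train = flatten(lesions[:n_train])
    val = flatten(lesions[n_train:n_train + n_val])
    test = flatten(lesions[n_train + n_val:])
    return train, val, test


# Both images of lesion "a" always end up in the same split:
ids = ["a", "a", "b", "c", "c", "d"]
train, val, test = split_by_lesion(ids)
```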
### subsample_dataloader

`mlops_project.dataloader.subsample_dataloader(data_path, subsample_result, image_size=224, batch_size=32, num_workers=4, train_ratio=0.525, val_ratio=0.175, test_ratio=0.3, random_seed=42)`
## Usage Examples

```python
from mlops_project.data import CancerDataset, get_transforms
from mlops_project.dataloader import create_dataloaders

# Create dataset
transform = get_transforms(image_size=224, augment=True)
dataset = CancerDataset(data_path="data/raw/HAM10000", transform=transform)

# Create dataloaders
train_loader, val_loader, test_loader = create_dataloaders(
    data_path="data/raw/HAM10000",
    batch_size=32,
    image_size=224,
)
```