ENH: Minimize frame classification dataset size, fix #717 (#718)

* Change function vak.prep.frame_classification.dataset_arrays.make_npy_files_for_each_split to remove spectrogram/audio files from dataset path after making the npy files
* Modify prep_spectrogram_dataset so that it no longer makes a directory 'spectrogram_generated_{timenow}' -- that way we don't have to delete the directory when we remove the spectrograms after converting to npy files later
* Rename get_train_dur_replicate_split_name -> get_train_dur_replicate_subset_name in src/vak/common/learncurve.py
* Modify src/vak/prep/frame_classification/learncurve.py to no longer make duplicate npy files for each subset name, and to add subset names in a separate column from split so that we can specify subsets directly in learncurve
* Add subset parameter to src/vak/datasets/frame_classification/frames_dataset.py, that takes precedence over split parameter when selecting part of dataframe to use for grabbing samples
* Add subset parameter to src/vak/datasets/frame_classification/window_dataset.py, that takes precedence over split parameter when selecting part of dataframe to use for grabbing samples
* Rename split parameter of vak.train.frame_classification to subset, and use when making training dataset instance
* Use subset inside of src/vak/learncurve/frame_classification.py
* Have StandardizeSpect.fit_dataset_path take subset argument and have it take precedence over split when fitting, as with dataset classes
* Use split + subset when calling StandardizeSpect.fit_dataset_path in src/vak/train/frame_classification.py
* Use subset not split argument when calling training functions for model families in src/vak/train/train_.py
* WIP: Use subset with ParametricUMAPDataset (haven't added argument to dataset class yet)
* Add function `make_index_vectors_for_each_subset` to src/vak/prep/frame_classification/learncurve.py, rename `make_learncurve_splits` to `make_subsets_from_dataset_df` and have it call `make_index_vectors`
* Revise a couple things in docstring in src/vak/prep/frame_classification/dataset_arrays.py
* Have audio_format default to None in src/vak/prep/frame_classification/dataset_arrays.py and raise ValueError if input_type is audio but audio_format is None
* Fix parameter order of function in src/vak/prep/frame_classification/learncurve.py to match order of dataset_arrays so it's not confusing, and set default of audio_format to None, raise a ValueError if input_type is audio but audio_format is None
* In src/vak/prep/frame_classification/frame_classification.py, call make_subsets_from_dataset_df with correct arguments (now renamed from make_learncurve_splits_from_dataset_df)
* Add src/vak/datasets/frame_classification/helper.py with helper functions that return filenames of indexing vectors for subsets of (training) data
* Import helper in src/vak/datasets/frame_classification/__init__.py
* Use helper functions to load indexing vectors for subsets in classmethod of src/vak/datasets/frame_classification/window_dataset.py
* Use helper functions to load indexing vectors for subsets in classmethod of src/vak/datasets/frame_classification/frames_dataset.py
* Rewrite functions in src/vak/prep/frame_classification/frame_classification.py -- realize I can just use frame npy files to make indexing vectors, so I don't need input type, audio format, etc.
* Fix args to make_index_vectors_for_each_subset and fix how we concatenate dataset_df in src/vak/prep/frame_classification/learncurve.py
* Fix how we use subset in FramesDataset.__init__
* Fix how we use subset in WindowDataset.__init__
* Change word 'split' -> 'subset' in src/vak/learncurve/frame_classification.py
* Fix docstrings in src/vak/datasets/frame_classification/window_dataset.py
* Fix docstrings in src/vak/datasets/frame_classification/frames_dataset.py
* Fix a typo in a docstring in src/vak/datasets/frame_classification/window_dataset.py
* Fix subset parameter of classmethod for ParametricUMAPDataset class; move logic from classmethod into __init__ although I'm not sure this is a good idea
* Rename frame_classification/dataset_arrays.py to frame_classification/make_splits.py and rewrite 'make_npy_paths' as 'make_splits', have it move/copy/create audio or spectrogram files in split dirs, in addition to making npy files, and update the 'audio_path' or 'spect_path' columns with the files in the split dirs
* Remove constants from src/vak/datasets/frame_classification/constants.py that are no longer used for 'frames' files
* Use make_splits function in src/vak/prep/frame_classification/frame_classification.py
* Modify make_dataframe_of_spect_files function in src/vak/prep/spectrogram_dataset/spect_helper.py so it no longer converts mat files into npz files, instead it just finds/collates all the spect files and returns them in the dataframe; any converting is done by frame_classification.make_splits with the output of this function
* Fix typo in list comprehension and add info to docstring in src/vak/prep/frame_classification/make_splits.py
* Fix imports in src/vak/prep/frame_classification/__init__.py after renaming module to 'make_splits'
* Remove other occurrences of 'spect_output_dir' from src/vak/prep/spectrogram_dataset/spect_helper.py, no longer is a parameter and not used
* No longer pass 'spect_output_dir' into 'prep_spectrogram_dataset' in src/vak/prep/spectrogram_dataset/prep.py
* Remove unused import in src/vak/prep/spectrogram_dataset/spect_helper.py
* Add logger statement in src/vak/prep/frame_classification/make_splits.py
* Fix src/vak/prep/frame_classification/learncurve.py so functions use either spect or audio to get frames and make indexing vectors
* Fix src/vak/prep/frame_classification/frame_classification.py so we pass needed parameters into make_subsets_from_dataset_df
* Make x_path relative to dataset_path in src/vak/prep/frame_classification/frame_classification.py, since that's what downstream functions/classes expect
* Rename x_path -> source_path in src/vak/prep/frame_classification/make_splits.py
* Rename x_path -> source_path in src/vak/prep/frame_classification/learncurve.py
* Rewrite frame_classification.WindowDataset to load audio/spectrograms directly from 'frame_paths'
* Add FRAMES_PATH_COL_NAME to src/vak/datasets/frame_classification/constants.py
* Rewrite make_splits.py to add frames_path column to dataframe, and have frame_classification models use that column always; this way we keep the original 'audio_path' and 'spect_path' columns as metadata, and avoid if/else logic everywhere in dataset classes
* Fix WindowDataset to use constant to load frame paths column, and to validate input type, revise docstring
* Fix FramesDataset the same way as WindowDataset: load frame paths with constant, load inside __getitem__ with helper function _load_frames, validate input type, fix order of attributes in docstring
* Use self.dataset_path to build frames_path in WindowDataset
* Use self.dataset_path to build frames_path in FramesDataset, and pass into transform as 'frames_path', not 'source_path'
* Rename 'source_path' -> 'frames_path' inside src/vak/transforms/defaults/frame_classification.py
* Rename 'source_path' -> 'frames_path' in FrameClassificationModel methods, in src/vak/models/frame_classification_model.py
* Rename 'source_path' -> 'frames_path' in src/vak/predict/frame_classification.py
* Add SPECT_KEY to common.constants
* Fix how StandardizeSpect.from_dataset_path builds frames_path paths, and use constants.SPECT_KEY when loading from frames path
* Use common.constants.SPECT_KEY inside _load_frames method of WindowDataset
* Use common.constants.SPECT_KEY inside _load_frames method of FramesDataset
* Add newline at end of src/vak/common/constants.py
* Add FRAME_CLASSIFICATION_DATASET_AUDIO_FORMAT to src/vak/datasets/frame_classification/constants.py
* Add function load_frames to src/vak/datasets/frame_classification/helper.py
* Have WindowDataset._load_frames use helper.load_frames
* Have FramesDataset._load_frames use helper.load_frames
* Rename GENERATED_TEST_DATA -> GENERATED_TEST_DATA_ROOT in tests/scripts/vaktestdata/constants.py
* Rename GENERATED_TEST_DATA -> GENERATED_TEST_DATA_ROOT in tests/scripts/vaktestdata/dirs.py
* Add tests/scripts/vaktestdata/spect.py
* Import spect module in tests/scripts/vaktestdata/__init__.py
* Call vaktestdata.spect.prep_spects in prep section of script tests/scripts/generate_data_for_tests.py
* Fix spect_dir_npz fixture in tests/fixtures/spect.py to use directory of just .spect.npz files that is now generated by the generate_test_data script
* Add SPECT_NPZ_EXTENSION to src/vak/common/constants.py
* Use common.SPECT_NPZ_EXTENSION in src/vak/prep/spectrogram_dataset/audio_helper.py
* Fix prep.frame_classification.make_splits to remove any .spect.npz files remaining in dataset_path, that were not moved into splits
* Fix vak.prep.frame_classification.learncurve.make_index_vectors_for_subsets to use frame_paths column instead of 'source' paths (audio_path or spect_path) -- so we are using files that definitely exist and are already assigned to splits
* WIP: Rewriting unit tests in tests/test_prep/test_frame_classification/test_learncurve.py
* WIP: Rewriting unit tests in tests/test_prep/test_frame_classification/test_make_splits.py
* WIP: Add tests/test_datasets/test_frame_classification/test_helper.py
* Rename specific_config -> specific_config_toml_path
* WIP: Rewriting tests/test_prep/test_frame_classification/test_make_splits.py
* Add src/vak/prep/frame_classification/get_or_make_source_files.py
* Add src/vak/prep/frame_classification/assign_samples_to_splits.py
* Rewrite 'prep_frame_classification_dataset' to use helper functions factored out into other modules: get_or_make_source_files and assign_samples_to_splits
* Capitalize in docstring in src/vak/prep/spectrogram_dataset/prep.py
* Add TIMEBINS_KEY to src/vak/common/constants.py
* Finish fixing unit test for vak.prep.frame_classification.make_splits
* Add imports in src/vak/prep/frame_classification/__init__.py
* Revise docstring of src/vak/prep/audio_dataset.py to refer to 'source_files_df'
* Revise docstring of src/vak/prep/spectrogram_dataset/spect_helper.py to refer to 'source_files_df'
* Revise docstring of src/vak/prep/spectrogram_dataset/prep.py to refer to 'source_files_df'
* Revise src/vak/prep/frame_classification/get_or_make_source_files.py to refer to 'source_files_df', in docstring and inside function
* In 'prep_frame_classification_dataset', differentiate between 'source_files_df' and 'dataset_df'
* Delete birdsong-recognition-dataset configs from tests/data_for_tests/configs
* Fix a docstring in noxfile.py
* Remove tests/scripts/vaktestdata/spect.py
* Add model_family field in tests/data_for_tests/configs/configs.json, remove configs for birdsong-recognition-dataset
* Add model_family field to ConfigMetadata dataclass in tests/scripts/vaktestdata/config_metadata.py
* Remove call to vaktestdata.spect.prep_spects() since we are going to call other functions that will make spectrograms
* Change parameters order of frame_classification.get_or_make_source_files, add pre-conditions/validators
* Fix order of args to get_or_make_source_files in src/vak/prep/frame_classification/frame_classification.py
* Add more to docstring of src/vak/prep/frame_classification/get_or_make_source_files.py
* Add 'spect_output_dir' and 'data_dir' fields to tests/data_for_tests/configs/configs.json
* Rewrite ConfigMetadata dataclass, add docstring and converters, add spect_output_dir and data_dir attributes
* Add functions to make more directories in tests/data_for_tests/generated in tests/scripts/vaktestdata/dirs.py
* Import get_or_make_source_files in tests/scripts/vaktestdata/__init__.py
* Add more constants with names of directories to make in tests/data_for_tests/generated in tests/scripts/vaktestdata/constants.py
* Add tests/scripts/vaktestdata/get_or_make_source_files.py
* Add 'spect-output-dir/' to data_dir paths in tests/data_for_tests/configs/configs.json
* Rename tests/scripts/vaktestdata/get_or_make_source_files.py -> tests/scripts/vaktestdata/source_files.py, rewrite function that makes source files + csv files we use with tests
* Fix tests/scripts/vaktestdata/__init__.py to import source_files module, remove import of get_or_make_source_files module that was renamed to source_files
* Import missing module constants and fix order of arguments to prep_spectrogram_dataset in src/vak/prep/frame_classification/get_or_make_source_files.py
* Change 3 configs to have spect_format option set to npz
* Remove import of module spect in tests/scripts/vaktestdata/__init__.py
* Flesh out function in tests/scripts/vaktestdata/source_files.py
* Add log statements in tests/scripts/generate_data_for_tests.py
* Fix typo in src/vak/prep/frame_classification/get_or_make_source_files.py
* Add SPECT_FORMAT_EXT_MAP to src/vak/common/constants.py
* Use vak.common.constants.SPECT_FORMAT_EXT_MAP in src/vak/prep/spectrogram_dataset/prep.py so that we correctly remove source file extension to pair with annotation file
* Fix attributes of ConfigMetadata so we don't convert None to 'None'
* Copy annotation files to spect_output_dir so we can prep from that dir, in tests/scripts/vaktestdata/source_files.py
* Change name of logger in tests/scripts/generate_data_for_tests.py
* Fix attributes in ConfigMetadata so we don't convert strings to bool
* Remove fixtures from tests/fixtures/annot.py after removing corresponding source data
* Fix import in src/vak/prep/frame_classification/__init__.py
* Fix import in src/vak/prep/frame_classification/frame_classification.py
* Add tests/fixtures/source_files with fixtures to get csv files
* Add fixtures that return dataframes directly in tests/fixtures/source_files.py
* Add tests/test_prep/test_frame_classification/test_get_or_make_source_files.py
* Add tests/test_prep/test_frame_classification/test_assign_samples_to_splits.py
* Fix factory functions in tests/fixtures/source_files.py
* Fix assembled path in tests/fixtures/source_files.py
* Fix unit test in tests/test_prep/test_frame_classification/test_make_splits.py to use fixture so it's faster and less verbose
* Remove fixtures that no longer exist from specific_annot_list fixture in tests/fixtures/annot.py
* Remove fixtures for data that doesn't exist in tests/fixtures/audio.py
* Remove birdsong-rec from parametrize in tests/test_cli/test_predict.py
* Remove birdsongrec from parametrize in tests/test_cli/test_prep.py
* Remove birdsongrec from parametrize in tests/test_cli/test_train.py
* Remove birdsongrec and other data no longer in source from parametrizes in tests/test_common/test_annotation.py
* Remove birdsongrec from parametrize in tests/test_predict/test_frame_classification.py
* Remove birdsongrec from parametrize in tests/test_prep/test_frame_classification/test_frame_classification.py
* Remove birdsongrec from parametrize in tests/test_prep/test_prep.py
* Remove birdsongrec from parametrize in tests/test_prep/test_sequence_dataset.py
* Remove birdsongrec from parametrize in tests/test_train/test_frame_classification.py
* Remove birdsongrec from parametrize in tests/test_train/test_train.py
* Remove unit tests from tests/test_common/test_files/test_files.py that test on data removed from source data
* Remove parametrize that uses wav/textgrid data removed from source data
* Fix fixture in tests/fixtures/spect.py
* Actually write unit tests in tests/test_datasets/test_frame_classification/test_helper.py
* Fix prep.frame_classification.make_splits to not convert frame labels npy paths to 'None' when they are None
* Fix assert helper in tests/test_prep/test_frame_classification/test_frame_classification.py
* Remove spect_key and audio_format parameters from functions in src/vak/prep/frame_classification/learncurve.py, no longer used
* Change order of params for make_subsets_from_dataset_df
* Change order of args in call to make_subsets_from_dataset_df inside prep_frame_classification_dataset
* Rename some variables to 'subset_df' in src/vak/prep/frame_classification/learncurve.py and revise docstrings
* Finish adding/fixing unit tests in tests/test_prep/test_frame_classification/test_learncurve.py
* Fix bug in unit test in tests/test_prep/test_frame_classification/test_make_splits.py
* Fix unit tests in tests/test_prep/test_spectrogram_dataset/test_prep.py
* Fix unit test in tests/test_prep/test_spectrogram_dataset/test_spect_helper.py
* Fix unit test in tests/test_transforms/test_transforms.py
* Use torch.testing.assert_close instead of assert_allclose in tests/test_nn/test_loss/test_dice.py
vocalpy · Oct 10, 2023 · cbc3f82
1 parent 983d231
commit cbc3f82
Showing 81 changed files with 2,544 additions and 1,209 deletions.
diff --git a/noxfile.py b/noxfile.py
@@ -157,7 +157,7 @@ def copy_url(url: str, path: str) -> None:
 
 @nox.session(name='test-data-tar-source')
 def test_data_tar_source(session) -> None:
-    """Make a .tar.gz file of just the 'generated' test data used to run tests on CI."""
+    """Make a .tar.gz file of just the 'source' test data used to run tests."""
     session.log(f"Making tarfile with source data: {SOURCE_TEST_DATA_TAR}")
     make_tarfile(SOURCE_TEST_DATA_TAR, SOURCE_TEST_DATA_DIRS)
 

diff --git a/src/vak/common/constants.py b/src/vak/common/constants.py
@@ -42,3 +42,17 @@
 # ---- output (default) file extensions. Using the `pathlib` name "suffix" ----
 ANNOT_CSV_SUFFIX = ".annot.csv"
 NET_OUTPUT_SUFFIX = ".output.npz"
+
+# ---- the key for loading the spectrogram matrix from an npz file
+# TODO: replace this with vocalpy constants when we move to VocalPy
+SPECT_KEY = "s"
+TIMEBINS_KEY = "t"
+
+# TODO: replace this with vocalpy extension when we move to VocalPy
+# ---- the extension used to save spectrograms in npz array files
+# used by :func:`vak.prep.spectrogram_dataset.audio_helper.make
+SPECT_NPZ_EXTENSION = ".spect.npz"
+SPECT_FORMAT_EXT_MAP = {
+    "npz": SPECT_NPZ_EXTENSION,
+    "mat": ".mat",
+}
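
The commit message says `SPECT_FORMAT_EXT_MAP` is used in src/vak/prep/spectrogram_dataset/prep.py to remove the source file extension so that a spectrogram file can be paired with its annotation file. A minimal sketch of that idea follows, assuming only the constants added above. `strip_spect_extension` is a hypothetical helper name for illustration and is not taken from vak.

```python
# Constants copied from the diff above.
SPECT_NPZ_EXTENSION = ".spect.npz"
SPECT_FORMAT_EXT_MAP = {
    "npz": SPECT_NPZ_EXTENSION,
    "mat": ".mat",
}


def strip_spect_extension(filename: str, spect_format: str) -> str:
    """Hypothetical helper: remove the extension mapped to ``spect_format``."""
    if spect_format not in SPECT_FORMAT_EXT_MAP:
        raise ValueError(
            f"Unknown spect_format: {spect_format!r}. "
            f"Valid formats: {sorted(SPECT_FORMAT_EXT_MAP)}"
        )
    ext = SPECT_FORMAT_EXT_MAP[spect_format]
    if not filename.endswith(ext):
        raise ValueError(f"{filename!r} does not end with {ext!r}")
    # Slice off the whole multi-part extension in one step.
    return filename[: -len(ext)]


print(strip_spect_extension("bird1.wav.spect.npz", "npz"))  # bird1.wav
print(strip_spect_extension("bird1.mat", "mat"))  # bird1
```

A lookup table helps here because `.spect.npz` is a two-part extension: `pathlib.Path("bird1.wav.spect.npz").stem` would remove only `.npz` and leave `bird1.wav.spect`.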
diff --git a/src/vak/common/learncurve.py b/src/vak/common/learncurve.py
@@ -1,10 +1,10 @@
-def get_train_dur_replicate_split_name(
+def get_train_dur_replicate_subset_name(
     train_dur: int, replicate_num: int
 ) -> str:
-    """Get name of a training set split for a learning curve,
+    """Get name of a training set subset for a learning curve,
     for a specified training duration and replicate number.
 
-    Used when preparing the training set splits for a learning curve,
+    Used when preparing the training set subsets for a learning curve,
     and when training models to generate the results for the curve.
     """
     return f"train-dur-{float(train_dur)}-replicate-{int(replicate_num)}"
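
For reference, this is what the renamed function produces. The function body is copied from the diff above; the training durations and replicate numbers in the loop are arbitrary illustration values.

```python
def get_train_dur_replicate_subset_name(
    train_dur: int, replicate_num: int
) -> str:
    """Copied from the diff above: name of a training subset for a learning curve."""
    return f"train-dur-{float(train_dur)}-replicate-{int(replicate_num)}"


# Arbitrary example values: two training durations, two replicates each.
for train_dur in (30, 45.5):
    for replicate_num in (1, 2):
        print(get_train_dur_replicate_subset_name(train_dur, replicate_num))
# train-dur-30.0-replicate-1
# train-dur-30.0-replicate-2
# train-dur-45.5-replicate-1
# train-dur-45.5-replicate-2
```

Because `train_dur` is cast to `float`, an integer duration such as `30` appears as `30.0` in the subset name.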
diff --git a/src/vak/datasets/frame_classification/__init__.py b/src/vak/datasets/frame_classification/__init__.py
@@ -1,6 +1,6 @@
-from . import constants
+from . import constants, helper
 from .frames_dataset import FramesDataset
 from .metadata import Metadata
 from .window_dataset import WindowDataset
 
-__all__ = ["constants", "Metadata", "FramesDataset", "WindowDataset"]
+__all__ = ["constants", "helper", "Metadata", "FramesDataset", "WindowDataset"]
diff --git a/src/vak/datasets/frame_classification/constants.py b/src/vak/datasets/frame_classification/constants.py
@@ -1,8 +1,8 @@
-FRAMES_ARRAY_EXT = ".frames.npy"
-FRAMES_NPY_PATH_COL_NAME = "frames_npy_path"
+FRAMES_PATH_COL_NAME = "frames_path"
 FRAME_LABELS_EXT = ".frame_labels.npy"
 FRAME_LABELS_NPY_PATH_COL_NAME = "frame_labels_npy_path"
 ANNOTATION_CSV_FILENAME = "y.csv"
 SAMPLE_IDS_ARRAY_FILENAME = "sample_ids.npy"
 INDS_IN_SAMPLE_ARRAY_FILENAME = "inds_in_sample.npy"
 WINDOW_INDS_ARRAY_FILENAME = "window_inds.npy"
+FRAME_CLASSIFICATION_DATASET_AUDIO_FORMAT = "wav"
diff --git a/src/vak/datasets/frame_classification/frames_dataset.py b/src/vak/datasets/frame_classification/frames_dataset.py
@@ -1,3 +1,6 @@
+"""A dataset class used for neural network models with the
+frame classification task, where the source data consists of audio signals
+or spectrograms of varying lengths."""
 from __future__ import annotations
 
 import pathlib
@@ -7,8 +10,9 @@
 import numpy.typing as npt
 import pandas as pd
 
-from . import constants
+from . import constants, helper
 from .metadata import Metadata
+from ... import common
 
 
 class FramesDataset:
@@ -20,49 +24,127 @@ class FramesDataset:
 
     Attributes
     ----------
-    dataset_path
-    dataset_df
-    frame_dur : float
-        Duration of a single frame, in seconds.
-    duration : float
-        Total duration of the dataset.
+    dataset_path : pathlib.Path
+        Path to directory that represents a
+        frame classification dataset,
+        as created by
+        :func:`vak.prep.prep_frame_classification_dataset`.
+    split : str
+        The name of a split from the dataset,
+        one of {'train', 'val', 'test'}.
+    subset : str, optional
+        Name of subset to use.
+        If specified, this takes precedence over split.
+        Subsets are typically taken from the training data
+        for use when generating a learning curve.
+    dataset_df : pandas.DataFrame
+        A frame classification dataset,
+        represented as a :class:`pandas.DataFrame`.
+        This will be only the rows that correspond
+        to either ``subset`` or ``split`` from the
+        ``dataset_df`` that was passed in when
+        instantiating the class.
+    frames_paths : numpy.ndarray
+        Paths to npy files containing frames,
+        either spectrograms or audio signals
+        that are input to the model.
+    frame_labels_paths : numpy.ndarray
+        Paths to npy files containing vectors
+        with a label for each frame.
+        The targets for the outputs of the model.
+    input_type : str
+        The type of input to the neural network model.
+        One of {'audio', 'spect'}.
+    sample_ids : numpy.ndarray
+        Indexing vector representing which sample
+        from the dataset every frame belongs to.
+    inds_in_sample : numpy.ndarray
+        Indexing vector representing which index
+        within each sample from the dataset
+        that every frame belongs to.
+    frame_dur : float
+        Duration of a frame, i.e., a single sample in audio
+        or a single timebin in a spectrogram.
+    item_transform : callable, optional
+        Transform applied to each item :math:`(x, y)`
+        returned by :meth:`FramesDataset.__getitem__`.
     """
 
     def __init__(
         self,
         dataset_path: str | pathlib.Path,
         dataset_df: pd.DataFrame,
+        input_type: str,
         split: str,
         sample_ids: npt.NDArray,
         inds_in_sample: npt.NDArray,
         frame_dur: float,
-        input_type: str,
+        subset: str | None = None,
         item_transform: Callable | None = None,
     ):
-        self.dataset_path = pathlib.Path(dataset_path)
+        """Initialize a new instance of a FramesDataset.
 
+        Parameters
+        ----------
+        dataset_path : pathlib.Path
+            Path to directory that represents a
+            frame classification dataset,
+            as created by
+            :func:`vak.prep.prep_frame_classification_dataset`.
+        dataset_df : pandas.DataFrame
+            A frame classification dataset,
+            represented as a :class:`pandas.DataFrame`.
+        input_type : str
+            The type of input to the neural network model.
+            One of {'audio', 'spect'}.
+        split : str
+            The name of a split from the dataset,
+            one of {'train', 'val', 'test'}.
+        sample_ids : numpy.ndarray
+            Indexing vector representing which sample
+            from the dataset every frame belongs to.
+        inds_in_sample : numpy.ndarray
+            Indexing vector representing which index
+            within each sample from the dataset
+            that every frame belongs to.
+        frame_dur : float
+            Duration of a frame, i.e., a single sample in audio
+            or a single timebin in a spectrogram.
+        subset : str, optional
+            Name of subset to use.
+            If specified, this takes precedence over split.
+            Subsets are typically taken from the training data
+            for use when generating a learning curve.
+        item_transform : callable, optional
+            Transform applied to each item :math:`(x, y)`
+            returned by :meth:`FramesDataset.__getitem__`.
+        """
+        from ... import prep  # avoid circular import, use for constants.INPUT_TYPES
+        if input_type not in prep.constants.INPUT_TYPES:
+            raise ValueError(
+                f"``input_type`` must be one of: {prep.constants.INPUT_TYPES}\n"
+                f"Value for ``input_type`` was: {input_type}"
+            )
+
+        self.dataset_path = pathlib.Path(dataset_path)
         self.split = split
-        dataset_df = dataset_df[dataset_df.split == split].copy()
+        self.subset = subset
+        # subset takes precedence over split, if specified
+        if subset:
+            dataset_df = dataset_df[dataset_df.subset == subset].copy()
+        else:
+            dataset_df = dataset_df[dataset_df.split == split].copy()
         self.dataset_df = dataset_df
+        self.input_type = input_type
         self.frames_paths = self.dataset_df[
-            constants.FRAMES_NPY_PATH_COL_NAME
+            constants.FRAMES_PATH_COL_NAME
         ].values
         if split != "predict":
             self.frame_labels_paths = self.dataset_df[
                 constants.FRAME_LABELS_NPY_PATH_COL_NAME
             ].values
         else:
             self.frame_labels_paths = None
-
-        if input_type == "audio":
-            self.source_paths = self.dataset_df["audio_path"].values
-        elif input_type == "spect":
-            self.source_paths = self.dataset_df["spect_path"].values
-        else:
-            raise ValueError(
-                f"Invalid `input_type`: {input_type}. Must be one of {{'audio', 'spect'}}."
-            )
-
         self.sample_ids = sample_ids
         self.inds_in_sample = inds_in_sample
         self.frame_dur = float(frame_dur)
@@ -78,10 +160,20 @@ def shape(self):
         tmp_item = self.__getitem__(tmp_x_ind)
         return tmp_item["frames"].shape
 
+    def _load_frames(self, frames_path):
+        """Helper function that loads "frames",
+        the input to the frame classification model.
+        Loads audio or spectrogram, depending on
+        :attr:`self.input_type`.
+        This function assumes that audio is in wav format 
+        and spectrograms are in npz files.
+        """
+        return helper.load_frames(frames_path, self.input_type)
+
     def __getitem__(self, idx):
-        source_path = self.source_paths[idx]
-        frames = np.load(self.dataset_path / self.frames_paths[idx])
-        item = {"frames": frames, "source_path": source_path}
+        frames_path = self.dataset_path / self.frames_paths[idx]
+        frames = self._load_frames(frames_path)
+        item = {"frames": frames, "frames_path": frames_path}
         if self.frame_labels_paths is not None:
             frame_labels = np.load(
                 self.dataset_path / self.frame_labels_paths[idx]
@@ -102,19 +194,34 @@ def from_dataset_path(
         cls,
         dataset_path: str | pathlib.Path,
         split: str = "val",
+        subset: str | None = None,
         item_transform: Callable | None = None,
     ):
-        """
+        """Make a :class:`FramesDataset` instance,
+        given the path to a frame classification dataset.
 
         Parameters
         ----------
-        dataset_path
-        split
-        item_transform
+        dataset_path : pathlib.Path
+            Path to directory that represents a
+            frame classification dataset,
+            as created by
+            :func:`vak.prep.prep_frame_classification_dataset`.
+        split : str
+            The name of a split from the dataset,
+            one of {'train', 'val', 'test'}.
+        subset : str, optional
+            Name of subset to use.
+            If specified, this takes precedence over split.
+            Subsets are typically taken from the training data
+            for use when generating a learning curve.
+        item_transform : callable, optional
+            Transform applied to each item :math:`(x, y)`
+            returned by :meth:`FramesDataset.__getitem__`.
 
         Returns
         -------
-
+        frames_dataset : FramesDataset
         """
         dataset_path = pathlib.Path(dataset_path)
         metadata = Metadata.from_dataset_path(dataset_path)
@@ -125,20 +232,26 @@ def from_dataset_path(
         dataset_df = pd.read_csv(dataset_csv_path)
 
         split_path = dataset_path / split
-        sample_ids_path = split_path / constants.SAMPLE_IDS_ARRAY_FILENAME
+        if subset:
+            sample_ids_path = split_path / helper.sample_ids_array_filename_for_subset(subset)
+        else:
+            sample_ids_path = split_path / constants.SAMPLE_IDS_ARRAY_FILENAME
         sample_ids = np.load(sample_ids_path)
-        inds_in_sample_path = (
-            split_path / constants.INDS_IN_SAMPLE_ARRAY_FILENAME
-        )
+
+        if subset:
+            inds_in_sample_path = split_path / helper.inds_in_sample_array_filename_for_subset(subset)
+        else:
+            inds_in_sample_path = split_path / constants.INDS_IN_SAMPLE_ARRAY_FILENAME
         inds_in_sample = np.load(inds_in_sample_path)
 
         return cls(
             dataset_path,
             dataset_df,
+            input_type,
             split,
             sample_ids,
             inds_in_sample,
             frame_dur,
-            input_type,
+            subset,
             item_transform,
         )
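
The rule that `subset` takes precedence over `split` can be sketched without pandas. This is a dependency-free illustration, not the class's code: a list of dicts stands in for `dataset_df`, `select_rows` is a hypothetical name, and storing `None` for rows that belong to no subset is an assumption about how the subset column is filled.

```python
from __future__ import annotations

# Stand-in for dataset_df: one dict per row, with separate split and subset columns.
# Assumption: rows outside any subset hold None in the subset column.
rows = [
    {"frames_path": "train/a.spect.npz", "split": "train",
     "subset": "train-dur-30.0-replicate-1"},
    {"frames_path": "train/b.spect.npz", "split": "train",
     "subset": "train-dur-30.0-replicate-1"},
    {"frames_path": "train/c.spect.npz", "split": "train", "subset": None},
    {"frames_path": "val/d.spect.npz", "split": "val", "subset": None},
]


def select_rows(rows: list[dict], split: str, subset: str | None = None) -> list[dict]:
    """Hypothetical helper mirroring the selection logic in ``FramesDataset.__init__``."""
    if subset:
        # subset takes precedence over split, if specified
        return [row for row in rows if row["subset"] == subset]
    return [row for row in rows if row["split"] == split]


print(len(select_rows(rows, split="train")))  # 3
print(len(select_rows(rows, split="train", subset="train-dur-30.0-replicate-1")))  # 2
```

As in the diff, the check is a truthiness test, so an empty string for `subset` falls back to selecting by `split`.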
diff --git a/src/vak/datasets/frame_classification/helper.py b/src/vak/datasets/frame_classification/helper.py
@@ -0,0 +1,37 @@
+"""Helper functions used with frame classification datasets."""
+from __future__ import annotations
+
+from . import constants
+from ... import common
+
+
+def sample_ids_array_filename_for_subset(subset: str) -> str:
+    """Returns name of sample IDs array file for a subset of the training data."""
+    return constants.SAMPLE_IDS_ARRAY_FILENAME.replace(
+                '.npy', f'-{subset}.npy'
+            )
+
+
+def inds_in_sample_array_filename_for_subset(subset: str) -> str:
+    """Returns name of inds in sample array file for a subset of the training data."""
+    return constants.INDS_IN_SAMPLE_ARRAY_FILENAME.replace(
+        '.npy', f'-{subset}.npy'
+    )
+
+
+def load_frames(frames_path, input_type):
+    """Helper function that loads "frames",
+    the input to the frame classification model.
+    Loads audio or spectrogram, depending on
+    ``input_type``.
+    This function assumes that audio is in wav format
+    and spectrograms are in npz files.
+    """
+    if input_type == "audio":
+        frames, _ = common.constants.AUDIO_FORMAT_FUNC_MAP[
+            constants.FRAME_CLASSIFICATION_DATASET_AUDIO_FORMAT
+        ](frames_path)
+    elif input_type == "spect":
+        spect_dict = common.files.spect.load(frames_path)
+        frames = spect_dict[common.constants.SPECT_KEY]
+    return frames
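
A short sketch of how these helpers behave. The two filename functions and constants are copied from this commit. `load_frames_checked` is a hypothetical variant, not part of vak: it takes the loaders as a dict so the example runs without vak or numpy, and it raises an explicit `ValueError` for an unrecognized `input_type`. That error is an addition for illustration; in the function above, such a value would reach `return frames` with `frames` unassigned.

```python
from typing import Any, Callable

# Constants copied from src/vak/datasets/frame_classification/constants.py.
SAMPLE_IDS_ARRAY_FILENAME = "sample_ids.npy"
INDS_IN_SAMPLE_ARRAY_FILENAME = "inds_in_sample.npy"


def sample_ids_array_filename_for_subset(subset: str) -> str:
    """Returns name of sample IDs array file for a subset of the training data."""
    return SAMPLE_IDS_ARRAY_FILENAME.replace(".npy", f"-{subset}.npy")


def inds_in_sample_array_filename_for_subset(subset: str) -> str:
    """Returns name of inds in sample array file for a subset of the training data."""
    return INDS_IN_SAMPLE_ARRAY_FILENAME.replace(".npy", f"-{subset}.npy")


def load_frames_checked(
    frames_path: str, input_type: str, loaders: dict[str, Callable[[str], Any]]
) -> Any:
    """Hypothetical variant of ``load_frames`` with an explicit error branch."""
    if input_type not in loaders:
        raise ValueError(
            f"``input_type`` must be one of: {sorted(loaders)}\n"
            f"Value for ``input_type`` was: {input_type}"
        )
    return loaders[input_type](frames_path)


subset = "train-dur-30.0-replicate-1"
print(sample_ids_array_filename_for_subset(subset))
# sample_ids-train-dur-30.0-replicate-1.npy
print(inds_in_sample_array_filename_for_subset(subset))
# inds_in_sample-train-dur-30.0-replicate-1.npy

# Stand-in loaders that only label the path; real loaders would read wav or npz files.
loaders = {
    "audio": lambda path: ("audio", path),
    "spect": lambda path: ("spect", path),
}
print(load_frames_checked("train/a.spect.npz", "spect", loaders))
# ('spect', 'train/a.spect.npz')
```

Both subset filenames sit next to the unsuffixed `sample_ids.npy` and `inds_in_sample.npy` in the split directory, which is why `from_dataset_path` only has to switch the filename when a subset is given.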