TDAmeritrade · joehiggi1758 · Oct 30, 2024 · Oct 30, 2024
@@ -0,0 +1,299 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Introduction\n",
+    "### Motif-Only Matrix Profile (MOMP): A Faster Approach to Motif Discovery in Time Series\n",
+    "#### *By Joey Higgins*\n",
+    "\n",
+    "In this tutorial, we will walk through the Motif-Only Matrix Profile (MOMP), a technique for time series motif discovery proposed in [Motif Only Matrix Profile](https://www.dropbox.com/scl/fi/mt8vp7mdirng04v6llx6y/MOMP_DeskTop.pdf?rlkey=gt6u0egagurkmmqh2ga2ccz85&e=1&dl=0) (Shahcheraghi et al., 2024). MOMP downsamples the time series, computes lower bounds on subsequence distances, prunes subsequences that cannot contain the best motif, and refines the remaining candidates with exact distances. The paper reports that this is orders of magnitude faster than traditional Matrix Profile algorithms.\n",
+    "\n",
+    "Ultimately, we will walk through the MOMP algorithm with enhancements such as the K-Triangular Inequality Profile (KTIP) and multiresolution pruning. We will also test the performance of MOMP on real-world datasets and compare it with other matrix profile algorithms like STOMP.\n",
+    "\n",
+    "### Objectives:\n",
+    "1. Understand how to compute the Lower Bound Matrix Profile (lbMP) and KTIP for aggressive pruning.\n",
+    "2. Implement multiresolution pruning to refine motif search from coarse to fine resolution.\n",
+    "3. Refine the best-so-far (bsf) motif distance with exact distance calculations and cohort point adjustments.\n",
+    "4. Run performance comparisons on real-world datasets."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Table of Contents\n",
+    "\n",
+    "1. [Introduction](#introduction)\n",
+    "2. [Definitions](#definitions)\n",
+    "3. [MOMP Algorithm Overview](#momp-algorithm-overview)\n",
+    "4. [Key Steps in MOMP](#key-steps-in-momp)\n",
+    "   - [Step 1: Downsampling](#step-1-downsampling)\n",
+    "   - [Step 2: Lower Bound Matrix Profile (lbMP) and KTIP](#step-2-lower-bound-matrix-profile-and-ktip)\n",
+    "   - [Step 3: Multiresolution Pruning](#step-3-multiresolution-pruning)\n",
+    "   - [Step 4: Refining the Best-So-Far (bsf) Motif](#step-4-refining-the-best-so-far-motif)\n",
+    "   - [Step 5: Final Exact Matrix Profile Calculation](#step-5-final-exact-matrix-profile-calculation)\n",
+    "5. [Performance Comparisons](#performance-comparisons)\n",
+    "6. [Real-World Dataset Testing](#real-world-dataset-testing)\n",
+    "7. [Conclusion](#conclusion)\n",
+    "8. [References](#references)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Getting Started\n",
+    "Importing all required packages"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import stumpy\n",
+    "import math\n",
+    "\n",
+    "np.set_printoptions(linewidth=100)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## MOMP Algorithm Overview\n",
+    "\n",
+    "Motif-Only Matrix Profile (MOMP) improves traditional motif discovery by aggressively pruning subsequences using the **Lower Bound Matrix Profile (lbMP)** and the **K-Triangular Inequality Profile (KTIP)**. Starting with a coarse downsampling rate, the algorithm performs multiresolution pruning, gradually refining the motif search and recalculating the exact matrix profile for unpruned subsequences at the final stage.\n",
+    "\n",
+    "### Key Enhancements:\n",
+    "- **Lower Bound Matrix Profile (lbMP)**: The lbMP stores rough estimates of subsequence distances, allowing for pruning.\n",
+    "- **K-Triangular Inequality Profile (KTIP)**: KTIP leverages the triangular inequality to refine subsequence distance estimates and prune unpromising pairs.\n",
+    "- **Multiresolution Pruning**: The motif search begins with coarse approximations and progressively increases resolution to focus on promising subsequences.\n",
+    "- **Cohort Points**: These are anchor points used in the final motif refinement stage to ensure local subsequences are correctly aligned."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Definitions\n",
+    "\n",
+    "Before we dive into the implementation, let’s define some key terms that will help you understand the MOMP process:\n",
+    "\n",
+    "- **Best-So-Far (bsf)**: The smallest distance between any two subsequences that has been found so far. As the algorithm progresses, the bsf is updated whenever a smaller distance is discovered.\n",
+    "- **Cohort Points**: Cohort points are the anchor subsequences that help refine the best-so-far (bsf) motif distance during the final stages of the algorithm.\n",
+    "- **Downsampling**: The process of reducing the resolution of the time series by averaging over groups of data points. Downsampling speeds up initial calculations by working with a coarser representation of the time series.\n",
+    "- **dsr**: Downsampling Rate, or the factor by which the time series is reduced. For example, a dsr of 2 means that every two points in the original time series are averaged into one point.\n",
+    "- **Lower Bound**: A rough estimate of the minimum possible distance between subsequences, computed using the downsampled time series. Lower bounds are used to quickly prune unpromising subsequences before calculating the exact distance.\n",
+    "- **lbMP**: Lower Bound Matrix Profile, which stores the lower bound distances between subsequences in the time series. It helps in identifying which subsequences can be pruned.\n",
+    "- **Matrix Profile (MP)**: A data structure that stores the z-normalized Euclidean distance between each subsequence in a time series and its nearest neighbor. The MP is used to efficiently identify motifs in the data.\n",
+    "- **Motif**: A repeating pattern in a time series that occurs at least twice. Motifs are subsequences with minimal Euclidean distances between them.\n",
+    "- **Multiresolution Pruning**: This refers to the process of starting the motif search at a coarse downsampling rate, pruning subsequences based on lower bounds, and iteratively refining the search at finer resolutions.\n",
+    "- **Pruning**: The process of eliminating subsequences that cannot possibly be motifs based on their lower bound distance. If the lower bound of a subsequence's distance is already greater than the current bsf, it is pruned.\n",
+    "\n",
+    "These concepts are essential for understanding how MOMP works."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 1: Downsampling the Time Series\n",
+    "\n",
+    "To start, we reduce the resolution of the time series by averaging groups of points. This allows us to quickly approximate the distances between subsequences."
+   ]
+  },
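+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The cell below is a minimal sketch of this averaging step, not the paper's implementation. It assumes non-overlapping windows of `dsr` points and drops any trailing points that do not fill a whole window. The helper name `downsample` is hypothetical and is not part of STUMPY."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "def downsample(T, dsr):\n",
+    "    \"\"\"Average each non-overlapping block of `dsr` points (hypothetical helper).\"\"\"\n",
+    "    T = np.asarray(T, dtype=float)\n",
+    "    n_blocks = len(T) // dsr  # assumption: leftover tail points are dropped\n",
+    "    return T[: n_blocks * dsr].reshape(n_blocks, dsr).mean(axis=1)\n",
+    "\n",
+    "T_demo = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])\n",
+    "T_coarse = downsample(T_demo, 2)  # every pair of points becomes one average\n",
+    "print(T_coarse)"
+   ]
+  },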
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 2: Computing Lower Bound Approximation and K-Triangular Inequality Profile (KTIP)\n",
+    "\n",
+    "The next step is to calculate the lower bound distances between subsequences in the downsampled time series using the K-Triangular Inequality Profile (KTIP) algorithm. KTIP computes a matrix of lower bound distances at various downsampling rates, leveraging powers of 2 to capture increasingly accurate estimates with minimal computation. These lower bounds help us quickly identify which parts of the time series are likely irrelevant by providing a fast approximation of distances. This allows us to efficiently \"prune\" or ignore segments that are unlikely to contain the best matches, focusing our search on the most promising regions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "KTIP Matrix:\n",
+      " [[1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]]\n"
+     ]
+    }
+   ],
+   "source": [
+    "def computeKTIP(T, m, dsr0):\n",
+    "    \"\"\"\n",
+    "    Compute the K-Triangular Inequality Profile (KTIP).\n",
+    "    \n",
+    "    Parameters:\n",
+    "    - T: Input time series (array-like)\n",
+    "    - m: Subsequence length (integer)\n",
+    "    - dsr0: Initial downsampling rate (integer)\n",
+    "    \n",
+    "    Returns:\n",
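+    "    - ktip: Array of shape (n - m + 1, log2(dsr0) + 1). Column k holds, for each subsequence,\n",
+    "      the smallest distance to any subsequence at most 2**k positions away.\n",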
+    "    \"\"\"\n",
+    "\n",
+    "    n = len(T)\n",
+    "    num_diags = int(math.log2(dsr0)) + 1  # Number of diagonal levels based on dsr0\n",
+    "    ktip = np.full((n - m + 1, num_diags), np.nan)  # Initialize ktip with NaN values\n",
+    "    temp = np.full((n - m + 1), np.inf)  # Initialize temp with infinity values\n",
+    "    \n",
+    "    for diag in range(1, dsr0 + 1):\n",
+    "        for rr in range(n - m + 1 - diag):  # pairs exactly `diag` positions apart\n",
+    "            cc = rr + diag\n",
+    "            # Plain Euclidean distance (no z-normalization is applied here)\n",
+    "            dist = np.sqrt(np.sum((T[rr:rr + m] - T[cc:cc + m]) ** 2))\n",
+    "            \n",
+    "            # Update temp for minimum distances\n",
+    "            if dist < temp[rr]:\n",
+    "                temp[rr] = dist\n",
+    "\n",
+    "            if dist < temp[cc]:\n",
+    "                temp[cc] = dist\n",
+    "        \n",
+    "        # Store temp values in ktip at log2(diag) positions if diag is a power of 2\n",
+    "        if math.log2(diag).is_integer():\n",
+    "            ktip[:, int(math.log2(diag))] = temp\n",
+    "    \n",
+    "    return ktip\n",
+    "\n",
+    "T = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # Sample time series\n",
+    "m = 3  # Subsequence length\n",
+    "dsr0 = 4  # Initial downsampling rate\n",
+    "\n",
+    "ktip_result = computeKTIP(T, m, dsr0)\n",
+    "print(\"KTIP Matrix:\\n\", ktip_result)\n",
+    "\n",
+    "# Basic checks\n",
+    "assert ktip_result.shape == (len(T) - m + 1, int(math.log2(dsr0)) + 1), \"Output dimensions incorrect.\"\n",
+    "assert not np.isnan(ktip_result).any(), \"Result contains NaN values.\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "KTIP Matrix:\n",
+      " [[1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]\n",
+      " [1.73205081 1.73205081 1.73205081]]\n",
+      "Test passed: KTIP matrix dimensions and basic values are valid.\n"
+     ]
+    }
+   ],
+   "source": [
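+    "def test_computeKTIP():\n",
+    "    T, m, dsr0 = np.arange(1, 11), 3, 4  # same sample series as above\n",
+    "    ktip_result = computeKTIP(T, m, dsr0)\n",
+    "    print(\"KTIP Matrix:\\n\", ktip_result)\n",
+    "    assert ktip_result.shape == (len(T) - m + 1, int(math.log2(dsr0)) + 1)\n",
+    "    assert not np.isnan(ktip_result).any()\n",
+    "    print(\"Test passed: KTIP matrix dimensions and basic values are valid.\")\n",
+    "\n",
+    "# Run the test\n",
+    "test_computeKTIP()"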
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 3: Pruning Irrelevant Data\n",
+    "\n",
+    "Using the lower bound distances, we prune every subsequence whose lower bound exceeds the current best-so-far (bsf) motif distance. The true distance to its nearest neighbor is at least as large as the lower bound, so such a subsequence cannot belong to the best motif."
+   ]
+  },
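+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The pruning rule can be sketched as a boolean mask over a lower bound profile. This is an illustration under stated assumptions rather than the paper's code: the helper name `prune_mask` is hypothetical, and the lower bound values are made up for the demo."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "def prune_mask(lb_profile, bsf):\n",
+    "    \"\"\"Return True for subsequences that survive pruning (hypothetical helper).\"\"\"\n",
+    "    lb_profile = np.asarray(lb_profile, dtype=float)\n",
+    "    # A lower bound above bsf means the exact distance is also above bsf\n",
+    "    return lb_profile <= bsf\n",
+    "\n",
+    "lb_demo = np.array([0.4, 2.5, 0.9, 3.1, 1.0])  # made-up lower bounds\n",
+    "keep = prune_mask(lb_demo, bsf=1.0)\n",
+    "print(\"Kept indices:\", np.flatnonzero(keep))"
+   ]
+  },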
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 4: Refining the Best-So-Far (bsf) Motif\n",
+    "\n",
+    "After pruning, we refine the best-so-far (bsf) by recalculating the exact distances between the remaining subsequences."
+   ]
+  },
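+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a rough illustration of the refinement step, the cell below runs a brute-force exact search over the subsequences that survived pruning. It is a stand-in, not the paper's procedure: cohort points are not modeled, the helper names `znorm` and `refine_bsf` are hypothetical, and the exclusion zone of `ceil(m / 4)` and the use of z-normalized Euclidean distance are assumptions made for this sketch."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "def znorm(x):\n",
+    "    \"\"\"Z-normalize a subsequence; constant subsequences map to all zeros.\"\"\"\n",
+    "    x = np.asarray(x, dtype=float)\n",
+    "    std = x.std()\n",
+    "    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)\n",
+    "\n",
+    "def refine_bsf(T, m, candidates, bsf=np.inf):\n",
+    "    \"\"\"Brute-force exact search over candidate start indices (hypothetical helper).\"\"\"\n",
+    "    best_pair = None\n",
+    "    excl = int(np.ceil(m / 4))  # assumption: exclusion zone to skip trivial matches\n",
+    "    cands = sorted(candidates)\n",
+    "    for a, i in enumerate(cands):\n",
+    "        for j in cands[a + 1:]:\n",
+    "            if j - i <= excl:\n",
+    "                continue  # near-overlapping subsequences are trivial matches\n",
+    "            d = np.linalg.norm(znorm(T[i:i + m]) - znorm(T[j:j + m]))\n",
+    "            if d < bsf:  # found a closer pair, so tighten bsf\n",
+    "                bsf, best_pair = d, (i, j)\n",
+    "    return bsf, best_pair\n",
+    "\n",
+    "T_motif = np.array([0, 1, 3, 1, 0, 5, 2, 7, 0, 1, 3, 1, 0], dtype=float)\n",
+    "m_motif = 5\n",
+    "survivors = range(len(T_motif) - m_motif + 1)  # pretend nothing was pruned\n",
+    "bsf_demo, pair_demo = refine_bsf(T_motif, m_motif, survivors)\n",
+    "print(\"bsf:\", bsf_demo, \"motif pair:\", pair_demo)"
+   ]
+  },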
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Implementation\n",
+    "\n",
+    "Let's combine all the steps into a single MOMP function. This function will downsample the data, compute the lower bounds, prune subsequences, and refine the best motif."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Conclusion\n",
+    "\n",
+    "In this tutorial, we explored how the Motif-Only Matrix Profile (MOMP) algorithm speeds up motif discovery by using downsampling and lower bounds to prune irrelevant subsequences. This approach makes motif discovery scalable even for very large time series."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## References\n",
+    "\n",
+    "Shahcheraghi, Maryam and Keogh, Eamonn et al. (2024) Matrix Profile XXXI: Motif-Only Matrix Profile: Orders of Magnitude Faster. ICDM: TBD. [Link](https://www.dropbox.com/scl/fi/mt8vp7mdirng04v6llx6y/MOMP_DeskTop.pdf?rlkey=gt6u0egagurkmmqh2ga2ccz85&e=1&dl=0)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "stumpy-env",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}