This repository contains several projects, each covering a different area of data mining: from data preprocessing and analysis to classification, clustering, and more.
Data preprocessing is a crucial skill for any data scientist or machine learning engineer. In this exercise, we explore data preprocessing using the popular libraries Pandas and Scikit-learn.
In a real-world data science project, data preprocessing plays a vital role in the success of a model. Properly preprocessed data and well-engineered features contribute to better model performance compared to poorly preprocessed data.
For this exercise, we use the Iris dataset. Each flower in the dataset is described by four features:
- Sepal length in cm
- Sepal width in cm
- Petal length in cm
- Petal width in cm
The following preprocessing steps have been performed:
- Removing NaN (Not a Number) data
- Converting categorical data to numerical data
- Normalization
- Principal Component Analysis (PCA)
- Removing NaN Data: We handled missing data by removing any instances with NaN values.
- Converting Categorical Data to Numerical Data: Categorical data was transformed into numerical values for further processing.
- Normalization: The data was normalized to bring all features to a similar scale, which helps in better model convergence.
- Principal Component Analysis (PCA): PCA was used to reduce the dimensionality of the data and retain essential information.
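A minimal sketch of these steps with Pandas and scikit-learn is shown below. The column handling, the choice of `StandardScaler`, and the number of PCA components are illustrative assumptions, not the exact settings used in the exercise:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA

# Load Iris into a DataFrame with the species name as a categorical column
iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
df = df.drop(columns="target")

# 1. Remove rows containing NaN values
df = df.dropna()

# 2. Convert the categorical species column to numerical labels
df["species"] = LabelEncoder().fit_transform(df["species"])

# 3. Normalize the four features to a similar scale
features = df.columns[:4]
df[features] = StandardScaler().fit_transform(df[features])

# 4. Reduce the four features to two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(df[features])
print(pca.explained_variance_ratio_)
```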
We have created box plots for each of the four features in the Iris dataset. Box plots give us an overview of the distribution and variability of the data for each feature.
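One way to produce these box plots with Pandas and Matplotlib (the 2x2 layout and figure size are only illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# One box plot per feature: median, quartiles, and outliers at a glance
iris = load_iris(as_frame=True)
iris.frame.iloc[:, :4].plot(kind="box", subplots=True, layout=(2, 2), figsize=(8, 6))
plt.tight_layout()
plt.show()
```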
In this exercise, we designed several neural networks for classification tasks and evaluated their performance. The networks were tested on the `make_circles` and `fashion_mnist` datasets. For the `fashion_mnist` dataset, we also plotted the confusion matrix.
This dataset consists of concentric circles of two classes, making it a non-linearly separable classification problem.
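A small dense network of the kind suited to this problem could look like the following sketch. Keras is assumed, and the layer sizes and training settings are illustrative rather than the exact configuration used here:

```python
import tensorflow as tf
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

# Two concentric circles with a little noise: a non-linearly separable problem
X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A small fully connected network; the non-linear activations let it
# learn the circular decision boundary
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=50, verbose=0)
print(model.evaluate(X_test, y_test))
```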
Fashion MNIST is a dataset of fashion items, containing 10 classes of clothing, each represented by a 28x28 grayscale image. The picture below shows the confusion matrix:
In this project, we focus on implementing algorithms related to clustering and association rule mining.
One of the simple and widely used clustering algorithms is the K-means algorithm. Our main goal in this exercise is not to implement the K-means algorithm from scratch, but rather to get familiar with it.
To determine the optimal value of K (number of clusters), we use the elbow method. The elbow method helps us find the "elbow point" in the plot, which indicates the best value of K where the distortion starts to level off.
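A typical elbow plot can be produced like this; the synthetic blobs and the range of K values are stand-ins for the actual data used in the exercise:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in dataset for illustration
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Distortion (inertia) for each candidate number of clusters
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Distortion (inertia)")
plt.show()
```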
For this task, we preprocess the digits dataset by reshaping each sample's 64 values into an 8x8 matrix, and we set the number of clusters to 10. Then, we calculate the centroids of each cluster.
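A sketch of this clustering with scikit-learn, reshaping each centroid back into an 8x8 image for display (the figure layout is illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()

# Cluster the 64-dimensional digit vectors into 10 groups
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42).fit(digits.data)

# Reshape each centroid back into an 8x8 image and display it
fig, axes = plt.subplots(2, 5, figsize=(8, 4))
for ax, center in zip(axes.ravel(), kmeans.cluster_centers_):
    ax.imshow(center.reshape(8, 8), cmap="gray")
    ax.axis("off")
plt.show()
```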
The resulting centroids accurately represent the labels of their corresponding clusters, as shown in the visualization below:
One of the practical applications of the K-means algorithm is image compression. In this exercise, we applied the K-means algorithm for image compression to reduce the size of an image.
The image of a bird was used for this experiment. The goal was to reduce the number of colors in the image while preserving its visual content. Here are the steps we followed for image compression:
- Convert the image from its RGB representation to a list of pixels, each with its corresponding RGB values.
- Apply the K-means clustering algorithm to group the colors into 4 clusters. The centroids of these clusters will represent the new colors.
- Replace each pixel's RGB values with the RGB values of the cluster's centroid to compress the image.
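These steps could be implemented roughly as follows; the image path `bird.png`, the assumption of a 3-channel RGB image, and the use of Matplotlib for image I/O are placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Step 1: load the image (assumed path) and flatten it to a list of RGB pixels
image = plt.imread("bird.png")[:, :, :3]  # keep only the RGB channels
pixels = image.reshape(-1, 3)

# Step 2: group the colors into 4 clusters; the centroids form the new palette
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(pixels)

# Step 3: replace every pixel with the centroid of its cluster
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)

plt.imshow(compressed)
plt.axis("off")
plt.show()
```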
The compressed image shows the bird with reduced colors. Even though the number of colors is reduced, the visual content of the image is still preserved.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is another clustering algorithm commonly used for clustering data. It is capable of clustering data with different shapes and can also identify noise and anomalies present in the data.
In this section, we will use the DBSCAN algorithm to cluster the two datasets mentioned in the "Complex Clustering" section.
The DBSCAN algorithm does not require specifying the number of clusters beforehand, which makes it more flexible for various datasets. The main parameters of this algorithm are:
- Epsilon (eps): The maximum distance between two data points for them to be considered in the same neighborhood.
- MinPts: The minimum number of data points that must lie within epsilon of a point for it to be considered a core point, i.e., for a dense region (cluster) to form.
Finding the optimal values for epsilon and MinPts is essential for obtaining accurate clustering results. For epsilon, we can try different values and evaluate their impact on the clustering quality. Additionally, we can use the KNN (K-Nearest Neighbors) distance method to automatically determine the appropriate epsilon value based on the average distance to k-nearest neighbors.
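One common form of this is a sorted k-distance plot, sketched below. The `make_moons` data is only a stand-in for the datasets from the "Complex Clustering" section, and k is typically set equal to the intended MinPts:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

# Stand-in dataset; replace with the datasets being clustered
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

k = 5  # usually chosen equal to the intended MinPts
neighbors = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = neighbors.kneighbors(X)

# Sorted distance to the k-th nearest neighbor; the "knee" of this curve
# suggests a reasonable value for epsilon
k_distances = np.sort(distances[:, -1])
plt.plot(k_distances)
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {k}-th nearest neighbor")
plt.show()
```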
Determining MinPts is not automated, but we can follow some general guidelines based on the characteristics of the dataset to find suitable values.
In this project, we will experiment with various MinPts values to observe the effect on the clustering results. We will plot the clustering output for different MinPts values and analyze the clusters' quality based on our dataset's specific features.
Remember that different MinPts values might lead to different clustering outcomes, and it's crucial to choose the one that best fits the dataset and the desired level of clustering granularity.
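A short loop of the kind used for this comparison might look like the following; the epsilon value and the candidate MinPts values are placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Stand-in dataset for illustration
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# Same epsilon, several MinPts values, clustering results plotted side by side
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, min_pts in zip(axes, [3, 5, 10]):
    labels = DBSCAN(eps=0.1, min_samples=min_pts).fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=10)
    ax.set_title(f"MinPts = {min_pts}")
plt.show()
```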
Association rule mining is a technique that reveals the relationships and mutual dependencies among a large set of data items.
A common example of association rule mining is "Market Basket Analysis." In this process, based on the different items placed in customers' shopping carts, the buying habits and patterns of customers are analyzed. By identifying the associations between products, we can discover recurring patterns during purchases.
Three important parameters:
- Support: It measures the popularity of an itemset based on its occurrence frequency in transactions.
- Confidence: It indicates the likelihood of buying item y if item x is purchased. (x -> y)
- Lift: The ratio of the rule's confidence to the support of y; it measures how much more often x and y occur together than would be expected if they were independent.
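In formula form, with N the total number of transactions:

$$
\mathrm{support}(x) = \frac{\mathrm{count}(x)}{N}, \qquad
\mathrm{confidence}(x \Rightarrow y) = \frac{\mathrm{support}(x \cup y)}{\mathrm{support}(x)}, \qquad
\mathrm{lift}(x \Rightarrow y) = \frac{\mathrm{confidence}(x \Rightarrow y)}{\mathrm{support}(y)}
$$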
In this exercise, we will implement the Apriori algorithm, one of the most popular and efficient algorithms for association rule mining.
Explaining Lift: As mentioned in the last section, a lift value greater than 1 means that the antecedent and consequent appear together more often than would be expected if they were independent; in other words, the occurrence of the antecedent raises the likelihood of the consequent. A lift value less than 1 indicates that they appear together less often than expected.
Apriori Algorithm: The Apriori algorithm works as follows: a minimum support threshold is set, and candidate itemsets are built up iteratively from smaller frequent itemsets. Any itemset whose support falls below the threshold is pruned, and since none of its supersets can be frequent either, they are discarded as well. This process continues until no further pruning is possible.
In this exercise, we will apply the Apriori algorithm to the Hypermarket_dataset, which contains purchase orders of individuals from grocery stores.
To start, we transform the dataset into a sparse matrix in which the purchased products form the columns and the order numbers form the index.
By applying the Apriori algorithm with a minimum support of 0.07, we will generate all the frequent patterns.
We have also written a function that, given the confidence and lift as inputs, displays the resulting association rules.
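A sketch of this pipeline with the `mlxtend` library is shown below. The file name `Hypermarket_dataset.csv`, the column names `Order` and `Product`, and the example thresholds are assumptions; only the minimum support of 0.07 comes from the description above:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Assumed layout: one row per (order, product) pair
df = pd.read_csv("Hypermarket_dataset.csv")

# Sparse one-hot matrix: order numbers as the index, products as columns
basket = (
    df.groupby(["Order", "Product"]).size().unstack(fill_value=0).astype(bool)
)

# Frequent itemsets with a minimum support of 0.07
frequent_itemsets = apriori(basket, min_support=0.07, use_colnames=True)

def rules_for(min_confidence, min_lift):
    """Return association rules filtered by confidence and lift thresholds."""
    rules = association_rules(frequent_itemsets, metric="confidence",
                              min_threshold=min_confidence)
    return rules[rules["lift"] >= min_lift]

print(rules_for(0.5, 1.0).head())
```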