Unsupervised Machine Learning

Unsupervised machine learning is a type of machine learning where the algorithm is given input data without explicit instructions on what to do with that data. In other words, the algorithm is not provided with labeled output or target values. The goal of unsupervised learning is to find patterns, relationships, or structures within the data without being explicitly told what to look for.


There are two main types of unsupervised learning:


1. Clustering: The algorithm tries to group similar data points together based on certain features or characteristics. The objective is to discover inherent structures within the data. K-means clustering and hierarchical clustering are common techniques used for this purpose.


2. Dimensionality Reduction: The algorithm aims to reduce the number of features in the data while retaining as much relevant information as possible. This is often done to simplify the dataset and improve the efficiency of subsequent processing. Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are examples of dimensionality reduction techniques.
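
As a quick illustration of the dimensionality-reduction branch, here is a minimal sketch, assuming a synthetic five-feature dataset (the data and parameter choices are illustrative, not part of the retail example that follows). It standardizes the features and uses scikit-learn's PCA to project them onto two components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: 100 samples with 5 correlated features (illustrative only)
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))                        # two underlying factors
X = np.hstack([base, base @ rng.normal(size=(2, 3))])   # three derived, correlated features
X += 0.1 * rng.normal(size=X.shape)                     # small amount of noise

# Standardize, then reduce 5 features down to 2 principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)      # share of variance captured by each component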


Unsupervised learning is particularly useful when dealing with large and complex datasets, as it can reveal hidden patterns or insights that may not be apparent through manual inspection. Applications of unsupervised learning include customer segmentation, anomaly detection, and feature extraction, among others.


Let's consider a real-world example of using k-means clustering for customer segmentation in a retail setting. In this scenario, we want to group customers based on their purchasing behavior.


**Step 1: Data Collection:**

Assume we have collected data on customers, including the number of purchases they made per month and the average amount they spent per purchase.


```
Customer   Purchases per Month   Average Spend per Purchase
1          3                     $50
2          1                     $200
3          5                     $30
4          6                     $25
5          2                     $150
6          4                     $40
7          8                     $20
8          7                     $15
9          1                     $180
10         3                     $60
```


**Step 2: Feature Scaling:**

It's common to scale the features to ensure that they contribute equally to the clustering process. For simplicity, let's assume the data is already scaled.
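
For concreteness, here is a minimal sketch of how the table above could be standardized with scikit-learn's StandardScaler so that both features contribute on a comparable scale (the column names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The ten customers from Step 1
customers = pd.DataFrame({
    'PurchasesPerMonth': [3, 1, 5, 6, 2, 4, 8, 7, 1, 3],
    'AvgSpendPerPurchase': [50, 200, 30, 25, 150, 40, 20, 15, 180, 60],
})

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(customers)

print(scaled.round(2))  # each column now has mean 0 and standard deviation 1
```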


**Step 3: Apply K-Means:**

Now, let's apply the k-means algorithm. For this example, we set k = 3, meaning we want to group the customers into three segments. The steps below are also sketched in code after the list.


- **Initialization:** Randomly select three initial centroids.

- **Assignment:** Assign each customer to the cluster with the nearest centroid based on their purchasing behavior.

- **Update:** Recalculate the centroids based on the mean of the data points in each cluster.

- **Repeat:** Iterate the assignment and update steps until convergence.
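
For intuition, here is a minimal NumPy sketch of these four steps applied to the ten customers above. It is a bare-bones illustration under simplifying assumptions (a fixed number of iterations, no empty-cluster handling); the fuller examples later in the section use scikit-learn's KMeans instead.

```python
import numpy as np

# Customer data: [purchases per month, average spend], standardized as in Step 2
X = np.array([[3, 50], [1, 200], [5, 30], [6, 25], [2, 150],
              [4, 40], [8, 20], [7, 15], [1, 180], [3, 60]], dtype=float)
X = (X - X.mean(axis=0)) / X.std(axis=0)

k = 3
rng = np.random.default_rng(0)

# Initialization: pick k random data points as the starting centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(10):  # Repeat: iterate assignment and update
    # Assignment: each point goes to the cluster with the nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update: each centroid moves to the mean of its assigned points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(labels)  # cluster index assigned to each of the ten customers
```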


After several iterations, the algorithm might converge to the following clusters:


- **Cluster 1:**
  - Customers: 3, 4, 7, 8
  - Characteristics: High purchase frequency, low average spend

- **Cluster 2:**
  - Customers: 2, 5, 9
  - Characteristics: Low purchase frequency, high average spend

- **Cluster 3:**
  - Customers: 1, 6, 10
  - Characteristics: Moderate purchase frequency, moderate average spend


**Step 4: Interpretation:**

Now, we can interpret the clusters. For example:

- Cluster 2 represents high-value customers who make few purchases but spend a lot on each one.

- Cluster 1 represents customers who buy frequently but spend little per purchase.

- Cluster 3 represents customers with moderate frequency and moderate spending.


These customer segments can then be used for targeted marketing strategies or personalized recommendations based on the preferences of each segment.

The same workflow can be run end-to-end in Python. The script below generates a hypothetical customer dataset, standardizes it, clusters it with scikit-learn's KMeans, and plots the result:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Generate a hypothetical customer dataset
data = {
    'TotalAmountSpent': np.random.randint(100, 1000, 100),
    'NumberOfPurchases': np.random.randint(1, 20, 100),
    'FrequencyOfPurchases': np.random.randint(1, 10, 100),
    'AverageOrderValue': np.random.uniform(20, 200, 100),
    'TimeSinceLastPurchase': np.random.randint(1, 365, 100)
}

df = pd.DataFrame(data)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Apply k-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df['Cluster'] = kmeans.fit_predict(scaled_data)

# Visualize the clusters
plt.scatter(df['TotalAmountSpent'], df['NumberOfPurchases'], c=df['Cluster'], cmap='viridis', edgecolors='k', s=50)
plt.xlabel('Total Amount Spent')
plt.ylabel('Number of Purchases')
plt.title('Customer Segmentation using K-Means Clustering')
plt.show()
```




A second example applies the same steps to a small weather dataset:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Creating a hypothetical weather report dataset
data = {
    'City': ['City1', 'City2', 'City3', 'City4', 'City5'],
    'Temperature (C)': [25.0, 22.5, 30.0, 18.0, 28.5],
    'Humidity (%)': [60, 45, 70, 30, 80],
    'Precipitation (mm)': [5.0, 0.0, 15.0, 2.0, 10.0],
}

weather_df = pd.DataFrame(data)

# Selecting numeric columns for clustering
numeric_columns = weather_df[['Temperature (C)', 'Humidity (%)', 'Precipitation (mm)']]

# Standardize the data
scaler = StandardScaler()
numeric_columns_scaled = scaler.fit_transform(numeric_columns)

# Choose the number of clusters (k)
k = 3

# Apply k-means clustering
kmeans = KMeans(n_clusters=k, random_state=42)
clusters = kmeans.fit_predict(numeric_columns_scaled)

# Add cluster labels to the DataFrame
weather_df['Cluster'] = clusters

# Display the DataFrame with cluster labels
print(weather_df)

# Visualize the clusters
plt.scatter(weather_df['Temperature (C)'], weather_df['Humidity (%)'], c=clusters, cmap='viridis')
centers = scaler.inverse_transform(kmeans.cluster_centers_)  # map centroids back to original units
plt.scatter(centers[:, 0], centers[:, 1], s=300, c='red', marker='X')
plt.xlabel('Temperature (C)')
plt.ylabel('Humidity (%)')
plt.title('K-Means Clustering of Weather Data')
plt.show()
```

This script builds a hypothetical weather dataset with temperature, humidity, and precipitation for different cities. It then standardizes the numeric columns, performs k-means clustering, and adds cluster labels to the DataFrame. Finally, it visualizes the clusters on a scatter plot, with the centroids transformed back to the original units.


A third example clusters customers by annual income and spending score:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical customer data: annual income and spending score
data = {
    'CustomerID': range(1, 42),
    'Annual Income (in $1000s)': [15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 
                                  19, 20, 20, 20, 20, 21, 21, 23, 23, 24, 
                                  24, 25, 25, 28, 28, 28, 28, 29, 29, 30, 
                                  30, 33, 33, 33, 33, 34, 34, 37, 37, 38, 
                                  38],
    'Spending Score (1-100)': [39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 
                                14, 99, 15, 77, 13, 79, 35, 29, 98, 35, 
                                73, 5, 73, 14, 82, 32, 61, 31, 87, 4, 
                                73, 4, 92, 14, 81, 17, 73, 26, 75, 35, 
                                92]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Selecting features for clustering
X = df[['Annual Income (in $1000s)', 'Spending Score (1-100)']]

# Number of clusters (K)
k = 5

# Apply K-means clustering
kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Add cluster labels to DataFrame
df['Cluster'] = y_kmeans

# Visualizing the clusters
plt.figure(figsize=(10, 6))

# Plot each cluster
for cluster in range(k):
    plt.scatter(X[df['Cluster'] == cluster]['Annual Income (in $1000s)'],
                X[df['Cluster'] == cluster]['Spending Score (1-100)'],
                label=f'Cluster {cluster + 1}',
                s=100)

# Plot centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='X', s=300, c='black', label='Centroids')

plt.title('Clusters of customers')
plt.xlabel('Annual Income ($1000s)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.grid(True)
plt.show()

# Display the DataFrame with cluster labels
print(df[['CustomerID', 'Annual Income (in $1000s)', 'Spending Score (1-100)', 'Cluster']])
```


