{"id":8259,"date":"2020-06-06T01:03:33","date_gmt":"2020-06-06T01:03:33","guid":{"rendered":"https:\/\/www.thecoderschool.com\/?p=8259"},"modified":"2022-10-13T20:49:56","modified_gmt":"2022-10-13T20:49:56","slug":"advanced-machine-learning-techniques-k-means-clustering-algorithm","status":"publish","type":"post","link":"https:\/\/www.thecoderschool.com\/blog\/advanced-machine-learning-techniques-k-means-clustering-algorithm\/","title":{"rendered":"Advanced Machine Learning Techniques: K-Means Clustering Algorithm"},"content":{"rendered":"<p><strong>By Camille D., Age 17<\/strong><\/p>\n<p><b>K-means clustering<\/b><span style=\"font-weight: 400;\"> is one of the most widely used and straightforward machine learning algorithms. Its goal is to find groups or <\/span><b>clusters<\/b><span style=\"font-weight: 400;\"> within a set of data, the number of which is denoted by <\/span><i><span style=\"font-weight: 400;\">k<\/span><\/i><span style=\"font-weight: 400;\">. It is an iterative algorithm that partitions a dataset into subgroups such that the subgroups don\u2019t overlap and each data point only has only one unique subgroup. Each cluster has a <\/span><b>centroid<\/b><span style=\"font-weight: 400;\">, or its own arithmetic mean of all of its data points. Centroids can be used to label new data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\"> K-means clustering allows the user to find groups that have been created organically, rather than groups defined before the data comes up, making k-means an <\/span><b>unsupervised machine-learning algorithm<\/b><span style=\"font-weight: 400;\">. The value of <\/span><i><span style=\"font-weight: 400;\">k<\/span><\/i><span style=\"font-weight: 400;\"> itself, however, must be pre-defined.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-8272 aligncenter\" src=\"https:\/\/www.thecoderschool.com\/wp-content\/uploads\/2019\/06\/image1-300x146.png\" alt=\"\" width=\"300\" height=\"146\" srcset=\"https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image1-300x146.png 300w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image1-450x219.png 450w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image1.png 478w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Some applications of k-means clustering can be defining personalities based on interests, determining characteristics of images based on stages or categories of a disease, and dividing by activities on a mobile application or a website.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The two inputs to the algorithm are the value for <\/span><i><span style=\"font-weight: 400;\">k<\/span><\/i><span style=\"font-weight: 400;\"> and the dataset. Centroids are first initialized by shuffling the dataset and then randomly selecting <\/span><i><span style=\"font-weight: 400;\">k<\/span><\/i><span style=\"font-weight: 400;\"> data points to be centroids for each cluster. The algorithm then follows two iterative steps:<\/span><\/p>\n<p>1. Data, centroid assignment<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-8274 aligncenter\" src=\"https:\/\/www.thecoderschool.com\/wp-content\/uploads\/2019\/06\/image2-300x87.png\" alt=\"\" width=\"300\" height=\"87\" srcset=\"https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image2-300x87.png 300w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image2-450x131.png 450w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image2.png 566w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p>2. Centroid updating<\/p>\n<ol>\n<li style=\"font-weight: 400; list-style-type: none;\"><\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-8275 aligncenter\" src=\"https:\/\/www.thecoderschool.com\/wp-content\/uploads\/2019\/06\/image3-300x64.png\" alt=\"\" width=\"300\" height=\"64\" srcset=\"https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image3-300x64.png 300w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image3-450x96.png 450w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image3.png 570w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Steps one and two are repeated until a terminal case is met (there is no change to the centroids, assignments of data points to clusters isn\u2019t changing).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Something to note about choosing the value of <\/span><i><span style=\"font-weight: 400;\">k<\/span><\/i><span style=\"font-weight: 400;\"> is that we can use something called the <\/span><b>Elbow Method<\/b><span style=\"font-weight: 400;\">. For example, start by running k-means on a dataset for a range of values of <\/span><i><span style=\"font-weight: 400;\">k<\/span><\/i><span style=\"font-weight: 400;\"> (i.e. 2 to 10), and for each value of <\/span><i><span style=\"font-weight: 400;\">k<\/span><\/i><span style=\"font-weight: 400;\">, calculate the <\/span><a href=\"https:\/\/hlab.stanford.edu\/brian\/error_sum_of_squares.html\"><span style=\"font-weight: 400;\">sum of squared error<\/span><\/a><span style=\"font-weight: 400;\"> (average within-cluster distance to centroid). If we were to plot the SSE for each value of <\/span><i><span style=\"font-weight: 400;\">k<\/span><\/i><span style=\"font-weight: 400;\"> on a line plot, and we detect an inflection point, we would use the value of <\/span><i><span style=\"font-weight: 400;\">k<\/span><\/i><span style=\"font-weight: 400;\"> at which that inflection point occurs. Here is an example of such a plot, showing the ideal value for <\/span><i><span style=\"font-weight: 400;\">k<\/span><\/i><span style=\"font-weight: 400;\">:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-8294 aligncenter\" src=\"https:\/\/www.thecoderschool.com\/wp-content\/uploads\/2019\/06\/image4-300x164.png\" alt=\"\" width=\"300\" height=\"164\" srcset=\"https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image4-300x164.png 300w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image4-450x246.png 450w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image4.png 483w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">An example of k-means is as follows: suppose we have four objects that each have two attributes, and we want to group these objects into <\/span><i><span style=\"font-weight: 400;\">k = 2<\/span><\/i><span style=\"font-weight: 400;\"> clusters based on the two attributes.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-8295 aligncenter\" src=\"https:\/\/www.thecoderschool.com\/wp-content\/uploads\/2019\/06\/image5-300x188.png\" alt=\"\" width=\"300\" height=\"188\" srcset=\"https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image5-300x188.png 300w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image5.png 342w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">We can represent these as coordinates in an attribute space below:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-8296 aligncenter\" src=\"https:\/\/www.thecoderschool.com\/wp-content\/uploads\/2019\/06\/image6-300x249.png\" alt=\"\" width=\"300\" height=\"249\" srcset=\"https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image6-300x249.png 300w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image6.png 367w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Suppose we want to use Object A and B as the centroids. Thus, the coordinates of the centroids are (2, 3) and (1, 4).<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-8297 aligncenter\" src=\"https:\/\/www.thecoderschool.com\/wp-content\/uploads\/2019\/06\/image7-300x260.png\" alt=\"\" width=\"300\" height=\"260\" srcset=\"https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image7-300x260.png 300w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image7.png 353w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Next, use Euclidean distance to calculate the cluster centroid to each individual object. We will obtain a distance matrix at iteration 0:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-8298 aligncenter\" src=\"https:\/\/www.thecoderschool.com\/wp-content\/uploads\/2019\/06\/image8-300x57.png\" alt=\"\" width=\"300\" height=\"57\" srcset=\"https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image8-300x57.png 300w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image8-450x85.png 450w, https:\/\/www.thecoderschool.com\/blog\/wp-content\/uploads\/2019\/06\/image8.png 458w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Each column of this matrix represents the objects successively. The first row represents that object\u2019s distance to Object A on the plot, and the second represents that object\u2019s distance to Object B on the plot. Next, we do the actual clustering based on the minimum distance. We will define two clusters for each centroid as <\/span><span style=\"font-weight: 400;\">C<\/span><span style=\"font-weight: 400;\">1<\/span><span style=\"font-weight: 400;\"> and <\/span><span style=\"font-weight: 400;\">C<\/span><span style=\"font-weight: 400;\">2<\/span><span style=\"font-weight: 400;\">. For example, since Object A\u2019s distance to itself is obviously smaller than Object B\u2019s distance to Object A, Object A will belong to the <\/span><span style=\"font-weight: 400;\">C<\/span><span style=\"font-weight: 400;\">1<\/span><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Let\u2019s define a new Group matrix, where the columns again correspond to each object, but their values will be assigned to either 0 or 1 based on whether they are in <\/span><span style=\"font-weight: 400;\">C<\/span><span style=\"font-weight: 400;\">1<\/span><span style=\"font-weight: 400;\">or <\/span><span style=\"font-weight: 400;\">C<\/span><span style=\"font-weight: 400;\">2<\/span><span style=\"font-weight: 400;\">. The rows correspond to each cluster successively.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-8299 aligncenter\" src=\"https:\/\/www.thecoderschool.com\/wp-content\/uploads\/2019\/06\/image9.png\" alt=\"\" width=\"213\" height=\"72\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Now that we have assigned data points to their clusters, we can compute the new centroids. Since <\/span><span style=\"font-weight: 400;\">C<\/span><span style=\"font-weight: 400;\">2<\/span><span style=\"font-weight: 400;\"> has only one member, which is Object B (1, 4) itself, the centroid will remain (1, 4). <\/span><span style=\"font-weight: 400;\">C<\/span><span style=\"font-weight: 400;\">1<\/span><span style=\"font-weight: 400;\">, however, has three members, so its new centroid will be the average coordinate of its three members: <\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-8300 aligncenter\" src=\"https:\/\/www.thecoderschool.com\/wp-content\/uploads\/2019\/06\/image10.png\" alt=\"\" width=\"204\" height=\"36\" \/><\/p>\n<p><span style=\"font-weight: 400;\">We continue these two steps until the terminal case is reached, which, in this case, occurs when the number of elements in <\/span><span style=\"font-weight: 400;\">C<\/span><span style=\"font-weight: 400;\">1<\/span><span style=\"font-weight: 400;\"> is equal to the number of elements in <\/span><span style=\"font-weight: 400;\">C<\/span><span style=\"font-weight: 400;\">2<\/span><span style=\"font-weight: 400;\">.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Camille D., Age 17 K-means clustering is one of the most widely used and straightforward machine learning algorithms. Its goal is to find groups or clusters within a set of data, the number of which is denoted by k. It is an iterative algorithm that partitions a dataset into subgroups such that the subgroups &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.thecoderschool.com\/blog\/advanced-machine-learning-techniques-k-means-clustering-algorithm\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Advanced Machine Learning Techniques: K-Means Clustering Algorithm&#8221;<\/span><\/a><\/p>\n","protected":false},"author":6,"featured_media":8272,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[64,69],"class_list":["post-8259","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-coder-blog","tag-about-thecoderschool","tag-student-showcase","entry"],"_links":{"self":[{"href":"https:\/\/www.thecoderschool.com\/blog\/wp-json\/wp\/v2\/posts\/8259","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.thecoderschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.thecoderschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.thecoderschool.com\/blog\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/www.thecoderschool.com\/blog\/wp-json\/wp\/v2\/comments?post=8259"}],"version-history":[{"count":1,"href":"https:\/\/www.thecoderschool.com\/blog\/wp-json\/wp\/v2\/posts\/8259\/revisions"}],"predecessor-version":[{"id":11831,"href":"https:\/\/www.thecoderschool.com\/blog\/wp-json\/wp\/v2\/posts\/8259\/revisions\/11831"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.thecoderschool.com\/blog\/wp-json\/wp\/v2\/media\/8272"}],"wp:attachment":[{"href":"https:\/\/www.thecoderschool.com\/blog\/wp-json\/wp\/v2\/media?parent=8259"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.thecoderschool.com\/blog\/wp-json\/wp\/v2\/categories?post=8259"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.thecoderschool.com\/blog\/wp-json\/wp\/v2\/tags?post=8259"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}