## Advanced Machine Learning Techniques: Principal Component Analysis

By Camille D., Age 17

This article focuses on principal component analysis (PCA), a method data scientists and programmers use to make data easier to explore, visualize, and interpret. The explanations assume some background in linear algebra and statistics.

PCA is based on dimensionality reduction: “the process of reducing the number of random variables under consideration by obtaining a set of principal variables,” in other words, transforming a large dataset into a smaller one without losing too much key information. The reduction comes at a price: a little accuracy must be traded for simplicity. Keeping that loss as small as possible is part of the job for PCA.

The first step of PCA is standardization, the least mathematically involved part of the process. Standardization deals with differences in the ranges of the initial variables. For example, one variable may take values between 0 and 10, and another between 0 and 1. The variable with the larger range also has the larger variance, so it would carry more weight than the second variable and bias the results. Mathematically, this is addressed by subtracting the variable’s mean from each of its values and dividing the result by the variable’s standard deviation.

After standardization is performed, every variable is on the same scale, so each one contributes equally to the analysis.
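As a minimal sketch of the standardization step, the snippet below uses only Python's standard library. The two variables and their values are made up for illustration, and the choice of the population standard deviation (`pstdev`) is an assumption; the sample standard deviation would work just as well.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mu) / sigma for v in values]

# Two made-up variables on very different scales.
hours = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]  # ranges from 0 to 10
rate = [0.1, 0.3, 0.2, 0.6, 0.7, 0.9]    # ranges from 0 to 1

hours_z = standardize(hours)
rate_z = standardize(rate)
print([round(z, 2) for z in hours_z])
print([round(z, 2) for z in rate_z])
```

After this step both lists have a mean of 0 and a standard deviation of 1, so neither variable dominates just because of its units.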

Note that standardization is different from normalization in descriptive statistics. Normalization rescales the values into a range from 0 to 1, while standardization rescales the dataset to have a mean of 0 and a standard deviation of 1. Normalization is performed with the following equation:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Every value except the maximum ends up smaller than 1: the minimum maps to 0 and the maximum maps to exactly 1.

The second step, covariance matrix computation, is where things unfortunately begin to get more complicated. We first must understand the definition of covariance: “a measure of how much two random variables vary together.”

Covariance differs from correlation in that correlation describes how strongly two variables are related, while covariance indicates the extent to which two random variables change with one another. The values of covariance lie between -∞ and ∞, while the values of correlation lie between -1 and 1. Correlation is simply the covariance of the standardized variables.

Covariance matrix computation aims to investigate how the variables in the input dataset are related to one another. This is important because it helps detect redundant information that may come from a high correlation between two variables. The covariance matrix is an n × n matrix, where n is the number of dimensions, whose entries are all the possible pairwise covariances within the dataset. For example, for a two-dimensional dataset with variables x and y, the covariance matrix is as follows:

$$\begin{pmatrix} \mathrm{Cov}(x,x) & \mathrm{Cov}(x,y) \\ \mathrm{Cov}(y,x) & \mathrm{Cov}(y,y) \end{pmatrix}$$

A couple of notes:

• Cov(x,x) = Var(x), the variance of the initial variable.
• The Cov() operator is symmetric, so Cov(x,y) = Cov(y,x), and the upper and lower triangular portions of the matrix mirror each other.

The covariance matrix is simply a table that lists the covariances between all possible pairs of variables. The sign of each covariance tells us how the two variables are related. If the covariance is positive, the two variables tend to increase or decrease together. If the covariance is negative, one tends to increase as the other decreases.
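To make the covariance matrix concrete, here is a small sketch in plain Python. The data is made up, the helper names are hypothetical, and the population form of covariance (dividing by the number of samples) is an assumption.

```python
from statistics import mean

def covariance(a, b):
    """Population covariance of two equal-length lists."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def covariance_matrix(columns):
    """n x n matrix whose entry (i, j) is Cov(columns[i], columns[j])."""
    return [[covariance(ci, cj) for cj in columns] for ci in columns]

# Made-up data: y tends to rise with x, so Cov(x, y) should be positive.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

matrix = covariance_matrix([x, y])
for row in matrix:
    print(row)
```

The diagonal entries are the variances of x and y, and the two off-diagonal entries are equal, matching the notes above.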

The next step in PCA is identifying the principal components by computing the eigenvectors and eigenvalues of the covariance matrix. The number of principal components produced equals the number of dimensions in the dataset. Principal components are “combinations” or “mixtures” of the initial variables, constructed so that they are uncorrelated with one another and so that as much of the variability in the initial variables as possible is stored in the first component, with the succeeding components accounting for the remaining information, as shown in the example plot below for an 8-dimensional dataset:

This form helps significantly with dimensionality reduction because components carrying little to no information can be discarded while still retaining the information that describes the key relationships within the data. Consider the dataset below:

The first principal component line represents the direction of the highest variability in the data. Since the variability is largest along the first component, the information captured by the first component is also the largest. It’s the line along which the projections of the points are the most spread out. Equivalently, this line maximizes the average of the squared distances from the projected points to the origin. The second principal component line must be orthogonal to the first so that the principal components are completely uncorrelated.

We continue to calculate principal components until we have n of them, where n is the number of variables in the original dataset.

Going back to eigenvectors and eigenvalues, here are a couple of preliminary notes:

• Every eigenvector has its own corresponding eigenvalue.
• The number of eigenvectors and corresponding eigenvalues is equal to the number of dimensions/variables in the data.
• For a tutorial on how to calculate the eigenvalues and eigenvectors of a matrix: https://www.scss.tcd.ie/~dahyotr/CS1BA1/SolutionEigen.pdf.

The eigenvectors of the covariance matrix give the directions of the principal component axes, and the eigenvalue attached to each eigenvector gives the amount of variance carried by that principal component (PC). Ranking the eigenvectors by their eigenvalues, from highest to lowest, gives the PCs in order of significance. To get the percentage of the variance carried by each PC, divide its eigenvalue by the sum of all eigenvalues.
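For a two-variable dataset, the eigenvalues and eigenvectors can be worked out with a closed-form formula, which keeps the sketch below free of external libraries. The covariance matrix values are made up and the function name is hypothetical; for larger datasets a linear algebra library would normally do this step.

```python
import math

def eigen_symmetric_2x2(matrix):
    """Return [(eigenvalue, unit eigenvector), ...] for [[a, b], [b, d]], largest first."""
    (a, b), (_, d) = matrix
    if b == 0:
        # Already diagonal: the eigenvectors are the coordinate axes.
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
        return sorted(pairs, key=lambda pair: pair[0], reverse=True)
    center = (a + d) / 2
    spread = math.hypot((a - d) / 2, b)
    pairs = []
    for eigenvalue in (center + spread, center - spread):
        vx, vy = b, eigenvalue - a  # solves (A - eigenvalue * I) v = 0
        length = math.hypot(vx, vy)
        pairs.append((eigenvalue, (vx / length, vy / length)))
    return pairs

# Made-up covariance matrix for two variables.
covariance = [[2.0, 1.2], [1.2, 1.2]]
pairs = eigen_symmetric_2x2(covariance)
total = sum(eigenvalue for eigenvalue, _ in pairs)
for eigenvalue, vector in pairs:
    print(f"eigenvalue {eigenvalue:.3f}, direction {vector}, "
          f"share of variance {eigenvalue / total:.1%}")
```

The first pair in the list is the first principal component, and the shares of variance add up to 100%.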

Next, we have to decide whether to keep all of the components or discard the lesser ones (the ones with low eigenvalues). We form a matrix called the feature vector whose columns are the eigenvectors of the components we do keep. This is where dimensionality reduction actually happens: we end up with fewer components than we started with, and the number we started with is equal to the dimension of the original dataset.

Lastly, we use our feature vector to re-express our dataset. We want to put our data in terms of the axes given by the principal components instead of the original axes. We can do this by multiplying the transpose of the feature vector by the transpose of the standardized original dataset.
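The final multiplication can be sketched in a few lines of plain Python. Both the standardized dataset and the feature vector below are made up for illustration (the eigenvector was not computed from this data), and the helper functions are hypothetical.

```python
def transpose(matrix):
    """Swap the rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]

def matmul(left, right):
    """Multiply two list-of-lists matrices."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
            for row in left]

# Hypothetical standardized dataset: one row per sample, columns x and y.
data = [[-1.5, -1.0], [-0.5, -0.5], [0.5, 0.5], [1.5, 1.0]]

# Hypothetical feature vector that keeps only the first principal component.
# Its single column is a unit-length direction made up for this example.
feature_vector = [[0.8], [0.6]]

# FinalData = FeatureVector^T x Data^T
final_data = matmul(transpose(feature_vector), transpose(data))
print(final_data)
```

The result has one row per kept component and one column per sample, so four two-dimensional points have been reduced to four single numbers.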

## How Face Filters and Face Recognition Work

By Mira B., Age 14

Many people use social media apps such as Instagram or Snapchat, which have face filters for people to take and post pictures of themselves. But many people do not realize how these filters are created or the technology behind how they fit people’s faces almost perfectly. The technology behind face filters was originally developed by a Ukrainian company called Looksery, which used it to modify faces during video chats. Snapchat acquired Looksery in 2015 and built on its technology to create the kind of face filters seen in many social media apps today. The face detection step is commonly explained using the Viola-Jones algorithm, a classic face detection method published in 2001.

Creating face filters is more difficult than you may think, so I’ll break it down into five key steps:

The first step is face detection. The computer sees the image only as a grid of numbers, so the algorithm scans it looking for patterns of light and dark regions. This can include finding that the cheek is lighter than the eye or that the nose bridge is lighter than the surrounding areas. Once enough of these patterns are found together, a face can be picked out of the camera frame.
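The light-versus-dark comparison can be sketched with a summed-area table (an "integral image"), the trick Viola-Jones style detectors use to add up rectangles of pixels quickly. This is only an illustration on a made-up 4x4 grayscale patch with hypothetical function names, not the code any particular app uses.

```python
def integral_image(image):
    """Summed-area table with an extra row and column of zeros."""
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for r in range(height):
        for c in range(width):
            table[r + 1][c + 1] = (image[r][c] + table[r][c + 1]
                                   + table[r + 1][c] - table[r][c])
    return table

def region_sum(table, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])

def eye_cheek_feature(table, top, left, height, width):
    """Brightness of the lower (cheek) band minus the upper (eye) band."""
    half = height // 2
    eyes = region_sum(table, top, left, half, width)
    cheeks = region_sum(table, top + half, left, half, width)
    return cheeks - eyes

# Made-up patch: dark rows on top (eyes), bright rows below (cheeks).
patch = [
    [20, 30, 30, 20],
    [25, 35, 35, 25],
    [200, 210, 210, 200],
    [190, 205, 205, 190],
]
table = integral_image(patch)
score = eye_cheek_feature(table, top=0, left=0, height=4, width=4)
print(score)
```

A large positive score means the lower band is much brighter than the upper band, which is the kind of pattern a detector would count as evidence of a face.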

The second step is landmark extraction. Algorithms working on the 2D image locate facial features such as the chin, nose, and forehead.

The third step is face alignment. The coordinates of the landmarks on a person’s face are used to fit the filter properly to that particular face.
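One simple way to picture alignment is to use just two landmarks, the eyes, to decide how much to scale and rotate a filter. The sketch below assumes hypothetical pixel coordinates and a hypothetical filter template drawn with its eyes 40 pixels apart; real apps use many more landmarks.

```python
import math

def alignment_from_eyes(left_eye, right_eye, template_eye_distance):
    """Scale, rotation (degrees), and center needed to place a filter on a face."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    scale = math.hypot(dx, dy) / template_eye_distance
    angle = math.degrees(math.atan2(dy, dx))
    center = ((left_eye[0] + right_eye[0]) / 2,
              (left_eye[1] + right_eye[1]) / 2)
    return scale, angle, center

# Hypothetical landmark coordinates in pixels (image y grows downward).
scale, angle, center = alignment_from_eyes((100, 120), (180, 120),
                                           template_eye_distance=40)
print(scale, angle, center)
```

Here the eyes are level and twice as far apart as in the template, so the filter would be doubled in size and left unrotated.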

The fourth step is building a 3D mesh. Using the 2D image, a 3D model of the user’s face is built so the filter animation can be fitted to that specific face.

The last step is face tracking, which approximates and locates the 3D mask in real time. This allows the user to move their face without the filter disappearing or moving to an incorrect location.

Another way to think of these steps is to imagine a human body. The landmarks identified in a 2D image serve as the skeleton for the future mask. Just as bodies differ in shape, so do people’s face structures. Using face alignment, the filter is matched to the coordinates of the landmarks from a particular face. Skin gives a person their appearance, and the 3D mesh step is like fitting the skin over the skeleton. Similar to how bodies move while keeping the skeleton, skin, and muscle together, face tracking follows the face to make sure the filter stays on the right coordinates.

## The Julia Language – A New Pillar of Scientific Computing

By Camille D., Age 17

Developed starting in 2009 and made publicly available in 2012, Julia is one of the fastest-growing languages in the industry. It routinely appears in language popularity rankings, and it has the potential to outshine languages such as Python in the realm of computational science and general programming.

Julia was created by Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman, who shared a desire to unify the best features of all the big languages, from the “speed of C” to the “dynamism of Ruby.” It was crafted with flexibility and versatility in mind – the language boasts that its users can “write an algorithm … and apply it to an infinite lattice of types.”

No language is perfect; you are always making a trade-off when choosing a language to learn. A C++ learner will enjoy the language’s high speed, but will miss out on the straightforwardness and garbage collection of Java. There will never be such a thing as a language that solves every issue or is free of shortcomings. Nonetheless, programming languages have evolved rapidly, and Julia exemplifies how far they have come. Here are a few reasons to choose Julia.

Julia is fast and high-performing. Julia uses the LLVM compiler infrastructure to compile code efficiently to machine language for multiple platforms. In a statically typed, compiled language such as C, you must explicitly declare the types of the variables you use, so the compiler knows ahead of time exactly which machine instructions to generate, and the code executes quickly and efficiently. On the other hand, the CPU has no concept of the “variables” you use when writing in an interpreted language. The interpreter must work out at run time what each variable contains (for example, an int versus a float) before it can tell the CPU what to do, which forces the CPU to wait. This is what makes interpreted languages slow relative to compiled languages such as C. Julia falls somewhere in the middle of the spectrum. Its compiler does not need type information ahead of time; instead, when a function is called, it looks at the types of the arguments actually passed in and compiles fast, specialized CPU instructions for them on the spot.

Julia is packed with capabilities for data science and numerical computing. When using Julia, it is evident how closely conventional mathematics is bound to programming. The Julia REPL (a programming environment in which a user types a command and immediately sees the result) gives access to symbols often used in mathematics, including Greek letters and subscripts. The symbols are inserted by typing a backslash \, followed by a name corresponding to the character, and then pressing Tab. For example, typing “\Gamma” and pressing Tab inserts the Gamma symbol Γ.

A notable feature of Julia is function composition, written with the operator ∘. For example, writing (sqrt ∘ *)(5, 2) will multiply two numbers, 5 and 2, and then find the square root of the result. Julia also offers extensive external call support, and can link with a throng of languages including Python, Java, C++, and R. Python applications can call Julia through PyJulia, and R applications can call it through its interface, JuliaCall.
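For readers who do not have Julia installed, the idea behind composition can be mimicked in Python. This is only an analogy: `compose` is a hypothetical helper written for this example, not something Julia or Python provides under that name.

```python
import math
import operator

def compose(outer, inner):
    """Return a function that applies inner first, then outer to its result."""
    def composed(*args):
        return outer(inner(*args))
    return composed

sqrt_of_product = compose(math.sqrt, operator.mul)
result = sqrt_of_product(5, 2)  # same idea as (sqrt ∘ *)(5, 2) in Julia
print(result)
```

The composed function multiplies 5 and 2 first and only then takes the square root, which is the order the ∘ operator uses as well.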

Julia is versatile, which is the principal reason it is so far ahead in the game. It provides a wealth of tools and frameworks for deep learning, data visualization, and graphs, along with capabilities for clustering, trees, and generalized linear models. Beyond its capacity for mathematical work, Julia is also excellent for general programming: users can write UIs, statically compile code (even though the language is generally dynamic, meaning the types of variables aren’t known until runtime), and deploy it on a webserver.

## Web-Scraping and Its Uses

By Noah S., Age 16

Web-scraping is exactly what it sounds like: scraping the web for specific information chosen by the engineer. For example, instead of copying and pasting every book on a library website into a spreadsheet, a web-scraper can programmatically find every book title and store it in a list. The engineer can then turn this list into a spreadsheet. The result is the same, but without human errors and with a lot of time saved. As a result, web-scraping has many, many uses. For example, one could make a list of dog breeds, find a specific link that is hard to find within a large website, or even look up the release date of an upcoming Star Wars movie.

There is one big roadblock, however. Scraping Google, along with many other websites, can violate the terms of service that most companies enforce, and in some cases it can lead to legal trouble. After all, you are taking data that the company may have spent lots of money acquiring. There are few ways of getting around this issue, but the best way to stay out of trouble is to check a site’s terms before scraping it and to never release the information or use it to turn a profit. After all, why would a company be angry at an individual for making a list of cat videos?
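The book-title example can be sketched with Python's built-in HTML parser. The page content, the `h2` tag, and the `title` class name below are all made up for illustration; a real scraper would download the page from a site whose terms allow it and would need to match that site's actual markup.

```python
import csv
import io
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collects the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())

# Made-up page content standing in for a library website.
page = """
<html><body>
  <div class="book"><h2 class="title">Dune</h2><p>Frank Herbert</p></div>
  <div class="book"><h2 class="title">Emma</h2><p>Jane Austen</p></div>
</body></html>
"""

scraper = TitleScraper()
scraper.feed(page)

# Turn the list of titles into spreadsheet-style CSV text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title"])
writer.writerows([title] for title in scraper.titles)
print(buffer.getvalue())
```

The CSV text could be saved to a file and opened in any spreadsheet program.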

To conclude, web-scraping is a very versatile option for both lazy and efficient people to get ahold of lots of data without having to individually look up every single option and click every single link. A scraper can be written in many languages, including popular ones like Java and Python, making it a skill many programmers can pick up. While there are some roadblocks, if the user is careful about how they use web-scraping, it should result in lots of data obtained and even more time saved.

## Coding Battleship with Game Theory

By Noah S., Age 16

Game theory is not just the name of a YouTube channel. The channel exists, but its name comes from a complex field of study also called “game theory”. Game theory is the study of interactions between intelligent “things”. Whether this thing is a human, an AI, or some other kind of rational thinker is up to you. Game theory is most commonly used in political science and economics, although it also applies to topics like logic and computer science. This is what I researched while I made my Battleship game, which involves two intelligent things as well: the player (you) and the AI. This topic is very broad, so I will discuss some examples and try to tie them together at the end of this post.

One famous example in game theory is the “prisoner’s dilemma”. Two prisoners, A and B, are being interrogated for the same crime. If both prisoners rat each other out, they each get 5 years of prison. If A rats out B while B stays quiet, A is set free while B must serve 10 years, and vice versa. If they both stay quiet, however, they each only have to serve 2 years in prison. Even though both have a significantly better outcome if they both stay silent, game theory predicts that they are unlikely to cooperate. The reason is that ratting is the better choice for each prisoner no matter what the other one does: if B stays quiet, A goes free instead of serving 2 years, and if B rats, A serves 5 years instead of 10. The chance of being set free and the fear of the most severe punishment of 10 years both push each prisoner toward betrayal. In addition, mistrust of the other prisoner magnifies this effect.
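The reasoning can be checked with a tiny table of outcomes. The sentence lengths are taken from the example above; the dictionary layout and function name are just one hypothetical way to write it down.

```python
# Years in prison for (A, B), indexed by (A's choice, B's choice).
YEARS = {
    ("quiet", "quiet"): (2, 2),
    ("quiet", "rat"): (10, 0),
    ("rat", "quiet"): (0, 10),
    ("rat", "rat"): (5, 5),
}
CHOICES = ("quiet", "rat")

def best_response_for_a(b_choice):
    """A's choice that minimizes A's own sentence, given B's choice."""
    return min(CHOICES, key=lambda a_choice: YEARS[(a_choice, b_choice)][0])

for b_choice in CHOICES:
    print(f"If B chooses {b_choice}, A should choose {best_response_for_a(b_choice)}")
```

Whatever B does, A's best response is the same, and since the situation is symmetric the same holds for B.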

Another example is Battleship. I made a program that runs Battleship with a complex AI. The board layout is determined by a combination of statistics and game theory. When playing Battleship, one of the player’s primary goals is to aim for more open spaces to shorten the game and give themselves a better chance at winning. But where should a player aim to give themselves the largest chance of hitting a target? This is determined by game theory. While there is no way to play Battleship that guarantees a win, since the game is partly luck-based, there are spots you can fire at that give you a slight edge. For example, if you see a 4-space opening, you know that a carrier (5 slots) cannot fit there. If you see a space completely surrounded on all sides by misses, you know that there cannot be a ship there, and as a result that space may as well count as a miss. This is what I coded into the AI in my Battleship game to make it more human-like. By covering the board with the probabilities that each location harbors a ship (pun intended), based on the number of ships left and their types, a player (or AI) can use game theory to their advantage to win the game.
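One way such a probability map might be built is to count, for every cell, how many legal ship placements would cover it. The sketch below is not the code from my game; the board size, the misses, and the ship lengths are made up, and it only takes misses into account (not hits).

```python
def placement_counts(size, misses, ship_lengths):
    """Count, for every cell, how many legal ship placements cover it."""
    counts = [[0] * size for _ in range(size)]
    for length in ship_lengths:
        for row in range(size):
            for col in range(size):
                for d_row, d_col in ((0, 1), (1, 0)):  # horizontal, vertical
                    cells = [(row + d_row * i, col + d_col * i)
                             for i in range(length)]
                    fits = all(r < size and c < size and (r, c) not in misses
                               for r, c in cells)
                    if fits:
                        for r, c in cells:
                            counts[r][c] += 1
    return counts

# Made-up 5x5 board: the cell (2, 2) is surrounded by misses on all four sides.
misses = {(1, 2), (3, 2), (2, 1), (2, 3)}
counts = placement_counts(5, misses, ship_lengths=[2, 3])

# Fire at the untried cell that the most placements could cover.
best = max(((r, c) for r in range(5) for c in range(5) if (r, c) not in misses),
           key=lambda cell: counts[cell[0]][cell[1]])
print(counts[2][2], best)
```

The surrounded cell gets a count of zero, so the AI never wastes a shot on it, and a 5-slot ship on a board only 4 spaces wide gets no placements at all.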

To tie these together, game theory is a super broad field that has many different uses. Any interaction between intelligent beings comes back to this idea of game theory. Being able to mathematically predict the most likely and reasonable outcome is very useful in many occupations. I hope I can apply game theory to many different projects in the future. But for now, I will continue to improve the Battleship game I am making.

## How Does Google Maps Work?

By Noah S., Age 16

Hi! Today I’m going to be explaining graph theory, a branch of mathematics and computer science that shines when you want to know the shortest path from one thing to another. Some applications include finding the shortest route to a location, the smallest number of moves to win a chess game, and the fastest way to solve a Rubik’s cube. Overall, graph theory has a lot of potential when applied to large problems, and can be utilized in many amazing ways. Without further ado, let’s jump right in.

Let’s try to visualize the graphs used in graph theory. Imagine a bunch of points, with every point connected to one or more other points. The points are called nodes, and the lines connecting them are called edges. Nodes also appear in many other data structures, such as linked lists and trees. There are two basic types of graphs, so imagine these two scenarios. First, imagine the same points and connections as described above, with every line between the points unmarked. This is an undirected graph. In contrast, now imagine the same graph, but every line between the points has a direction, marked with an arrow. This is a directed graph, and the arrow shows which way one node connects to another. Undirected graphs use unmarked lines to indicate that information flows both ways, while directed graphs use arrows to indicate that information flows only one way.
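In code, a common way to store a graph is an adjacency list: a dictionary that maps each node to the nodes it connects to. The sketch below uses made-up node names and a hypothetical `build_graph` helper to show the difference between the two kinds of graph.

```python
def build_graph(edges, directed=False):
    """Adjacency list: maps each node to the set of nodes it connects to."""
    graph = {}
    for start, end in edges:
        graph.setdefault(start, set()).add(end)
        graph.setdefault(end, set())  # register the end node even if nothing leaves it
        if not directed:
            graph[end].add(start)     # undirected lines work in both directions
    return graph

edges = [("A", "B"), ("B", "C"), ("A", "C")]
undirected = build_graph(edges)
directed = build_graph(edges, directed=True)
print(undirected["C"])
print(directed["C"])
```

In the undirected version C can reach both A and B, while in the directed version the arrows only point into C, so nothing can be reached from it.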

Now that we understand what kinds of graphs exist, let’s discuss a few more terms. A simple graph is one in which no node has a line back to itself (a loop) and no two nodes are joined by more than one line. Most graphs used in graph theory are simple graphs. Graphs that break either rule are non-simple graphs. Note that a loop is different from a cycle: three nodes each pointing to the next to form a triangle make a cycle, and such a graph can still be simple. Another term is isomorphic. Two graphs are isomorphic when they have exactly the same connections between nodes even though they are drawn differently. For example, five nodes that all point to each other can be drawn as a pentagram or rearranged into some other shape, and it is still the same graph.

There is one type of graph that stands out, however: the weighted graph. A weighted graph is just a normal graph with a catch: every line that connects two nodes has a weight, a number describing how much it “costs” to use this line. As a result, a line that directly connects two nodes might end up costing more than a path that goes through 4 or 5 other nodes. Weights let an algorithm make more sophisticated decisions. For example, maybe there is a lot of traffic on the direct road to your destination. A weighted graph lets you determine the fastest route, and you may end up arriving there a few minutes earlier. Of all the graphs mentioned in this post, weighted graphs are the most complex, but they are also the most fundamental to understanding how route-finding works.

Now let’s talk about how to actually find a route. If you have experience with linked lists or trees, or basically anything with nodes, it’s pretty simple to understand. Starting with node 1, you traverse the graph, trying out the available options, until you hit your destination. The algorithm then reports the best route it found. If weights are not present, this is simply the path with the fewest lines. If weights are present, however, it is the path with the least total weight.
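Two standard algorithms match the two cases described above: breadth-first search for graphs without weights and Dijkstra's algorithm for graphs with weights. The post does not name them, so treat this as one possible sketch; the road map and its weights are made up, and all function names are hypothetical.

```python
import heapq
from collections import deque

def rebuild_path(previous, goal):
    """Walk backward from the goal to the start using the 'previous' links."""
    if goal not in previous:
        return None
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]

def fewest_edges_path(graph, start, goal):
    """Breadth-first search: the path with the fewest lines, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbor in graph[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return rebuild_path(previous, goal)

def least_weight_path(graph, start, goal):
    """Dijkstra's algorithm: the path with the smallest total weight, or None."""
    best_cost = {start: 0}
    previous = {start: None}
    heap = [(0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            break
        if cost > best_cost[node]:
            continue  # an older, more expensive entry for this node
        for neighbor, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return rebuild_path(previous, goal)

# Made-up road map: the direct road from 1 to 4 is slow (heavy traffic).
roads = {
    1: {2: 2, 4: 10},
    2: {1: 2, 3: 2},
    3: {2: 2, 4: 2},
    4: {1: 10, 3: 2},
}
print(fewest_edges_path(roads, 1, 4))
print(least_weight_path(roads, 1, 4))
```

Ignoring weights, the direct road wins because it uses a single line; once the traffic weights are counted, the longer-looking route through nodes 2 and 3 is cheaper.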

This is graph theory in a nutshell. There are some other small nuances, but knowing what I have written here will set you up pretty well. Understanding graph theory will help you understand other node-based data structures, like trees and linked lists (although I would start there if you have no coding experience with nodes).