Finding better boundaries
Let’s revisit the 240-ft elevation boundary proposed previously to see how we can improve upon our intuition.
Clearly, this requires a different perspective.
By transforming our visualization into a histogram, we can better see how frequently homes appear at each elevation.
While the highest home in New York is ~240 ft, the majority of them seem to have far lower elevations.
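To make the histogram idea concrete, here is a minimal sketch that counts homes per elevation bin. The elevations, the 50-ft bin width, and the `histogram` helper are illustrative assumptions, not the data behind the charts in this article.

```python
from collections import Counter

# Hypothetical elevations in feet (made up for illustration).
ny_elevations = [5, 12, 30, 41, 240]
sf_elevations = [20, 80, 150, 190, 310]

def histogram(values, bin_width=50):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

ny_hist = histogram(ny_elevations)
sf_hist = histogram(sf_elevations)

for name, hist in (("New York", ny_hist), ("San Francisco", sf_hist)):
    for lower_edge, count in hist.items():
        # One '#' per home in the bin gives a rough text histogram.
        print(f"{name:>13} {lower_edge:>3}-{lower_edge + 49:<3} ft {'#' * count}")
```

In this toy sample, most New York homes land in the lowest bin even though one sits at 240 ft, which is the same pattern the histogram above makes visible.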
Your first fork
A decision tree uses if-then statements to define patterns in data.
For example, if a home’s elevation is above some number, then the home is probably in San Francisco.
In machine learning, these statements are called forks, and they split the data into two branches based on some value.
That value between the branches is called a split point. Homes to the left of that point get categorized in one way, while those to the right are categorized in another. A split point is the decision tree’s version of a boundary.
Picking a split point has tradeoffs. Our initial split (~240 ft) incorrectly classifies some San Francisco homes as New York ones.
Look at that large slice of green in the left pie chart: those are all the San Francisco homes that are misclassified. These are called false negatives.
However, a split point meant to capture every San Francisco home will include many New York homes as well. These are called false positives.
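The tradeoff can be sketched in a few lines. This assumes San Francisco is the "positive" class, that a home is predicted to be in San Francisco when its elevation is above the split point, and that the homes listed are hypothetical; `evaluate_split` is a name invented for this example.

```python
# Hypothetical (elevation_ft, city) pairs, made up for illustration.
homes = [
    (5, "NY"), (12, "NY"), (30, "NY"), (41, "NY"), (240, "NY"),
    (20, "SF"), (80, "SF"), (150, "SF"), (190, "SF"), (310, "SF"),
]

def evaluate_split(homes, split_point):
    """Predict SF when elevation > split_point and count both kinds of error."""
    # SF homes that fall at or below the split are missed: false negatives.
    false_negatives = sum(1 for elev, city in homes
                          if city == "SF" and elev <= split_point)
    # NY homes that fall above the split are wrongly flagged: false positives.
    false_positives = sum(1 for elev, city in homes
                          if city == "NY" and elev > split_point)
    return false_negatives, false_positives

high_split = evaluate_split(homes, 240)  # strict split, like the ~240 ft one
low_split = evaluate_split(homes, 15)    # low enough to catch every SF home

print("split at 240 ft (FN, FP):", high_split)
print("split at  15 ft (FN, FP):", low_split)
```

With this sample, the high split misses most San Francisco homes but never mislabels a New York one, while the low split does the opposite.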
The best split
At the best split, the results of each branch should be as homogeneous (or pure) as possible. There are several mathematical methods you can choose between to calculate the best split.
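As one example of such a method, the sketch below scores candidate split points with Gini impurity, a common purity measure, and keeps the one with the lowest weighted impurity. The choice of Gini, the sample data, and the helper names are assumptions for illustration; the article does not say which method its charts use.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for an even two-class mix."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try the midpoint between each pair of neighboring distinct values.

    Returns (split_point, weighted_impurity) for the purest pair of branches.
    """
    distinct = sorted(set(values))
    best_point, best_score = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        point = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= point]
        right = [lab for v, lab in zip(values, labels) if v > point]
        # Weight each branch's impurity by its share of the homes.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score

# Hypothetical elevations (ft) and cities, made up for illustration.
elevations = [5, 12, 30, 41, 240, 20, 80, 150, 190, 310]
cities = ["NY"] * 5 + ["SF"] * 5

split_point, impurity = best_split(elevations, cities)
print(f"best split: {split_point} ft, weighted Gini impurity: {impurity:.2f}")
```

The impurity at the best split is still above zero here, because each branch keeps one home from the other city.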
As we see here, even the best split on a single feature does not fully separate the San Francisco homes from the New York ones.
To add another split point, the algorithm repeats the process above on the subsets of data. This repetition is called recursion, and it is a concept that appears frequently in training models.
The histograms to the left show the distribution of each subset, repeated for each variable.
The best split will vary based on which branch of the tree you are looking at.
For lower elevation homes, price per square foot, at X dollars per sqft, is the best variable for the next if-then statement. For higher elevation homes, it is price, at Y dollars.
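The recursive procedure described above might look like the following sketch. It assumes Gini impurity as the purity measure, uses only two hypothetical features (elevation and price per square foot), and runs on made-up homes, so the split values it finds are not the X and Y from the article.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, split_point) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        distinct = sorted({r[f] for r in rows})
        for lo, hi in zip(distinct, distinct[1:]):
            point = (lo + hi) / 2
            left = [lab for r, lab in zip(rows, labels) if r[f] <= point]
            right = [lab for r, lab in zip(rows, labels) if r[f] > point]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, point), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Split the data, then repeat the same process on each subset."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        # Leaf: no useful split remains, so predict the most common city.
        return Counter(labels).most_common(1)[0][0]
    f, point = split
    left = [(r, lab) for r, lab in zip(rows, labels) if r[f] <= point]
    right = [(r, lab) for r, lab in zip(rows, labels) if r[f] > point]
    return {
        "feature": f,
        "split_point": point,
        "left": build_tree([r for r, _ in left], [lab for _, lab in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [lab for _, lab in right],
                            depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the if-then forks until a leaf (a city name) is reached."""
    while isinstance(tree, dict):
        side = "left" if row[tree["feature"]] <= tree["split_point"] else "right"
        tree = tree[side]
    return tree

# Hypothetical homes: (elevation_ft, price_per_sqft), made up for illustration.
rows = [(5, 1200), (12, 1400), (30, 1100), (41, 1600), (240, 500),
        (20, 600), (80, 700), (150, 1300), (190, 900), (310, 1500)]
labels = ["NY"] * 5 + ["SF"] * 5

tree = build_tree(rows, labels)
print(tree)
```

On this sample, the first fork is on elevation, and each branch then picks its own split point on price per square foot, so the second fork differs depending on the branch.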