Why Is Data-Driven ML So Amazing?

Francis Bacon, the father of the scientific method, proclaimed that knowledge should not be based on preconceived notions but rather on experimental data. Traditional hypothesis-driven, or deductive, reasoning is limiting in that it relies on a premise set before the experiment, which restricts the result to conform to that initial premise. Inductive reasoning, on the other hand, generalizes from observed facts to reach a meaningful conclusion.

For example, if every known teenager owns a mobile device and insurance is never sold to a minor (under 18 years), we may infer that none of these devices could be insured. Incorporating an additional data point, that 10% of high school mobile devices are in fact insured, leads us to infer that parents are purchasing the personal property insurance. From a business standpoint, we can immediately develop a targeted marketing campaign aimed at parents for insuring their children's devices. This example shows that induction generalizes and extrapolates to a probable outcome but is never 100% certain. It requires active tuning and analysis of results, yielding a more refined trend with each new piece of information incorporated.

The rapidly growing speed and power of technology has enabled the evolution of inductive, automated data mining with a hypothesis-neutral approach, yielding new correlations, patterns, and rules that have led to numerous new discoveries.

According to some Big Data advocates, "inductive reasoning generally produces no finished status. The results of inferences are likely to alter the inferences already made. It is possible to continue the reasoning indefinitely. The best inductive algorithms can evolve: they 'learn', they refine their way of processing data according to the most appropriate use which can be made. Permanent learning, never completed, produces an imperfect but useful knowledge. Any resemblance with the human brain is certainly not a coincidence."

Currently, the widely accepted scientific method we learn in school is to formulate a hypothesis, test it by experimentation, and attempt to prove or disprove it. Experiments require identifying and testing each specific hypothesis before moving on to the next. For example, to understand why New Yorkers have the highest customer retention rates, we first test a hypothesis that the difference is age based, then one that is gender based, then employment based, and so on. This is a somewhat biased approach, as we must choose one factor at a time and complete the experiment for that factor before moving on.
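
To make this concrete, below is a minimal sketch of the one-hypothesis-at-a-time workflow in Python. The data file ("customers.csv") and the column names (region, age) are hypothetical placeholders, not a real data set.

```python
# One-hypothesis-at-a-time testing, as described above.
# "customers.csv" and its column names are hypothetical placeholders.
import pandas as pd
from scipy import stats

customers = pd.read_csv("customers.csv")
ny = customers[customers["region"] == "NY"]
rest = customers[customers["region"] != "NY"]

# Hypothesis 1: the retention difference is age based.
t_stat, p_value = stats.ttest_ind(ny["age"], rest["age"], equal_var=False)
print(f"age hypothesis: t = {t_stat:.2f}, p = {p_value:.4f}")

# Only after this experiment is complete would we formulate and test the
# next hypothesis (gender, employment, ...), one at a time.
```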

By contrast, technology now enables us to process massive quantities of data simultaneously, giving us a less biased approach. We can conduct data mining and identify differences between two populations (e.g., New Yorkers and residents of other regions) across many criteria at once. False positives can arise with a small sample set, so it is important to use large data sets and re-test findings; as long as the sample is large enough to account for these errors, this data-driven, inductive approach will yield better results in a shorter timeframe.
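
A sketch of that many-criteria-at-once screen, reusing the same hypothetical file and column names as above (plus made-up columns such as tenure_months), might look like this:

```python
# Screen many candidate criteria between the two populations at once,
# rather than running one experiment per hypothesis.
# "customers.csv" and the column names are hypothetical placeholders.
import pandas as pd
from scipy import stats

customers = pd.read_csv("customers.csv")
is_ny = customers["region"] == "NY"
candidates = ["age", "tenure_months", "num_policies", "annual_premium"]

results = []
for col in candidates:
    t_stat, p_value = stats.ttest_ind(
        customers.loc[is_ny, col], customers.loc[~is_ny, col], equal_var=False
    )
    results.append((col, p_value))

# Rank every criterion by how strongly it separates the two populations.
# With a large enough sample, spurious differences (false positives)
# wash out on re-testing.
for col, p in sorted(results, key=lambda r: r[1]):
    print(f"{col:15s} p = {p:.4g}")
```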

Hypothesis- or Model-Driven

Generate a hypothesis, then test it to confirm or disprove it based on analysis of experiments and data.

  • Limits complexity
  • Relies on a deep understanding of the system/process
  • Must be simplified; cannot accommodate unlimited complexity
  • Cannot account for noisy data or variables left out of the model
  • Limited by the complexity the scientist can manage
  • Requires experts (subject matter, technology, consultants)
  • Trial-and-error approach that requires time to discover a suitable model and refine it until the desired results are produced

Data-Driven (ML) Induction

Utilize algorithms to identify connections and correlations that may not have been suspected.

  • Requires a significant volume of data to produce results/outcomes
  • AI tools learn from examples, which need to cover the full spectrum of expected variations as well as null cases (see the sketch after this list)
  • Requires Big Data to achieve meaningful results
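
As an illustration, here is a minimal sketch of such induction, again using hypothetical file, feature, and label names, with a random forest standing in for whatever learner one actually chooses:

```python
# Data-driven induction: let an algorithm learn the pattern from labeled
# examples instead of specifying the model by hand.
# "customers.csv", the feature names, and the "retained" label are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

customers = pd.read_csv("customers.csv")
features = ["age", "tenure_months", "num_policies", "annual_premium"]
X_train, X_test, y_train, y_test = train_test_split(
    customers[features], customers["retained"], test_size=0.2, random_state=0
)

# Fit on the training examples; they should span the expected variations
# (including null cases) for the induced model to generalize.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# The learned model exposes which criteria actually matter, with no
# hypothesis chosen up front, and can be re-trained as new data arrives.
print("holdout accuracy:", model.score(X_test, y_test))
print(dict(zip(features, model.feature_importances_.round(3))))
```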