October 6, 2016

Skytree: data science in the hands of the masses

The Center for Data Science's own Vasant Dhar has said the problem with data science is no longer collecting data, but finding the right tools to analyze that data. Skytree, machine learning software presented by Nick Ball and Alexander Gray at the Moore Sloan Data Science Environment, touts itself as making advanced data mining of large datasets available to everyone, not just data scientists.

Data mining is the practice of analyzing large datasets to generate new information or ideas; it has been used by medical companies to identify at-risk patients earlier, and by advertisers to produce better-targeted ads. Skytree offers several levels of analysis for varying levels of proficiency, allowing people without specialized knowledge of data science to take advantage of data science methods. Ball and Gray's Moore Sloan presentation stated that the software can be used in several ways:

  • By entry-level users, such as business analysts without previous data mining experience, through a graphical user interface
  • By intermediate users, such as developers and modelers, via an application programming interface
  • By advanced users, such as data scientists and IT administrators, via the command line

Skytree's claim of making data science methods available to the masses is given credence by the companies in various fields that make use of the software. Credit card networks have used the software to improve their fraud detection; Skytree's speed, along with the ability to update the model daily, resulted in a 5% increase in fraud detection.

The software has also been used by utility companies to predict and prevent energy diversion; by medical device manufacturers to develop a failure prediction system that warns when components are likely to fail; and by online firms for churn prediction, which identifies dissatisfied customers' usage patterns and allows firms to make adjustments for those customers.

"While each specific use case will remain problem-driven, the underlying tools are not dataset-specific," stated Ball and Gray in their presentation. Thus, installing Skytree on one's own computing infrastructure makes the practical use of these algorithms possible for users who are not data mining specialists.

Skytree asserts that its use of a broad spectrum of algorithms also yields better results: because there is no single best machine learning algorithm, the best results require multiple methods.

Skytree also claims its N log N scaling makes machine learning on large datasets more feasible within the same amount of time. Their presentation stated that algorithms that otherwise scale as, e.g., N², for N objects, are implemented to scale linearly, without loss of accuracy.
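To make the scaling claim concrete, here is a minimal Python sketch of the general idea, not Skytree's actual implementation, which the presentation did not detail. It assumes a toy one-dimensional problem: for each of N numbers, find the distance to its nearest other number. The function names are hypothetical. The naive version compares every pair (N² work), while the second version sorts once and checks only adjacent values (N log N work), yet both return identical answers.

```python
def nearest_gap_bruteforce(values):
    """Distance from each value to its nearest other value.

    Compares every pair, so the work grows as N^2.
    """
    return [
        min(abs(v - w) for j, w in enumerate(values) if j != i)
        for i, v in enumerate(values)
    ]


def nearest_gap_sorted(values):
    """Same result, but the work grows as N log N.

    After sorting, the nearest other value must be an adjacent
    neighbor in sorted order, so only two candidates are checked.
    Assumes at least two values.
    """
    # Indices of the input, arranged by ascending value.
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        candidates = []
        if rank > 0:
            candidates.append(values[idx] - values[order[rank - 1]])
        if rank < len(order) - 1:
            candidates.append(values[order[rank + 1]] - values[idx])
        result[idx] = min(candidates)
    return result
```

For a million objects, the pairwise version performs on the order of a trillion comparisons, while the sorted version needs roughly twenty million steps, which is why this kind of restructuring matters on large datasets. Real machine learning algorithms work in many dimensions and typically rely on tree-based data structures rather than a simple sort, but the principle of avoiding all-pairs work while keeping exact answers is the same.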

The final piece of the puzzle is Skytree's compatibility with a wide variety of hardware. The software can run in any environment that runs Linux software. This means it can run on setups as simple as a standalone desktop, as well as on more advanced systems such as distributed clusters and multiple Hadoop distributions.