
The Essential Capabilities of Data Science Platforms in 2020: An Overview

In March 2020, Gartner published its report on the critical capabilities of data science and machine learning platforms. The report looks at the offerings of 16 vendors currently dominating the market and rates them on criteria such as data access, data preparation, data exploration and visualization, delivery, collaboration, and scalability.

In what follows, you will find an overview of the key critical capabilities that Gartner addresses in its report along with basic definitions. Also, you will find out how the data science platform Repods fits into this picture and how it incorporates these capabilities. 

Below are the essential capabilities of data science platforms that Gartner identifies across a variety of platform vendors:

Data Access, Broadly Conceived

The first essential component of data science platforms relates to a platform’s capacity to extract and integrate data from a variety of heterogeneous sources. These may include on-premises sources as well as data located in the cloud. 

Users should look out for data connectors as well as built-in capabilities for the import and consolidation of data from different sources without the use of external tools. The platform should be able to handle increasingly large and distributed volumes of diverse data.

According to Gartner, the essential features that a data science platform needs to cover are these:

  • Access to different data types
  • Multi-cloud and hybrid data sources 
  • Basic Extract, Transform, Load (ETL)
  • Integration of web data and IoT data as data sources
  • Data refresh and synchronization
  • Real-time data feeds
  • Data governance and metadata management
  • Data lake support
  • Enterprise application access
  • Hadoop and NoSQL access
  • Data lineage
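
To make the "basic ETL" item more concrete, here is a minimal sketch in Python using only the standard library. The sample data, the column names, and the `readings` table are made up for illustration, and the code does not show how Repods or any other platform implements data import.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this could come from a file, an API, or a database.
RAW_CSV = """sensor_id,reading,unit
a1,21.5,C
a2,,C
a3,70.7,F
"""

def extract(raw):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transform: drop rows without a reading and convert all values to Celsius."""
    cleaned = []
    for row in rows:
        if not row["reading"]:
            continue  # skip missing values
        value = float(row["reading"])
        if row["unit"] == "F":
            value = (value - 32) * 5 / 9
        cleaned.append((row["sensor_id"], round(value, 2)))
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into a target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, celsius REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database, used as a stand-in target
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute(
    "SELECT sensor_id, celsius FROM readings ORDER BY sensor_id"
).fetchall()
```

After the run, `rows` holds the two sensors that had a reading, both expressed in Celsius. A platform with built-in connectors would replace the hand-written `extract` and `load` steps.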

Find out more about the data import options of the data science platform Repods, including import from files, IoT routers, Web and Twitter sources, FTPs, S3 Buckets, and external databases.

All-Round Data Preparation

This term captures the capacity of a data science platform to explore the data in depth, perform data cleansing or data transformation tasks, and combine the data in a flexible way. Data science platforms are expected to have a built-in infrastructure to support various data transformation jobs. Some platforms may even offer suggestions to ensure data quality or to perform dataset partitioning.

As data preparation is traditionally seen as a laborious manual task, a viable data science platform is expected to allow for the automation of most manual work. 

Below are the data preparation capabilities that Gartner identifies as essential to data science platforms:

  • Blending, binning, smoothing
  • Transformation, aggregation, and set operations
  • Data cataloging, data labeling, and data annotation
  • Machine learning and algorithm data preparation
  • Search and filter options
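
As a rough illustration of the first item in the list above, blending, binning, and smoothing can each be sketched in a few lines of plain Python. The function names and the sample data are assumptions made for this example. A platform would typically offer these operations as built-in, configurable steps.

```python
import bisect

def blend(left, right, key):
    """Blending: inner-join two lists of dicts on a shared key.

    If the key occurs more than once in `right`, the last occurrence wins.
    """
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

def bin_values(values, cut_points):
    """Binning: map each value to a bin index; cut_points must be sorted.

    A value equal to a cut point falls into the higher bin.
    """
    return [bisect.bisect_right(cut_points, v) for v in values]

def moving_average(values, window):
    """Smoothing: simple moving average over a sliding window."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical sample data
customers = [{"id": 1, "age": 17}, {"id": 2, "age": 40}, {"id": 3, "age": 70}]
orders = [{"id": 2, "total": 99.0}, {"id": 3, "total": 15.5}]

blended = blend(customers, orders, "id")
age_bins = bin_values([c["age"] for c in customers], [18, 40, 65])
smoothed = moving_average([1, 2, 3, 4, 5], 3)
```

Here `blended` keeps only the customers that have an order, `age_bins` assigns each age to one of four groups, and `smoothed` is shorter than its input because the window only slides over complete positions.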

Find out more about transforming data with data pipes in Repods or creating a data model on the platform. 

Built-in Data Visualization

This capability refers to the way a data science platform lets its end users explore the data and interact with it. It covers basic reporting tasks, statistical analyses, and the detection of patterns and correlations in the data. Platforms also typically offer a choice of visualization options, including interactive dashboards and charts that are updated in near real time, as well as custom visualizations such as interactive infographics built with d3.js.

Gartner identifies the following key capabilities to look out for:

  • Augmented data discovery
  • Univariate and bivariate statistics
  • Statistical significance testing
  • Clustering and self-organizing maps
  • Geolocation mapping
  • Affinity and graph analysis
  • Conjoint and survey analysis
  • Density estimation
  • Similarity metrics
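
To illustrate what "univariate and bivariate statistics" means in practice, here is a small Python sketch. The two series are invented sample data, and the `pearson` helper is written out by hand for clarity rather than taken from any platform.

```python
import math
import statistics

def pearson(xs, ys):
    """Bivariate statistic: Pearson correlation coefficient of two series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)

# Hypothetical sample data
hours = [1, 2, 3, 4, 5]
output = [2, 4, 6, 8, 10]

# Univariate statistics describe one variable at a time.
summary = {
    "mean": statistics.mean(hours),
    "median": statistics.median(hours),
    "stdev": statistics.stdev(hours),  # sample standard deviation
}

# Bivariate statistics describe the relationship between two variables.
r = pearson(hours, output)
```

Because `output` is an exact multiple of `hours`, the correlation comes out at 1 up to floating-point rounding. Real data would usually give a value somewhere between -1 and 1.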

For more information about analyses and reporting with Repods, see here. For information on how to create infographics on the platform, check out this page.

Machine Learning (ML) & Advanced Analytics

Machine learning platforms traditionally support multiple models out of the box or offer the option of custom coding. Typical capabilities include importing, developing, and testing predictive models. Examples of supported techniques are deep learning, neural networks, reinforcement learning, transfer learning, regression, time-series analysis, Bayesian modeling, classification and regression trees, ensembles, and hierarchical models.

A further capability is the platform’s ability to integrate additional statistical methods as well as optimization, simulation (predictive analytics), and other analytics into the development environment. Optimization functionalities may include solver and heuristic approaches, as well as the design of experiments. Simulation functionalities involve building a model to study its behavior and gain insight into possible outcomes.

In Repods, you code in data science workbooks and can choose between PostgreSQL and Python cards. You can also work with Jupyter or Zeppelin via the platform’s Direct Access interface. Markdown cards let you add formatted documentation snippets to your workbook using simple Markdown syntax. Find out more about the data science workbooks in Repods.

Flexibility and Openness

Broadly conceived, this involves giving data scientists the freedom to use their favorite methods and tools when working on the platform. Data science platforms need to be able to support the most relevant tools, languages, libraries, and frameworks as well as various other open-source offerings.

This means support for languages such as Python, R and Scala; Jupyter or Zeppelin notebooks; visualization tools such as d3; open-source machine learning tools and data management platforms such as Spark and Hadoop; Docker; as well as the general ability to integrate third-party offerings. 

For a tentative list of services you can integrate into Repods, see here.

Ease and Speed of Delivery

This covers the ease and speed with which you can move models from your development environment to a production environment. For example, data science platforms should enable you to create APIs or containers for faster deployment.
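
As a sketch of what "creating an API" for a model can look like at its core, the following Python snippet wraps a model in a function that accepts a JSON request body and returns an HTTP-style status code with a JSON response. The weights, the `features` field, and the function names are hypothetical. A real deployment would put a web framework or the platform's own API layer in front of such a handler.

```python
import json

# Hypothetical coefficients of an already-trained linear model.
WEIGHTS = [0.5, 0.25]
INTERCEPT = 1.0

def predict(features):
    """Score one observation with the linear model."""
    return INTERCEPT + sum(w * x for w, x in zip(WEIGHTS, features))

def handle_request(body):
    """Turn a JSON request body into a (status code, JSON response body) pair."""
    try:
        payload = json.loads(body)
    except ValueError:  # includes json.JSONDecodeError
        return 400, json.dumps({"error": "body is not valid JSON"})

    features = payload.get("features") if isinstance(payload, dict) else None
    valid = (
        isinstance(features, list)
        and len(features) == len(WEIGHTS)
        and all(
            isinstance(x, (int, float)) and not isinstance(x, bool)
            for x in features
        )
    )
    if not valid:
        return 400, json.dumps(
            {"error": f"expected a list of {len(WEIGHTS)} numeric features"}
        )
    return 200, json.dumps({"prediction": predict(features)})

status, response = handle_request('{"features": [1.0, 2.0]}')
```

Keeping validation and scoring in one plain function makes the handler easy to test before it is exposed through an endpoint or packaged into a container.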

For more information on external access and API reference in Repods, see our documentation.

Platform Management and Collaboration Capabilities

Platform management covers questions of security (e.g. data encryption), compute resource management, data governance, the version management and reusability of projects, as well as auditing and reproducibility.

In terms of regulatory compliance, data science platforms should be able to facilitate auditing and respond to regulatory challenges. Platforms are expected to offer runtime optimization, multi-user capabilities in the sense of project management, as well as debugging and logs. 

The multi-user management aspect leads us to what counts among the indispensable capabilities of data science platforms: collaboration. This includes the platform’s ability to support different types of collaborations and to facilitate workflows and projects for teams at different locations and in different departments of an organization. This involves collaboration between data scientists, engineers, business analysts as well as nontechnical business specialists and users. Key here is the process transparency made possible by collaboration. 

To find out more about the collaborative features of the data science platform Repods, see here.

User Interface (UI) and Coherence

This criterion encapsulates the visual design of a data science platform. Here we look at how intuitive and responsive a platform is to its end users. The aim is to achieve a coherent “look & feel” for a maximally user-centric experience. UIs may be optimized for developer-focused data science but may also involve drag-and-drop workflow creation that lowers the entry barrier for non-expert-level data scientists. The key aspects here include the ease of use and learning curve, contextual aids, the platform documentation, the consideration of user communities for enhanced collaboration, and even customizable algorithms. 

The coherence criterion considers how consistent, integrated, and intuitive the data science platform is across the end-to-end data science process and for a variety of user types. This involves usability, delivered as a seamless end-to-end experience. But it also considers the general flexibility of the platform solution, as well as the speed, level of standardization, and the consistency of the platform’s “look & feel”.

Find out more about the data science workbooks in Repods. 

Automation

This involves the ability to automate tasks and to augment the iterative search for models among a set of pre-established candidates. According to Gartner, the functionalities that data science platforms should address include:

  • Data preprocessing, preparation and discovery
  • Feature learning and feature engineering
  • Dimensionality reduction and feature selection
  • Algorithm selection
  • Model tuning, deployment, and monitoring
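
The "algorithm selection" item can be pictured as a loop over pre-established candidates, each fitted on training data and scored on held-out data. The Python sketch below assumes two deliberately simple candidates, a mean baseline and a one-variable linear fit, and invented sample data. Automated tooling on a real platform would search far larger candidate sets and also tune hyperparameters.

```python
import statistics

def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    """Candidate 2: ordinary least squares with a single input variable."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return statistics.fmean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])

def select_model(candidates, train, valid):
    """Fit every candidate on `train`; return the name with the lowest validation error."""
    scores = {
        name: mse(fit(*train), *valid)
        for name, fit in candidates.items()
    }
    return min(scores, key=scores.get), scores

# Hypothetical data following y = 2x + 1, split into training and validation sets.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
train = (xs[0::2], ys[0::2])
valid = (xs[1::2], ys[1::2])

best, scores = select_model({"mean": fit_mean, "linear": fit_linear}, train, valid)
```

On this data the linear candidate wins, because it reproduces the underlying relationship while the baseline ignores the input entirely.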

More information about the various options to automate, monitor and control processes on the Repods platform can be found in the Monitor & Control section of the product documentation. 

Performance and Scalability

This criterion considers the time it takes to load data and to create and deploy models, the ability to iterate, and, broadly speaking, the ability to provide faster, real-time insights that facilitate decision-making within an organization.

Key considerations include big data volume scalability, real-time data, cloud computing capabilities, in-memory computing, support for GPUs and other specialized hardware, and comprehensive cost guidance, among others.

Insights from Platform Users

Gartner identifies the users of data science platforms as data scientists, data engineers, statisticians, and citizen data scientists. To include user sentiment in its research, the analyst firm supplements its classic reporting with Gartner Peer Insights, a tool that helps IT and business leaders find out more about the products available in the data science and machine learning platform market.

Peer Insights includes comprehensive ratings and review pages for each platform offering. The rationale, according to Gartner, has been to provide “an enterprise-grade reviews and ratings platform that is open to everyone”.

According to these user reports, there is a clear preference for platforms that allow data science teams to build their own custom solutions, as opposed to exclusively purchasing packaged applications or outsourcing the data science work to a service provider. And this is exactly what we do!

As a service about to enter the marketplace, we are excited to learn what you think of the data science platform Repods. To explore the possibilities of the platform, sign up and create a free Data Pod with a full set of demo data. Your free Data Pod will always remain free and you can create as many as you want!

