
The Data Science Platform: Changing the Way You Think About Analytics

Why a data science platform? The status quo in data science is still the 80-20 rule: data scientists spend 80% of their time on data wrangling, and only 20% is left for advanced analytics. That remaining share covers building machine learning models and iterating on them to account for changes in source data and keep the models accurate.

AI deployments have grown by 270% over the past four years. Yet data scientists and machine learning engineers still have a long way to go. According to a 2020 Figure Eight report on the State of AI and Machine Learning, the laborious task of data preparation continues to be a challenge in almost any data-intensive setting. Over 62% of the report’s respondents say they are able to update and maintain their AI models only sometimes or never.

Image 1. The 80/20 rule in data science

Another bottleneck in data science initiatives is the communication gap between technical practitioners such as data scientists or machine learning engineers and line-of-business owners. The way out? The report puts it this way:

“It’s clear that people in line-of-business roles and technical practitioners must do more to collaborate. By getting in the same room, the two groups can work to find common ground when it comes to their AI initiatives.”

Working with data today

An average day in the life of a data scientist consists of identifying, collecting, cleaning, and aggregating data, then modeling the prepared data and tweaking the models so that organizations can generate insights. Data preparation continues to be a challenge for data scientists.

Extracting and processing high-quality data is the essential prerequisite for any data-driven decision-making. Yet these tedious tasks take too much time, leaving little room for the truly creative part of data science: analysis, visualization, and the generation of powerful insights.

The ability to transform data into something useful is in high demand in today’s world. And instead of dedicating a staggering amount of time to managing, cleaning, and/or labeling data, data scientists should be updating their ML models and making progress on AI projects.

The data science platform as an enabler

Image 2. The data science platform with data pods and pipes

Building on years of expertise in data warehousing, Record Evolution has developed an end-to-end data science studio that delivers a wealth of powerful services on top of classic data warehousing. The data science studio addresses the gap between the data preparation phase and the creativity phase in data science work.

Hosted in the cloud, the platform combines the latest data warehousing know-how with a comprehensive data science toolchain. Embedded analytics and the training of machine learning models take place on the platform so that users can make the most of their data at any time.

The Record Evolution data science studio is a data-centric platform that aims to provide the best conditions for data to develop, grow, and transform into value.

Deriving the best insights out of the data, to the fullest extent possible, is a goal woven throughout every feature of the platform.

Like any digital platform, the Record Evolution ecosystem has global coverage: access to the cloud is all that users need to connect to their knowledge for targeted insight generation. On the platform, users can automate an array of data extraction and data preparation tasks to turn these into reproducible processes. With embedded analytics, users can perform ad hoc querying and optimize their data to serve as the basis for real-time decision-making.

A host of automatable loading, historization, versioning, monitoring, and control processes ensures that basic data tasks are handled with guaranteed data quality and integrity. Users can focus on what a data science platform is about: extracting value out of unstructured streams of big data.

Becoming future-oriented

Image 4. Data science platform snapshot with pipe settings

Data science platforms are sophisticated infrastructures with built-in toolchains and a host of integratable tooling. These enable users to build their own solutions and businesses upon them. As a facilitating interface, the Record Evolution data science studio thrives on the growing number of developers and data scientists sharing knowledge on the platform.

Available as a flexible cloud-based service, the platform’s architecture allows for instant scalability. The unified platform serves as a hub for all user data and as a meeting place for teams. It is also a venue for performing complex data tasks collaboratively or independently. Insights derived from the data are instantly shareable, allowing businesses to respond quickly and take action based on their data.

For many small and medium-sized companies, building a data warehouse remains a large investment with an unclear outcome. Shifting towards cloud computing, transitioning from “owning” to “using”, and taking advantage of scalable data science platform services is therefore a clear survival strategy for many. What drives such decisions is the goal of remaining competitive in a volatile market. It is about gaining the upper hand in the continual struggle to generate novelty within an innovation landscape dictated by large corporations.

With its democratic approach to data and open-ended architecture, the Record Evolution data science studio is not simply about retaining a competitive edge on the current data science platform market. Responding to current conditions only, even if this means using the most advanced technology available at a given moment, does not automatically pave the way towards a future.

Building a long-term data strategy

What organizations need is a combination of tooling and infrastructure with foresight. Users are then equipped not only to innovate now but also to continue doing so tomorrow. Data science platforms enable users to keep current with the latest trends in data warehousing and data science, but the larger aim is to become future-oriented. The platform gives its users a stable foundation on which they can continue to build.

Image 5. Overview of the platform architecture

The Record Evolution data science studio enables the creation and maintenance of a long-term data strategy. On the platform, users consolidate data from diverse sources and use that data in a wealth of applications across company functions. As a unified service, the platform keeps all heterogeneous data constantly updated within a single repository, where the various data types are kept in their native formats. From there, data can be accessed for a wide range of data tasks and applications.

All-round analytics

According to Gartner, by 2022, 90% of corporate strategies will handle information as a critical enterprise asset and will view analytics as an essential competency. The Record Evolution data science studio addresses this shift in the current innovation landscape by offering accelerated, unifying, and flexible data analytics for all.

Acceleration in data task handling

The data science studio thrives on a combination of the latest developments in data warehousing and advanced data science know-how. The platform empowers data engineers by taking over the tedious workflow of data preparation and automating as much as possible, so that customers can focus on what matters most: extracting deeper insights out of their data.

Unifying data from disparate sources

Image 6. An overview of the available data sources

The platform supports data collection and analytics on data taken from a variety of heterogeneous sources and formats. Users can create a database connection and start importing data from their database. The currently supported databases are PostgreSQL, MySQL, MariaDB, and MS SQL Server. Connecting to and harvesting data from Amazon S3 buckets, FTP servers, IoT routers, Twitter sources, and the web takes place over a user-friendly interface. The file import is just as intuitive: it supports .csv, .tsv, .txt, .log, .tab, .raw, .data, .db, .dat, .json, .xlsx, and .xls files, as well as .zip archives, and adds the data directly to a raw table.
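To make the idea of a raw table concrete, here is a minimal Python sketch of a delimited-file import. It is an illustration under stated assumptions, not the platform's importer: the function name `load_raw_table`, the extension-to-delimiter mapping, and the choice to keep every value as an untyped string are all hypothetical.

```python
import csv
import io
import os

# Assumed delimiter per extension; the other listed formats are omitted here.
DELIMITERS = {".csv": ",", ".tsv": "\t", ".tab": "\t"}


def load_raw_table(filename, text):
    """Parse delimited text into a raw table: a header plus rows of untyped strings."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in DELIMITERS:
        raise ValueError(f"unsupported file type: {filename}")
    reader = csv.reader(io.StringIO(text), delimiter=DELIMITERS[extension])
    header = next(reader)
    # Values stay as strings; typing and cleaning would happen in later steps.
    rows = [dict(zip(header, record)) for record in reader if record]
    return header, rows


header, rows = load_raw_table("sensors.tsv", "device\ttemp\nA1\t21.5\nB2\t19.0\n")
```

Keeping the raw table untyped means a malformed value cannot make the import itself fail; any conversion is deferred to a later transformation step.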

Heterogeneous data is consolidated into a place where users can leverage structured, unstructured, or semi-structured data from a plenitude of sources. The storage layer of the data science studio can handle massive amounts of diverse data including IoT data from sensors attached to devices, social media data, or log data from websites. The platform’s ingrained elasticity makes it possible to capture massive amounts of incoming data. It also safeguards the availability of that data at all times with automated continuous backups.

Historization and versioning with mELT

Loading and transforming data is one of the most time-consuming tasks in data analysis. Deliveries are rarely clean: for example, a file is provided too late, corrections and new data arrive in the same file, or the contents of several deliveries overlap. The platform’s mELT processes ensure that data imports are always correctly merged with the existing stock. This is possible because the target table has additional knowledge of the objects it contains. If a customer file is loaded twice, mELT recognizes which customers already exist in the target table and makes no changes the second time. If the second file contains corrections, mELT updates only the corrected data, maintaining strict version control.
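The merge behavior described here can be approximated in a few lines of Python. This is only a sketch of the general technique (a key-aware, idempotent merge with a version history), not the mELT implementation. The class `VersionedTable`, its fields, and the sample customer data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Toy target table that knows the business key of the objects it holds."""
    key: str
    current: dict = field(default_factory=dict)  # key value -> latest row
    history: list = field(default_factory=list)  # (version, row) per change

    def merge(self, rows):
        """Merge one delivery and count inserted, updated, and unchanged rows."""
        stats = {"inserted": 0, "updated": 0, "unchanged": 0}
        for row in rows:
            k = row[self.key]
            existing = self.current.get(k)
            if existing == row:
                stats["unchanged"] += 1  # repeated delivery: nothing to do
                continue
            stats["updated" if existing is not None else "inserted"] += 1
            version = sum(1 for _, r in self.history if r[self.key] == k) + 1
            self.history.append((version, dict(row)))
            self.current[k] = dict(row)
        return stats


customers = VersionedTable(key="customer_id")
first_file = [
    {"customer_id": 1, "name": "Ada", "city": "Berlin"},
    {"customer_id": 2, "name": "Bob", "city": "Hamburg"},
]
corrected_file = [
    {"customer_id": 1, "name": "Ada", "city": "Berlin"},
    {"customer_id": 2, "name": "Bob", "city": "Munich"},  # correction
]

first_load = customers.merge(first_file)
repeat_load = customers.merge(first_file)          # same file loaded twice
correction_load = customers.merge(corrected_file)  # only customer 2 changes
```

Loading the same file twice leaves the table untouched, and the corrected delivery adds a second version for the one customer whose data changed while the earlier version stays in the history.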

Data modeling

Relationships between tables are an essential construct used to describe information in larger contexts. Truly valuable insights are usually obtained by combining many interconnected items within one analysis. Well-designed relationships between table entities are therefore an essential prerequisite for future data analyses. These will enable you to explore your data in depth. There is no need to follow a particular modeling scheme—the platform’s reporting engine can analyze any ER model.

Reports and workbooks

Image 7. A snapshot of the platform’s data science workbooks

To create a report, users simply navigate in the data model along the defined relationships and select attributes for the report. Users can drill through the entire data model and are not tied to particular data model schemas, as most OLAP tools require. For a better overview of the results of their reports or predictive models, users can inspect the SQL query created in the background. Together with the report, the studio automatically generates a suitable visualization of the results.
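As a rough illustration of how navigating along relationships can translate into a SQL query, consider the following Python sketch. It does not reproduce the platform's reporting engine. The `RELATIONSHIPS` mapping, the table and column names, and the function `build_report_query` are assumptions made for the example.

```python
# Hypothetical data model: table -> {related table: (local column, foreign column)}
RELATIONSHIPS = {
    "orders": {"customers": ("customer_id", "id")},
    "customers": {"regions": ("region_id", "id")},
}


def build_report_query(start, path, attributes):
    """Build a SELECT that walks `path` from `start` along defined relationships."""
    joins = []
    current = start
    for target in path:
        if target not in RELATIONSHIPS.get(current, {}):
            raise ValueError(f"no relationship from {current} to {target}")
        local_col, foreign_col = RELATIONSHIPS[current][target]
        joins.append(
            f"JOIN {target} ON {current}.{local_col} = {target}.{foreign_col}"
        )
        current = target
    return "\n".join([f"SELECT {', '.join(attributes)}", f"FROM {start}", *joins])


query = build_report_query(
    "orders", ["customers", "regions"], ["regions.name", "orders.amount"]
)
print(query)
```

Because each join is derived from the relationship definitions, the report can reach any table connected to the starting point, regardless of whether the model follows a star, snowflake, or other schema.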

This is also the place for developing deep learning algorithms and other AI applications. Developers can write code and create documents in a workbook editor directly within the platform. In the editor, users can execute custom queries on their data model using the full power of PostgreSQL SELECT statements or use Python cards. Like any tool on the platform, the workbook editor is geared towards full automation. Users can have their workbook cards executed by the platform’s automation manager in a data-driven way. Other workbook environments such as Jupyter or Zeppelin, as well as other BI tools, can be used as an alternative to the local workbook via the platform’s direct access interface.

Infographics

Specific, fine-tuned data visualizations can be generated in the infographics environment. The in-platform infographics editor enables the creation of custom visualizations, descriptions, and interactive elements. This is where you craft customized interactive graphics with animated features. With a little bit of code (d3.js), users can breathe life into existing templates and build their visualizations directly in the browser. The reports’ data is made available as an embeddable link that can, among other things, be integrated into external customer dashboards.

Acceleration is woven into every aspect of the Record Evolution data science studio. As a data science enabler, the platform is streamlined for simplicity combined with high-speed analytics.

Collaborative knowledge distribution in teams

Various teams within an organization can securely access the same data and collaborate on the platform directly. There is no need to copy warehouses locally or move the data elsewhere. Rather than using locally copied data that is static and quickly becomes outdated, multiple users within a data science team can work with the platform data in real time and be sure they are accessing the latest update. Data sharing takes place across physical boundaries. This creates a global environment for analytics that changes the way we think of business contexts and intra-organizational collaboration.

Data sharing within a secure environment is the democratic ideal of every cloud data science platform. With Record Evolution, users within a data science project enjoy enhanced sharing capabilities and community-building features inspired by the user interface of social media platforms such as Twitter. Each user has a profile and a dashboard. These include a list of the Data Pods they own, co-manage, or access in a read-only function.

Image 8. An overview of the data pod internal structure

Users can follow the publicly available work of other users across the globe. They can invite collaborators to work on a specific data warehouse, request access to publicly available use cases, and assign user privileges of different types to their collaborators. This also makes it possible to create a completely private and invisible data warehouse for a select group of collaborators only.

Record Evolution encourages spontaneous collaboration for the benefit of knowledge and insight.

Data Pods

Data Pods are shareable in public and private mode. A public Data Pod is visible to all users globally. Users can think of public Data Pods as open-access repositories. These enable the free and unlimited transfer of knowledge and know-how across physical boundaries. This grassroots approach to data analytics foregrounds the free movement of data anywhere across the globe. It aims to be inclusive at its core.

Private Data Pods are suitable for company-specific tasks or the handling of sensitive data. Users can take ownership of their data by assigning access privileges and controlling who works with their data, and when.

With private Data Pods, you can:

  • work in a secure environment where data is protected from unauthorized access
  • control membership and access privileges
  • work in select groups with a full overview of who does what
  • generate insights on high-potential projects
  • exchange insights securely and privately with partners across the globe to speed up development processes

With public Data Pods, you can:

  • share interesting use cases
  • generate leads
  • create a following for your data cases
  • showcase work to collaborate remotely with other developers

Knowledge sharing: security, reliability, scale

Classic data warehouses, especially on-premises ones, are extremely difficult to scale. This rigidity can be an obstacle when you need a rapid response to a change in market conditions. Organizations often incur tremendous costs for network, CPU cores, memory, and storage resources, yet these resources may remain completely unused by their data warehouse.

Scaling up or down as needed

Users can scale the amount of required compute resources either up or down, effective immediately. Users can determine the exact amount of compute power they need for each Data Pod. This is how you add flexibility and independence to each individual data warehouse on the platform. As compact lightweight data warehouses, Data Pods are optimized to process massive data loads reliably and without delay.

Geared towards maximal performance, the platform’s Data Pods are autonomous entities. Each Data Pod runs on its own infrastructure and compute resources. This means that the platform can handle multiple data loads and processing jobs simultaneously without hurting overall efficiency. Further, the compute resources for each Data Pod are user-specified. This enables users to leverage a scalable amount of compute power as needed. Scaling up enables users to use resources as their needs grow and as teams get larger.

Security

Data encryption, network monitoring, and access control are the basic constituents of a comprehensive cloud security offering. This is how you make sure your platform is fully compliant and committed to protecting user data at all times. In compliance with the necessary laws and regulations, the Record Evolution data science studio protects and stores user data in a secure environment.

The platform is geared towards establishing secure connections for all data harvesting and analytics processes. Strict authentication procedures are in place to protect user identity. Granular authorization practices enable the assignment of role-specific access and functions to protect the integrity of sensitive data. Security technologies such as data encryption upon ingestion, encryption key management, user-managed keys, and backups make sure that data democratization takes place in a safe ecosystem.

  • Minimum operating effort: Less complexity due to automatic performance, security, and high availability.
  • End-to-end encryption: Automatic encryption of data in transit and at rest.
  • Comprehensive access protection: The security functions include several authentication mechanisms and role-based access control.
  • Strong data delimitation: Different data areas can be separated by creating several Data Pods. Within a Data Pod, access to data can also be controlled via internal mechanisms. This allows any organization to enforce strict security and GDPR (DSGVO) guidelines for both internal and external data access.
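Role-based access control per Data Pod, as described above, might look roughly like the following Python sketch. The role names mirror the owner, co-manager, and read-only distinction mentioned earlier, but the permission sets, the `DataPod` class, and its methods are hypothetical and say nothing about how the platform enforces access internally.

```python
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    CO_MANAGER = "co-manager"
    READ_ONLY = "read-only"


# Assumed mapping of roles to permitted actions.
PERMISSIONS = {
    Role.OWNER: {"read", "write", "manage_members"},
    Role.CO_MANAGER: {"read", "write"},
    Role.READ_ONLY: {"read"},
}


class DataPod:
    def __init__(self, name, owner, public=False):
        self.name = name
        self.public = public
        self.members = {owner: Role.OWNER}

    def can(self, user, action):
        role = self.members.get(user)
        if role is None:
            # Non-members may only read, and only if the pod is public.
            return self.public and action == "read"
        return action in PERMISSIONS[role]

    def grant(self, actor, user, role):
        if not self.can(actor, "manage_members"):
            raise PermissionError(f"{actor} may not manage members of {self.name}")
        self.members[user] = role


pod = DataPod("sales-private", owner="alice")
pod.grant("alice", "bob", Role.READ_ONLY)
```

Each pod carries its own membership list, so separating sensitive data into its own pod keeps its access rules independent of every other pod.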

A data science studio for collaborative analytics

The Record Evolution data science studio offers an end-to-end analytical data environment. You create a data warehouse in seconds. Processes across the entire data journey are automatable. The data science studio takes over the tasks of data extraction and data preparation, which allows you to focus on what matters most: extracting actionable insights out of your data. The platform also allows for granular multi-level collaboration across company functions. It virtually puts different specialists at different locations in the same room.
