
What is the Difference between Delta Processing and Full Processing?

The ability to query data and generate compelling data visualisations and insights is the result of having solid extract, transform, and load (ETL) processes in place in your data studio. Two loading paradigms dominate: full load processing and delta processing. Full load processing means that the entire data set is imported each time a data source is loaded into the data studio. Delta processing, on the other hand, means loading the data incrementally: only the new or changed source data is loaded, at specific pre-established intervals.

But how does data preparation work? First, the raw data is extracted from the various data sources in its original format. Second, it is transformed into a unified format; cleansing tasks such as removing duplicates or faulty records are performed, and data integrity is maintained throughout. The third step is staging: the data enters a temporary location from which it can easily be rolled back if errors are detected.

In the last step, the data is loaded into target tables (or core tables) and becomes part of the long-term data history. This history includes both existing data and new data that is continuously being loaded. The step usually involves rigorous testing across the data pipeline and the application of dedicated data processing techniques.
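As a rough illustration of these four steps, here is a minimal sketch using pandas and SQLite as stand-ins for a real data studio; the file, table, and column names are hypothetical.

```python
import sqlite3

import pandas as pd

# 1. Extract: read the raw source in its original format (a CSV file here).
raw = pd.read_csv("sensor_readings.csv")  # hypothetical source file

# 2. Transform: unify formats, drop duplicates and faulty records.
raw["reading_ts"] = pd.to_datetime(raw["reading_ts"])
clean = raw.drop_duplicates(subset=["device_id", "reading_ts"])
clean = clean[clean["value"].notna()]

con = sqlite3.connect("studio.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS core_sensor_readings "
    "(device_id TEXT, reading_ts TEXT, value REAL)"
)

# 3. Stage: write to a temporary table that can easily be dropped or rolled back.
clean.to_sql("stg_sensor_readings", con, if_exists="replace", index=False)

# 4. Load: move the validated rows from staging into the long-term core table.
with con:  # wraps the load in a transaction
    con.execute(
        "INSERT INTO core_sensor_readings (device_id, reading_ts, value) "
        "SELECT device_id, reading_ts, value FROM stg_sensor_readings"
    )
```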


Various data preparation and processing techniques are in place to make sure that the data that enters the long-term data histories is complete and free of error. Full load processing and delta processing are the most prominent ones. In this article, we take a closer look at these two approaches and pinpoint the advantages and disadvantages of each.

 

Full Processing (Full Load) 

Full processing is always executed on the full raw data set. All processing steps downstream of the raw data layer are considered temporary: they can be recovered from the raw data at any time. Full processing runs include every transformation step, including every historical transformation step.

Image 1. Full Processing
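To make the idea concrete, here is a minimal sketch of a full-processing run, reusing the hypothetical SQLite setup from above: every run drops the derived layer and rebuilds it from the complete raw history.

```python
import sqlite3


def full_load(con: sqlite3.Connection) -> None:
    """Full processing: drop every derived layer and rebuild it from the raw history."""
    with con:
        con.execute("DROP TABLE IF EXISTS analysis_daily_avg")
        con.execute(
            """
            CREATE TABLE analysis_daily_avg AS
            SELECT device_id,
                   date(reading_ts) AS day,
                   avg(value)       AS avg_value
            FROM core_sensor_readings      -- always the complete raw/core data set
            GROUP BY device_id, date(reading_ts)
            """
        )


full_load(sqlite3.connect("studio.db"))
```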

The Advantages of the Full Load Approach

  • Code versioning corresponds to data versioning.
  • Retroactive changes can be implemented with ease.
  • The transformation processes are simpler.
  • The approach allows for easy traceability back to the raw data.
  • The analysis DB can be skipped if necessary, and data marts can be created directly from the raw data.

Some Disadvantages of the Full Load Method in Data Processing

  • The full load approach often makes data processing extremely slow and computationally expensive. Because of these limitations, loading is often feasible only on a weekly basis at best.
  • Corrections are implemented slowly as the smallest data error requires a new full load.
  • Data is provisioned in release-like cycles rather than continuously.
  • The raw data history must be permanently held for access.
  • If several version levels need to remain accessible, multiple analysis DBs must be created. This leads to extensive data redundancy.
  • No single source of truth for analysts in the analysis DB.
  • In the case of direct mart creation, many transformations are executed multiple times. This can lead to substantial delays.
  • Ultimately, this approach results in extremely high total data volumes.

Delta processing takes a different approach: instead of pulling all the data out of a table time and time again, only the new additions and modifications are extracted.

Delta Processing

How does delta processing work? The target tables are updated only with modified or entirely new data records. So with delta processing, we do not take the entire contents of a table but only the data that has been added since the last load. The system recognises which rows within a table have already been pulled, which ones have been modified, and which ones are entirely new.

In delta processing, all processing steps behind the analysis layer are considered temporary. They can be restored from the analysis layer at any time. With this approach, only the current transformation steps are picked up and executed only on the new data packages. Delta processing also provides functions for retroactive correction of the dataset in the analysis layer.

Image 2. Delta Processing
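A minimal sketch of the delta idea, again with hypothetical SQLite tables: a watermark records how far the data has already been loaded, only rows added or modified since then are extracted, and they are upserted into the target (this assumes a unique key on (device_id, reading_ts) and an updated_at column in the source).

```python
import sqlite3


def delta_load(con: sqlite3.Connection) -> None:
    """Delta processing: load only rows added or changed since the last run."""
    with con:
        # Watermark: the point in time up to which data has already been loaded.
        last_load = con.execute(
            "SELECT COALESCE(MAX(loaded_until), '1970-01-01') FROM load_log"
        ).fetchone()[0]

        # Extract only the rows that are new or were modified since then.
        delta = con.execute(
            "SELECT device_id, reading_ts, value FROM source_readings "
            "WHERE updated_at > ?",
            (last_load,),
        ).fetchall()

        # Upsert: insert new rows, let modified rows overwrite their old version.
        con.executemany(
            "INSERT INTO core_sensor_readings (device_id, reading_ts, value) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(device_id, reading_ts) DO UPDATE SET value = excluded.value",
            delta,
        )

        # Advance the watermark for the next run.
        con.execute("INSERT INTO load_log (loaded_until) VALUES (datetime('now'))")


delta_load(sqlite3.connect("studio.db"))
```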

The Advantages of Delta Processing

  • Fast loading times with significantly less computational load. 
  • Fast corrections and adjustments in the analysis DB. 
  • The approach makes it possible to perform high-frequency loading up to stream processing (real-time processing).
  • Multiple version statuses can be efficiently mapped in one database.
  • Raw data can be archived without loss of history.
  • The approach provides a “single source of truth” for all analysts in the analysis DB.
  • Ultimately, this approach results in lower total data volumes.

The Disadvantages of the Delta Method

  • High complexity: for all its advantages, this method involves complex loading processes, as the deltas have to be matched against the existing data stock.
  • Delta loading processes are not suited to full loads. The initial load requires an additional, technically optimised full-processing run; otherwise it is comparatively slow.
  • Code versioning is not sufficient and additional data versioning in the analysis DB is required.
  • Traceability back to the raw data is somewhat more complex.

In a nutshell, incremental loading is significantly faster than the full processing (full load) approach. At the same time, delta loading is more difficult to maintain. You are not able to re-run the entire load in the case of an error the way you can with full processing. 

All things being equal, companies continue to prefer bulk loads while focusing more and more on real-time stream processing, as dictated by current business objectives.

The goal continues to be simplification, meaning a sustained interest in faster loading times and reducing complexity. 

With the Data Studio by Record Evolution, you extract data from a variety of sources, automate load jobs, and integrate with any of your favorite tools thanks to our RESTful API. Data visualisation, analysis, and building ML models can also take place directly on the platform to generate insights fast while using the data platform as a unified data hub within your organisation.


About Record Evolution

We are a data science and IoT team based in Frankfurt, Germany, that helps companies of all sizes innovate at scale. That’s why we’ve developed an easy-to-use industrial IoT platform that enables fast development cycles and allows everyone to benefit from the possibilities of IoT and AI.



How We Optimize Tableau Server as a PDF Production Engine

For over a decade, Tableau has been one of the big players in the realm of analytics and reporting platforms. The Tableau Desktop application helps BI developers produce well-groomed visualizations. Tableau Server is an enterprise-level platform for sharing reports and collaborating on reporting projects. The server is aimed at dynamic interaction with its users, e.g. filtering data to observe a specific dashboard presentation. At the same time, many sectors still rely on documenting results as static reports (PDFs or CSV files). This applies, for example, to areas such as bookkeeping and yearly inspections. In what follows, we consider a common use case for reporting businesses: how to optimize the production of 1000s of static PDF copies of dynamic reports on Tableau Server.

We present the experience we acquired from a client project we recently closed at the credit rating and risk management firm parcIT. The project entailed data engineering and analytics in a Postgres data warehouse, reporting (a dedicated data visualization design in Tableau Server), and automation (application development with Node.js and Docker, with a clean frontend written in Javascript, HTML, and CSS). 

In cooperation with our client, we optimized the production line, presented the data respecting their corporate design & banking guidelines, and secured the quality of the reports via automated tests. The target of the project was to produce 1000s of reports (e.g. credit portfolio, credit rating, model comparison) for one of the biggest bank groups in Germany: Volksbanken und Raiffeisenbanken, as well as various private banks. 

How About Scaling? 

The scaling of existing reporting solutions relies on the resources of the Tableau Server. This is often quite flexible, as Tableau Server runs not only on bare-metal Windows machines but also in clusters as virtual machines. A containerized Linux implementation is also possible but trickier to set up than the Windows version. This suggests that at-scale production of static reports would be easy via an upgrade of the machine. However, there are several lessons to learn here before you can conclude that your machine needs some boosting.

First Steps to Optimization

There are several ways to optimize report performance (i.e. production time). If you are dealing with millions of rows and about 50-100 columns, you should separate the data sources per dashboard. This is how you improve the dynamic performance. It is also a good idea to choose hierarchical global filters so that the context is set up once for all dashboards.

As a second improvement, the data sources should be extracted into Tableau's Hyper format (Hyper is the in-memory data engine that Tableau Server uses for extracts). Live data sources may be necessary for some use cases. However, if your data is bulky, a live connection update will take ages to be served. Beyond this, one can keep track of and remove unused columns and keep the data flow as clean as possible. With all that in place, what about static reports?

The Issue with Tableau and PDFs

The Tableau software was not built for the mass production of static reports such as PDFs. So be ready for pitfalls if you need to process millions of data rows presented on 10s of dashboards that are based on 100s of workbooks in order to produce 1000s of PDF reports. One might think this would be a trivial task for a conventional Windows Server with 64 GB RAM and an Intel 8-core 2 GHz CPU. This hypothetical Windows Server machine is powerful enough as long as the reports are served dynamically to a limited number of users (in the order of 10s).

However, when dealing with PDFs, the internal optimizations of Tableau work against you. Let’s first understand what we are dealing with under the hood. 

Figure 1. Building a reporting framework: an overview

The Status Quo

There are two main ways to produce PDFs, PNGs, and CSVs from dynamic reports via filters, e.g. reporting year (Y), company (C), and business case (B). Under the conditions given in the introduction, the production time will range from 10s of seconds to a few minutes per report. This is not ideal, especially knowing that you may have to reproduce the whole batch a few times in case of errors.

One way to produce a PDF is Tabcmd, a command-line tool for Tableau Server. It is installed on your local machine and sends web requests to the Tableau Server, which returns a PDF report. The second way is the REST API exposed by the Tableau Server; here, too, the resulting PDF file is downloaded to your local machine. Both can be automated with common tools such as Python or Node.js applications to produce multiple reports by running through the mentioned filters: Y, C, and B.
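As an illustration of the REST API route, here is a minimal sketch using the tableauserverclient (TSC) Python package; the server URL, token, view name, and filter fields are placeholders.

```python
import tableauserverclient as TSC  # assumption: the tableauserverclient package

# Placeholder credentials, URLs, and names; replace with your own.
auth = TSC.PersonalAccessTokenAuth("token-name", "token-secret", site_id="reporting")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(auth):
    # Locate the dashboard view to export (hypothetical name).
    view = next(v for v in TSC.Pager(server.views) if v.name == "Credit Portfolio")

    # Apply the report filters: reporting year (Y), company (C), business case (B).
    opts = TSC.PDFRequestOptions(
        page_type=TSC.PDFRequestOptions.PageType.A4,
        orientation=TSC.PDFRequestOptions.Orientation.Portrait,
    )
    opts.vf("Reporting Year", "2021")
    opts.vf("Company", "Bank A")
    opts.vf("Business Case", "Rating")

    # Download the rendered PDF for this view.
    server.views.populate_pdf(view, opts)
    with open("credit_portfolio_2021_bank_a.pdf", "wb") as f:
        f.write(view.pdf)
```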

REST API vs Tabcmd

There is one important difference between the two methods: full report export vs. single-dashboard export. Working with the REST API, you need to download the pages one by one and merge them to produce the full report. With Tabcmd, on the other hand, the report comes as a whole, including additional pages that you have to discard later. Both methods can come in handy in different cases. For example, many additional test dashboards may call for page picking.
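Because the REST API route returns one PDF per dashboard, the single files have to be merged into the full report afterwards. A minimal sketch, assuming the pypdf package and locally downloaded page files:

```python
from pypdf import PdfWriter  # assumption: the pypdf package

# Hypothetical per-dashboard exports, downloaded one by one via the REST API.
pages = ["01_cover.pdf", "02_portfolio.pdf", "03_rating.pdf"]

writer = PdfWriter()
for page in pages:
    writer.append(page)  # append each single-dashboard PDF in order

with open("full_report.pdf", "wb") as f:
    writer.write(f)
```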

The two methods also share one glitch that is structural to the server itself: when dealing with PDFs, the internal optimizations of Tableau Server work against you. Tableau uses caching for the dynamic optimization of internal queries (which, unfortunately, cannot be turned off). Over time, this makes the exports slower at a rate of about 0.5% per report. That is no problem for 10s of reports, obviously, but the slow-down adds up to 500% for 1000 PDFs.

This poses practical problems. At some point, production may become too slow to continue. It will interfere with other users' activities. And the backend application might drop the database client connection, depending on the idle-time settings. This is problematic for the enterprise-level automation of reporting.

Figure 2. PDF production benchmarking

At this point, we assume that workflow automation is managed by a backend that connects to a DB and to a Tableau Server. The connection pools to both servers provide the necessary data and metadata, and the production metadata is finally logged back to the DB. That said, we assume that the bottleneck is still PDF production on the Tableau Server, and under the conditions given above it will be. Otherwise, you need to fix the bugs in your application first, say in the DB queries or the workflow management.

The Takeaway

Our observation is that the REST API performs much better than Tabcmd in the case studies mentioned above, roughly by a factor of 2-3. This is not an order-of-magnitude improvement, but the difference between 6 hours and 18 hours matters when we consider 8-hour workdays. Remember that Tableau Server was not designed for PDF mass production. Reducing the production time to under 10 seconds per report is a very intricate task and might require a real boost to your existing computational resources (presumably 2x to 4x).

If your purpose is to get your PDF reports mass-produced, we can guarantee to improve the production mechanism for you.

Disclaimer: we have not made a true per-resource efficiency argument here. The REST API is greedy: while Tabcmd uses only a fraction of the CPU, the REST API will almost always claim the full power allocated to computations, i.e. all CPU resources minus the background processes.

Get in Touch for Projects 

At Record Evolution, we have been consulting on data science and IT projects for many years. We help credit reporting companies enhance business insights using state-of-the-art visualization tools such as Power BI, Tableau, and Qlik, all of which can be customized and extended using native Extension APIs.

Get in touch to get all the details on implementing Tableau reporting tools to get the most out of your data visualizations.

Taskin Deniz, Ph.D.
Senior Data Scientist
Record Evolution GmbH
✉️ Contact us



How We Add Value to Reporting Businesses with Tableau Extensions

Data-driven decision-making is an essential pillar of business development and investment. It enhances the communication of good ideas, makes insights tangible to non-experts, and creates trust between clients and service providers. And surely, reporting businesses can only benefit from clean data pipelines and well-groomed data visualization. In what follows, I outline our perspective on how reporting businesses can enhance insight generation with customized dashboards.

The most important aspect of reporting visualizations is that they should be tailored to the needs of the target audience.  For example, financial and insurance risk analysts are familiar with crunching numbers and presenting reports to executives as a digestible product. These reports can cover numerous topics, various time intervals, and vast geographical data. For instance, the time interval of the credit reports can vary from yearly portfolios to hourly stock price dynamics. Additionally, they can cover the businesses of several institutions all over the world or just one bank in a given region. The granularity of the data presented in reports is only one of the aspects that demand flexibility and automation.

Introducing Tableau Extensions

There are various out-of-the-box solutions on the market to answer this demand. They provide the required flexibility and automation at a reasonable cost and with minimum staffing. In this article, we focus on what we can do with Tableau. The analytics and visualization platform Tableau has been catering to the needs of reporting firms for over a decade. Tableau is a dashboard-style presentation tool with lightweight data analysis options, sample code, and various resources for the Tableau developer. It is quick to learn initially, well adjusted to various data sources and file types, intuitive to use, and fun to experiment with.

Image 1. Tableau’s Extensions Gallery

On the one hand, the Tableau software provides several figure and table types as well as formatting options for visualization. On the other hand, it is mainly a drag-and-drop tool that has its limitations. For example, a live feedback widget on your risk assessment reports cannot be built using any of Tableau’s drag-and-drop features. Luckily, there is a way to develop a custom-tailored app that can be embedded into the Tableau dashboard by using the Tableau Extensions API.

Where Custom Extensions Come into Play

Tableau Extensions provide a headless browser-like interface in dashboards and an API to embed an application that can write/read data to/from the existing data sources and dashboards or communicate to a database. In a nutshell, it can accomplish almost all tasks that modern browsers can implement (i.e. similar to Chromium).

Customization with extensions

Customizations with extensions can provide the missing pieces of your report to address all the needs of the client. This includes comments by Tableau users, questionnaires, new visualizations, spell-checking, and custom testing, among others. (You might find an application for your needs in the Tableau Extensions Gallery.) Dealing with extensions is not as trivial as dealing with drag-and-drop features, however. It may require years of experience in web application development, depending on the complexity of the case.

As a concrete use case showing the power of extensions, consider data quality tests (DQTs) for the data presented on dashboards. If the data source is an SQL server, the extension backend can be customized with SQL queries to create quality tests on the fly. This is an intricate job if one uses the alternatives instead: either the Tableau REST API or the Tableau command-line utility.

Image 2. Custom visualizations

Once developed via the Extensions API, on-dashboard tests can reduce non-trivial human errors, help analysts debug data source queries, and increase the scientific precision of the reports. As a result, decision-makers at any level benefit from a trusted source of information that helps them decide whether to invest aggressively or to step back from a potentially destructive decision.
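To sketch what such on-the-fly data quality tests might look like on the backend side, here is a minimal example assuming a Postgres source and the psycopg2 package; the table names, checks, and connection string are hypothetical.

```python
import psycopg2  # assumption: a Postgres data source accessed via psycopg2

# Hypothetical quality checks for the data behind a dashboard.
# Each query returns 0 when the check passes.
CHECKS = {
    "no_null_ratings": "SELECT count(*) FROM credit_ratings WHERE rating IS NULL",
    "no_future_dates": "SELECT count(*) FROM credit_ratings WHERE rating_date > now()",
}


def run_quality_tests(dsn: str) -> dict:
    """Run every check and return {check_name: passed}."""
    results = {}
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for name, query in CHECKS.items():
            cur.execute(query)
            results[name] = cur.fetchone()[0] == 0
    return results


# The extension frontend can request these results and flag failed checks
# directly on the dashboard.
if __name__ == "__main__":
    print(run_quality_tests("dbname=reporting user=report_reader"))
```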

Custom visualizations

Another example is custom visualization: say, a tree-type graph of company mergers over the course of several years. Such a graph is not a native figure type in the Tableau dashboard. But using well-maintained Javascript libraries (e.g. D3), an experienced developer can build the graph to enhance a manager’s view of the data. A Tableau Extension can provide an expandable tree graph with the merger information presented as a table upon hovering over nodes. The merger graph can also be used as a dynamic filter to choose the data to be presented on a dashboard.

The Value Created by Extensions

Depending on the needs of your company or clients, custom enhancements of reports have the potential to add significant value to your business.

At Record Evolution, we have been consulting on data science and IT projects for many years. We help credit reporting companies enhance business insights using state-of-the-art visualization tools such as Power BI, Tableau (Tableau Server), and Qlik, all of which can be customized and extended using native Extension APIs.

Get in touch to get all the details on implementing Tableau Extensions to get the most out of your Tableau data visualizations.  

Taskin Deniz, Ph.D.
Senior Data Scientist
Record Evolution GmbH
✉️ Contact us




Your Data Architecture: Simple Best Practices for Your Data Strategy

If you accumulate data assets on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture. 

In what follows, we offer a short overview of the overarching capabilities of a modern data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.  

Defining data architecture

You need to build a data analytics ecosystem that is attuned to your organization’s commercial strategy. It also has to fully align with your specific requirements when it comes to managing large volumes of data. Think of your data architecture as an interface between business goals and technical processes within an organization. You have a set of tools and practices used to manage your data pipeline. Then, you supplement these with processes aiming to transform big data. And you deliver it in an insight-ready form to those who will consume the outcomes.

Data architectures, therefore, have to start with the data consumers and prioritize their perspective. You have to be clear on specific consumer requirements such as speed and availability. You need to think about the order of magnitude — the data volume may be crucial for deciding on the final enterprise data architecture. But you also need to be aware of the scalability options and the level of automation required for your particular scenario.  

A data architecture is not a data warehouse

A data warehouse is an IT-centric formation. Whereas a data warehouse may be part of data architecture, it remains just one constituent of something more complex and expansive than a mere warehousing solution. Today’s data warehouses have become more flexible and may also fit well into the requirements of a contemporary analytics ecosystem. This overarching term, according to Wayne Eckerson, encapsulates a novel understanding of data architectures whereby the “new data environment is a living, breathing organism that detects and responds to changes, continuously learns and adapts, and provides governed, tailored access to every individual”. 

A data architecture is not a data platform

A data platform acts as an enabling entity. It builds on underlying database engines to gather and combine data coming from various sources. A data platform, therefore, is a hub for integrating heterogeneous data. This is where you can perform transformations and analytics and create reports and visualizations. A data platform facilitates the complex movement of data thanks to its built-in functionalities. These include, for example, engines and a toolchain that perform the data processing and prepare the data in an insight-ready form that can be consumed by the decision-makers within the organization.

From an enterprise architecture perspective, data platforms are part of a continuum. Hereby data-related technical processes are interlinked with business rationale and vice versa. The concept of data architecture additionally incorporates the business goals and stakeholder values building up the data strategy of an organization. 

Data architecture best practices: our checklist

Implementing an end-to-end digital data architecture requires, first and foremost, an assessment of your key use cases and a careful look at future business requirements. In the first step, you need to revise your existing best practices. Look into use cases to determine which processes and values have been conducive to their success. Something to consider is how your use cases work within the broader context of your market strategy and the business needs you are pursuing. It is only after you have reviewed these business-specific realities that you can concentrate on building your data architecture. 

Let us look at the key features of a viable, future-ready data architecture:

Focusing on user-centricity

Data architectures need to start with the very business users in mind. The data itself, the underlying technology facilitating ETL processes, data transformations, analytics, reporting, and visualization are all secondary to inherent business requirements and the users behind them. Cultivating user-centricity is just as central to the success of a data architecture as the ability to grow and evolve together with the needs of business users. 

Safeguarding flexibility and elasticity

Data architectures have to remain maximally flexible to adapt to volatile business necessities. As they need to serve a variety of users, data architectures need to provide a versatile catalog of features, capabilities, and integrations that can make them adapt to a breadth of business cases and market conditions. Further still, architectures need to be elastic and scalable. They need to keep current not only with business realities but also with dynamic data processing requirements. 

Ensuring a seamless data flow

Managing and maintaining the constant influx of high volumes of data is one important requirement for data architectures. Your data journey, from the source that is harvested to the business consumers, has to be seamless and maximally streamlined. The data architecture carries and transforms the data via various pipelines. The interconnected pipes are constructed out of data objects that can be re-adapted and re-utilized in a variety of new scenarios. This is how they serve the changing needs within the organization. This guarantees that users get their insight-ready data at the end of the day. 

Automated, with built-in intelligence

A seamless data flow can be achieved via automated processes with built-in real-time anomaly detection and alert-triggering mechanisms. So, on top of your data architecture, it is best to have machine learning/AI that keeps the data moving. AI adds to the elasticity of data architectures as it enhances their learning capabilities. This is how you enhance your data architecture’s capacity to adjust and respond to changing conditions.

Considering security and data governance

A data architecture must be fully compliant with privacy regulations and data protection laws such as GDPR. All data should be encrypted before ingestion and personally identifiable information (PII) should be anonymized. For more information on this, have a look at our article “Data Anonymization Techniques and Best Practices: A Quick Guide”. A data catalog is created for the various data elements to identify unusual activity such as unauthorized usage. It also manages the life cycles of data objects and simply makes sure that all data locations and data-related activities are as intended. Further, each user, depending on their function and data access requirements, is allocated a user-specific point of access to the data architecture.

If you have opted for the cloud, you will find a comprehensive overview of data security strategies for cloud computing in our article “Data Security in the Cloud: Key Concepts and Challenges”. 




What Makes A Cloud Data Platform Your Best Friend?

A cloud data platform is your partner in delivering value and responding to the concrete needs of companies dealing with data-driven processes. Many of these companies depend on the capacity of data to generate insights. And the mission of the cloud data platform is to ease the path towards these insights.

Another thing to consider is what platform capabilities you require if you are to meet those needs. For example, a cloud data platform should be elastic enough to be able to respond to the dynamics of changing business objectives and a volatile market. 

A third thing to consider is the data itself. According to DBTA magazine, “to deliver value, a data platform must understand data at a very deep and granular level.” 

A grasp of user needs, platform capabilities that respond to those needs, and a data-centric approach to the data itself equip your data team with a holistic mindset for data platform selection and implementation.

So why are we turning to cloud computing and what does a cloud data platform do for you?

A cloud data platform grows with you

Cloud data platform architectures are designed for a fluxional and expansive data ecosystem. Instead of looking at rigid and static models such as the data warehouse or the database, a cloud data platform thrives on its capacity to re-calibrate according to customer needs. You can leverage a cloud data platform’s data processing capabilities to the fullest regardless of scale. 

A cloud data platform integrates

Cloud data platforms can be fully interoperable, allowing customers to plug in their favorite (open-source) tools and work with them from within the platform. Cloud data platforms such as the data science studio by Record Evolution can also come with a fully operational built-in infrastructure and a complete toolchain covering all data-related processes starting with data harvesting and transforming the data in data pipes, all the way to reporting and visualizing. 

The alternative to this would be purchasing and juggling between a variety of individual specialized tools. Apart from offering low user comfort, this solution may bring along the risk of vendor lock-in. It can also make it impossible to migrate your insights if you decide to switch to another tool. The bottom line with cloud data platforms is that you have one unified, holistically built environment that keeps your data, tools, and insights in one place so you can access them at any time. 

It unifies your data

Ideally, a cloud data platform provides a data hub bringing together data from different sources: streaming data from IoT data routers, data collected from web sources, FTP connections, S3 buckets, Twitter sources, or even files. The data is then cleaned in data pipes, consolidated, and transformed to deliver the insights you need to extract. Find out how to establish connections with various data sources here.

It unites your data tasks

On the advanced analytics platform, you can see how the tasks of the data engineer, the data scientist, and the data analyst are united and work in concert to serve the data journey. You start with a data import from a variety of sources and move on to data transformation in pipes. Once you have obtained cleaned and structured high-quality data, you can start thinking about creating a data model. On the platform, you write code in workbooks using SQL, Python or Markdown and generate reports or infographics. You can also go one step further towards customizing, as in the case of custom infographics.

It transcends industry borders

Cloud data platforms are open-ended and customizable. You have built-in and/or integratable tools to structure, analyze, and generate insight from any kind of data. Customers across industry borders can make use of the built-in analytics tools to generate reports and visualize data. It’s not about the industry. It’s about the data, the things that data can do, and the experience of delivering high-quality outcomes based on that data in a secure environment. The Record Evolution data science studio is not built for just one particular type of business customer or one particular industry but can serve a variety of use cases.

It simplifies and accelerates 

But a cloud platform is also about user comfort. It reflects the drive to offer a fully seamless, one-gesture data journey. A collaborative setup and enabling architectures allow you to do more things and do them much faster. Automation capabilities and streamlined machine learning make it possible to speed up your development cycle significantly. With a simplified and custom automated flow, you have more space for sophisticated tasks and room for experimentation. 

For the cloud: The Record Evolution data science studio

Built for the cloud, the Record Evolution data science studio has been created to encapsulate these virtues. A democratic, decentralized approach to data, the centrality of collaboration and a mindset of togetherness, as well as a focus on people and their data (or better still, data and its people), have been at the core of our vision.

And what is it that you, as an individual, can do on the platform?

Automate data extraction and data preparation

On the platform, you automate an array of data extraction and data preparation tasks to turn these into reproducible processes. Powered by embedded analytics, you can perform ad hoc querying and optimize your data to serve as the basis for real-time decision-making. A host of automatable loading, historization, versioning, monitoring, and control processes ensures that basic data tasks are handled with guaranteed data quality and integrity.

Data historization and versioning mechanisms

The mELT processes ensure that data imports are always correctly merged with the existing stock. This is possible because the target table has additional knowledge of the objects it contains. If a file is loaded twice, mELT recognizes which items already exist and doesn’t make any changes the second time. If the second file contains corrections, mELT will update only the corrected data, maintaining strict version control.
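As a rough illustration of the general idea behind such an idempotent merge (not the actual mELT implementation), here is a minimal pandas sketch with hypothetical column names: unknown records are inserted, unchanged records are skipped, and corrected records replace their old version.

```python
import pandas as pd


def merge_file_into_stock(stock: pd.DataFrame, incoming: pd.DataFrame,
                          key: str = "record_id") -> pd.DataFrame:
    """Idempotent merge: loading the same file twice changes nothing;
    corrected records replace their old version."""
    merged = stock.set_index(key)
    new = incoming.set_index(key)

    # Records with an unknown key are inserted.
    inserts = new.loc[~new.index.isin(merged.index)]

    # Known records are updated only if any value actually differs.
    common = new.loc[new.index.isin(merged.index)]
    changed = common[(common != merged.loc[common.index]).any(axis=1)]
    merged.update(changed)

    return pd.concat([merged, inserts]).reset_index()


# Loading the same file a second time leaves the stock unchanged.
stock = pd.DataFrame({"record_id": [1, 2], "value": [10, 20]})
file_ = pd.DataFrame({"record_id": [2, 3], "value": [25, 30]})  # 2 corrected, 3 new
stock = merge_file_into_stock(stock, file_)
stock = merge_file_into_stock(stock, file_)  # second load: no changes
print(stock)
```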

Create a long-term data strategy 

The data science studio enables the creation and maintenance of a long-term data strategy. On the platform, users consolidate data from heterogeneous sources. The data is then used in a wealth of applications across company functions. As a unified service, the platform keeps all heterogeneous data constantly updated within a unified repository where the various data types are kept in their native formats. From there, data can be accessed for a wealth of tasks (such as the creation of machine learning models) and for running a variety of applications.

Speed up decision-making processes by collaborating on a single platform

Various teams within an organization can securely access the same data and collaborate on the platform directly. There is no need to copy data warehouses locally or move the data elsewhere. Rather than using locally copied data that is static and quickly becomes outdated, multiple users can leverage the platform data in real-time, making sure that they are accessing the latest update. Knowledge sharing takes place across physical boundaries. This creates a global environment for analytics that changes the way we think of business contexts and intra-organizational collaboration.




How to Evaluate Data Platforms for Your Organization

And why a database is not enough:

Companies and organizations are increasingly using existing data to generate additional value. Traditionally the task of business administration analysts, data analysis has become indispensable to organizations. Companies need efficient long-term data architectures to support them in the ongoing effort to stay on top of current data-related challenges. In what follows, we discuss the aspects and technical intricacies that need to be addressed in building a data architecture and show how to evaluate the data platforms that best suit company needs.

A data platform is not a database. Whereas databases are the foundation of data platforms, they do not equip you to handle analytics. A data platform, on the other hand, acts as an additional level on top of a database that is optimized to serve that purpose. In what follows, we illustrate the amount of functionality that is required to sustain a basic long-term data strategy within a company.

What is a data platform?

A data platform is a service that allows you to ingest, process, store, access, analyze, and present data. These are the defining features of what we call a data platform. When we evaluate data platforms, we prefer to break down a data platform into the following parts:

Data Warehouse: ingest, process, store, access

Business Intelligence: analyse and present

Data Science: Statistics and Artificial Intelligence (a special form of analysis)

Storing and processing data is at the heart of a data platform. However, this is only the beginning. To help you evaluate data platforms and for an overview of the most common data management tasks, we have compiled a list of 18 criteria concerning the data engineering and analytics cycle:

  1. Data Architecture (infrastructure, scaling, database)
  2. Performance
  3. Import Interfaces
  4. Data Transformation (ETL)
  5. Process Automation
  6. Monitoring
  7. Data Historization
  8. Data Versioning
  9. Surrogate Key Management
  10. Analysis / Reporting
  11. Data Science Work Area
  12. External Access / API
  13. Usability
  14. Multi-User Development Process
  15. In-Platform Documentation
  16. Documentation
  17. Security
  18. Costs

In the following sections, we provide a short introduction to each of these criteria without going into technical details.

How to evaluate data platforms in detail


1. Data architecture

The core data architecture is a key aspect of a data platform architecture but by far not the only one. To find a suitable storage and database solution that meets your requirements, you can choose from three basic options:

Relational database

Mature database systems with built-in intelligence for the efficient handling of large datasets offer the most expressive analysis tools. However, they are more complex to maintain. Today, these systems are also available as distributed systems and can handle extremely large datasets. Examples include PostgreSQL with the Citus extension or the Greenplum cluster solution, MariaDB/MySQL with Galera Cluster, Amazon Redshift, Oracle, MSSQL, etc.

NoSQL sharding database

Also designed to handle extremely large datasets, these systems sacrifice some of the classical features of relational databases for more power in other areas. They offer much less analytical power but are easier to manage, have high availability, and allow for easy backups. There are efforts to mimic the analytical power of SQL that is available in the relational database world with additional tools such as Impala or Hive. If you have huge amounts of data, specific requirements for streaming, or real-time data, you should take a look at these specialized data systems. Examples include Cassandra, the Hadoop Ecosystem, Elasticsearch, Druid, MongoDB, Kudu, InfluxDB, Kafka, neo4j, Dgraph, etc.

File-based systems

It is possible to design a data strategy solely on files. File formats such as the Parquet standard enable you to use affordable storage to accommodate very large data sets distributed over a multitude of storage nodes or on a cloud object store like Amazon S3. The main advantage is that the data storage system alone is sufficient to respond to data queries. In the two options described above, by contrast, you need to run services on extra compute nodes to respond to data queries. With a solution such as Apache Drill, you can query Parquet files with comfort similar to SQL.
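Apache Drill is one option. As a lightweight illustration of the "SQL directly on files" idea, here is a minimal sketch using DuckDB (our substitution for Drill, chosen for brevity) against a hypothetical set of Parquet files:

```python
import duckdb  # assumption: DuckDB as a lightweight stand-in for SQL-on-files

# Query (hypothetical) Parquet files directly; no database server is required.
result = duckdb.sql(
    """
    SELECT device_id, avg(value) AS avg_value
    FROM 'readings/2021/*.parquet'
    GROUP BY device_id
    """
).df()

print(result.head())
```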

When looking for the hardware architecture to support your data architecture, you have a few basic options:

  1. You can build your platform relying on services offered by major cloud vendors. On cloud platforms such as AWS, Azure, or Google Cloud, you can plug together a selection of simple services to create a data platform that covers our list of criteria. This might look simple and cheap on a small scale but can turn out to be quite complex and expensive when you scale up and need to customize.
  2. In contrast, there are platforms based on self-operated hardware including cloud virtual machines and individual software stacks. Here you have a maximum of flexibility but also need to address many of the criteria on our list by creating your code and custom solutions.
  3. Finally, complete independent cloud data platforms such as Repods, Snowflake, Panoply, or Qubole cover, to a greater or lesser extent, all items from our list of 18 criteria for the data engineering and analytics cycle.

2. Performance

A key criterion when it comes to platform choice, performance is mostly influenced by the database subsystem you choose. Our rule of thumb: The higher your performance requirements, the more specialized your database system choice should be.

3. Import interfaces

We categorize import interfaces into four different sections:


I. Files

Files remain the most common form of data today.

II. Web Services

Plenty of web services with relevant data are available online.

III. Databases

Although many organizations store their data in traditional databases, in most cases direct database access is not exposed to the internet and therefore remains unavailable to cloud data platforms. Web services can be placed in between on-premise databases and cloud services to handle security aspects and access control. Another alternative is the use of ssh-tunneling over secure jump hosts.

IV. Real-time Streams

Real-time data streams as delivered by messaging routers (speaking WAMP, MQTT, AMQP, etc.) are still underutilized today but are gaining in significance with the rise of IoT.

4. Data transformation ETL

Data imported into the data platform usually has to undergo some data transformations before it can be used for analysis. This process is traditionally called ETL (Extract, Transform, Load). Data transformation processes usually create a table from the raw data, assign data types, filter values, join existing data, create derived columns/rows, and apply all kinds of custom logic to the raw data. Creating and managing ETL processes is sometimes called data engineering and is the most time-consuming task in any data environment. In most cases, it takes up to 80% of the overall human effort. Larger data warehouses can contain thousands of ETL processes with different stages, dependencies, and processing sequences.

5. Process automation

When you have many sources, targets, and multiple data transformation processes in between, you also have numerous dependencies. This comes with a certain run-schedule logic. The automation of processes is part of every data platform and involves a variety of processes of increasing complexity. For the scheduling of processes alone, a variety of dedicated tools such as Apache Airflow, Automate, Control-M, or Luigi is available.

Process automation also requires you to manage the selection of data chunks that are to be processed. For instance, in an incremental load scenario, every process execution needs to incrementally pick specific chunks of source data to pass on to the target. Data Scope Management is usually implemented by a metadata-driven approach. There are dedicated metadata tables that keep track of the process state of each chunk and can be queried to coordinate the processing of all chunks.
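A minimal sketch of the metadata-driven approach, using a hypothetical SQLite metadata table: each chunk has a state, and every run picks up only the chunks that are still pending.

```python
import sqlite3

con = sqlite3.connect("platform_meta.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS chunk_state "
    "(chunk_id TEXT PRIMARY KEY, state TEXT DEFAULT 'pending')"
)


def process_pending_chunks(con: sqlite3.Connection, load_chunk) -> None:
    """Pick up all pending chunks, process them, and record the resulting state."""
    pending = [row[0] for row in con.execute(
        "SELECT chunk_id FROM chunk_state WHERE state = 'pending'")]
    for chunk_id in pending:
        try:
            load_chunk(chunk_id)        # the actual ETL step for this chunk
            new_state = "done"
        except Exception:
            new_state = "failed"        # to be investigated and re-queued later
        with con:
            con.execute("UPDATE chunk_state SET state = ? WHERE chunk_id = ?",
                        (new_state, chunk_id))


# Register two (hypothetical) source files as chunks and process whatever is pending.
with con:
    con.executemany("INSERT OR IGNORE INTO chunk_state (chunk_id) VALUES (?)",
                    [("sales_2021_01.csv",), ("sales_2021_02.csv",)])
process_pending_chunks(con, load_chunk=lambda chunk: print("loading", chunk))
```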

6. Monitoring

Larger data warehouse systems can easily contain hundreds of tables with hundreds of automated ETL processes managing the data flow. Errors appearing at runtime are almost unavoidable. Many of these errors have to be handled manually. With this amount of complexity, you need a way to monitor the processes on the platform.

7. Data historization

The need to manage longer histories of data is at the core of each data platform effort. The data warehousing task itself could be summarized as the task of merging separate chunks of data into a homogeneous data history. As data is naturally generated over time, there arises the need to supplement an existing data stock with new data. Technically speaking, time ranges in tables are tracked using dedicated time-range columns, and data historization is the efficient management of these time ranges when new data arrives. As the most common approach for the management of such emerging data histories, data historization differs from data versioning in that historization is concerned with real-life timestamps, whereas versioning is usually concerned with technical insert timestamps (see the section below).
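A minimal sketch of such time-range management, with hypothetical valid_from/valid_to columns in SQLite: when a changed value for an object arrives, the currently valid record is closed and a new open-ended record is inserted.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE customer_history "
    "(customer_id INTEGER, city TEXT, valid_from TEXT, valid_to TEXT)"
)

OPEN_END = "9999-12-31"


def historize(con: sqlite3.Connection, customer_id: int, city: str, as_of: str) -> None:
    """Close the currently valid record if the value changed and open a new time range."""
    with con:
        # Close the open record when the incoming value differs.
        con.execute(
            "UPDATE customer_history SET valid_to = ? "
            "WHERE customer_id = ? AND valid_to = ? AND city <> ?",
            (as_of, customer_id, OPEN_END, city),
        )
        # Open a new record only if no open record is left (value changed or is new).
        still_open = con.execute(
            "SELECT count(*) FROM customer_history WHERE customer_id = ? AND valid_to = ?",
            (customer_id, OPEN_END),
        ).fetchone()[0]
        if still_open == 0:
            con.execute(
                "INSERT INTO customer_history VALUES (?, ?, ?, ?)",
                (customer_id, city, as_of, OPEN_END),
            )


historize(con, 1, "Frankfurt", "2020-01-01")  # first version of the record
historize(con, 1, "Berlin", "2021-06-15")     # closes the old range, opens a new one
historize(con, 1, "Berlin", "2021-07-01")     # unchanged value: nothing happens
```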

8. Data versioning

By versioning data, you can track data corrections over time for the later recovery of old analyses. Versioning allows you to apply non-destructive corrections to existing data. When comparing versioning capabilities, you have to consider the ease of creating versions and the ease of recovering or querying versions. Versioning can be handled on different system levels:

a) Create version snapshots on the storage subsystem (similar to backups).

b) The underlying database system might come with support for version tracking.

c) Versioning might be handled by the data warehouse system.

d) Versioning can be implemented as a custom transformation logic in userspace.

9. Surrogate key management

Data platforms are used to consolidate data from many sources with different identifiers for the respective objects. This creates the need for new key ranges for the imported objects and the need to maintain them throughout consecutive imports. These new keys are called surrogate keys. Creating and maintaining these keys efficiently is no simple task.
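A minimal sketch of the idea, using a hypothetical SQLite mapping table: each (source system, source key) pair receives a surrogate key once and keeps it across all subsequent imports.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE key_map ("
    "surrogate_key INTEGER PRIMARY KEY AUTOINCREMENT, "
    "source_system TEXT, source_key TEXT, "
    "UNIQUE (source_system, source_key))"
)


def surrogate_key(con: sqlite3.Connection, source_system: str, source_key: str) -> int:
    """Return the stable surrogate key for a source identifier, creating it if needed."""
    with con:
        con.execute(
            "INSERT OR IGNORE INTO key_map (source_system, source_key) VALUES (?, ?)",
            (source_system, source_key),
        )
    return con.execute(
        "SELECT surrogate_key FROM key_map WHERE source_system = ? AND source_key = ?",
        (source_system, source_key),
    ).fetchone()[0]


# The same source identifier always maps to the same surrogate key ...
assert surrogate_key(con, "crm", "C-1001") == surrogate_key(con, "crm", "C-1001")
# ... while identical keys from different source systems receive different surrogates.
assert surrogate_key(con, "erp", "C-1001") != surrogate_key(con, "crm", "C-1001")
```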

10. Analysis / Reporting


The purpose of a data platform is to prepare raw data for analysis and store this data for longer histories. Analyses can be conducted in a variety of ways.

A variety of Business Intelligence Tools (BI Tools) are concerned solely with the task of creating analytical and human-readable data extracts. To prepare data chunks for presentation, a data platform provides features to create data extracts and aggregates from the larger data stock.

Answering specific business questions by intelligently querying the data stores requires a great deal of user proficiency in analytical query languages. BI Tools aim to simplify these tasks by providing point-and-click interfaces to answer basic questions such as “Number of visitors per month in-store” or “Sum of revenue in region X”. These tools also enable users to visualize the information via comprehensive graphics. In almost all cases, power users still want to be able to bypass these tools and conduct their own queries. Popular examples for BI-Tools include Tableau, Qlik, Looker, Chartio, and Superset, among many others.

11. Data Science

Training machine learning models is a requirement that today’s data platforms have to serve. The most sophisticated methods are implemented not in SQL but in Python or R, together with a wide variety of specialized libraries such as NumPy, Pandas, scikit-learn, TensorFlow, PyTorch, or even more specialized libraries for natural language processing or image recognition.

Since these tasks can be computationally demanding, extra compute hardware is often required in addition to the existing analytics hardware. While this opens up a large variety of tools for you to pick from, you are also facing the challenge of hosting and managing compute resources to back the potentially demanding machine learning jobs.

12. External access / API

All the collected data on the platform is there to be used for different purposes. Possible channels that are considered here are:

a) SQL access for direct analysis, e.g. by BI tools

b) API access (REST request) as a service for websites or apps

c) Notifications via Text Push or Email for end users or administrators

d) File exports for further processing or data delivery to other parties

13. Usability

The usability of the platform depends on the targeted audience. The main concern here is how easy it is to create and manage objects (such as users, data warehouses, tables, transformations, reports, etc. ) in the platform. Often there exists a trade-off between the level of control a user gets and the level of simplicity. Here we have to distinguish between the functionality that the platform provides and the user-generated content inside the platform. In most cases, the user-generated content requires the use of code, since the whole subject of data engineering and analysis is by nature complex and requires a high level of expressiveness.

14. Multi-user workflow

This category evaluates the support for user interactions and the sharing of work and data. This aspect involves real-time updates of user actions, collaborative work, sharing, and role assignments, as well as a way to discuss and comment on the platform.

15. In-platform documentation

A data platform hosts a high degree of custom complexity, built up by a multitude of participating users over a longer period.

This requires detailed documentation of the user-provided content. To properly evaluate data platforms, we also assess how different platforms support this functionality. Documentation can always be prepared outside of the platform, but this implies the risk of information divergence as external documentation quickly becomes outdated.

16. Documentation

All platforms require a certain degree of user proficiency. Proper documentation detailing the platform features is therefore required for the professional use of a platform.

17. Security

Data platform security can be separated into the security of storage (data at rest), interaction (data in transit), and access control.

18. Cost structure

We identify three major cost drivers that help us to evaluate data platforms:

  1. Licenses
  2. Infrastructure (Hardware)
  3. Staff

Today, most of the software stack can be implemented in high quality using open-source software. Licensed software, however, usually requires less maintenance effort and less low-level system know-how.

Compute hardware can be rented from cloud providers on a pay-per-use basis. The same applies to storage infrastructure.

To estimate your hardware costs, you need to consider an infrastructure covering the following components:

  • Database
  • Data Transformations
  • Analytics
  • Data Science
  • Automation
  • Monitoring
  • Hosting of Content

Even though the database is usually the largest component of a data platform, it is by far not the only one.

Conclusion


The evaluation of data platforms can in no way be reduced to their underlying core database. Instead, data platforms should be seen as ecosystems of services that require perpetual balancing.

Our proposed list of 18 criteria provides a basic entry point for evaluating data platforms in terms of their suitability as long-term, manageable data platforms. These criteria are primarily relevant for organizations aiming at the aggregation of longer data histories for more comprehensive statistics and forecasts.


