This article provides a quick comparison and feature assessment of three cloud data platforms: Snowflake, Panoply, and Repods.
Snowflake, Panoply, and Repods are cloud services that allow you to ingest, process, store, and access data in a managed cloud infrastructure. Each platform provides integrated compute and storage resources for data as part of the service. We consider these the defining features of a cloud data platform: they distinguish cloud data platforms from other cloud data services that let you present or process data in the cloud but do not let you store large amounts of data. The following comparison helps you understand how these three platforms, each with its distinctive focus, stand in relation to one another.
Although storing and processing data is at the heart of every cloud data warehouse effort, this is only the beginning of a far more complex process. More functionality is needed if a cloud data platform is to serve as a long-term manageable basis for analytics. To provide a good overview of the data management tasks involved, we have prepared a list of criteria covering the data engineering and analytics cycle. Snowflake, Panoply, and Repods do not target the same audiences and feature sets; therefore, the platforms are not directly comparable in all aspects. The assessment of Panoply and Snowflake is based solely on publicly available information.
Our list of criteria includes the following items:
- Architecture (infrastructure, scaling, database)
- Import Interfaces
- Data Transformation (ETL)
- Process Automation
- Multi-User Workflow
- Data Historization
- Data Versioning
- Surrogate Key Management
- Analysis / Reporting
- Data Science
- External Access / API
- In-Platform Documentation
What features to consider in a cloud data platform comparison
1. Architecture
Panoply uses the Amazon Redshift data service together with the Elasticsearch database, Amazon S3 storage, and a Spark compute architecture. Amazon Redshift is a scalable database with roots in the PostgreSQL database architecture but with added cluster abilities. It runs solely as an Amazon Web Service. This architecture allows for online scaling by adding more nodes to the cluster. Different Panoply clients share the same infrastructure, so the highly demanding query loads of one client could affect the query performance of another client. Panoply manages the placement of your database and could potentially locate it on a separate Redshift cluster. With Panoply, you can create multiple databases, each of which is an Amazon Redshift database. Databases, in this sense, have separate storage areas but share the same query engine (i.e., the same DBMS).
Snowflake does not disclose its underlying architecture layers in detail. Here, too, we have an online-scaling platform with a clear separation of storage and compute resources. Snowflake allows you to create and manage multiple data warehouses in one account. You can configure your compute cluster sizes per warehouse in detail and even configure online auto-scaling for each warehouse. In Snowflake, you can scale up (more resources on one machine) as well as scale out (more machines), both without interruption of service. Data warehouses in Snowflake do not share compute resources, ensuring stable performance for each warehouse. Snowflake provides direct database access for external tools, as if it were a local database.
Repods' underlying architecture consists of native PostgreSQL (version > 10) and TimescaleDB for large amounts of time-partitioned data. Storage is managed and scaled through dedicated storage clusters delivering optimal IO speed and online petabyte scaling. You can create, manage, and share multiple data warehouses per account. For stable query performance, the different data warehouse instances on the platform rely on dedicated resources not shared with any other instance in the cluster.
2. Import Interfaces
We categorize import interfaces into four categories:
I. Files
Files are still the most common form of data.
II. Web Services
Plenty of web services with relevant data are available online.
III. Databases
Although many organizations store their data in traditional databases, in most cases direct database access is not exposed to the internet and therefore remains unavailable to cloud data platforms. Web services can be placed in between on-premise databases and cloud services to handle security aspects and access control. Another alternative is the use of SSH tunneling over secure jump hosts.
IV. Realtime Streams
Data can also arrive continuously as realtime streams, e.g. as messages published to topics on message routers.
Panoply features plenty of import options in all four categories. Panoply, however, can neither pull files from a cloud bucket or an SFTP server on an automated schedule, nor request a RESTful URL on a schedule.
Snowflake focuses on loading files only but allows you to load files from cloud storage including Amazon S3 or Microsoft Azure. With Snowflake, you can observe the sources for the arrival of new files and automatically load them.
Repods enables you to upload files, load files from S3 Buckets, or load data from external SFTP servers including the automatic load of new files. Repods does not provide you with a separate interface for all possible services on the web but gives you a generic API to schedule web requests from any kind of service (e.g. Socrata). You can subscribe and listen to topics on message routers (currently WAMP only) to ingest data in configurable micro-batches. With Repods, you can currently import data from all categories listed above except databases.
3. Data Transformation (ETL)
Data imported into the data platform usually has to undergo some data transformations before it can be used for analysis. This process is traditionally called ETL (Extract, Transform, Load). Data transformation processes usually create a table from the raw data, assign data types, filter values, join existing data, create derived columns/rows, and apply all kinds of custom logic to the raw data. Creating and managing ETL processes is sometimes called data engineering. This is the most time-consuming task in any data environment. In most cases, this task takes up to 80% of the overall human effort. Larger cloud data platforms can contain thousands of ETL processes with different stages, dependencies, and processing sequences.
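As a minimal illustration of such a transformation step, the following sketch uses SQLite as a stand-in engine; the table names, columns, and filter logic are hypothetical and not specific to any of the three platforms:

```python
import sqlite3

# Raw data arrives untyped; the transformation assigns types, filters
# invalid rows, and derives a new column, as described above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1001", "19.99", "FR"), ("1002", "not_a_number", "DE"), ("1003", "5.00", "FR")],
)

con.execute("""
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER)   AS order_id,
           CAST(amount   AS REAL)      AS amount,
           country,
           CAST(amount AS REAL) * 1.19 AS amount_gross  -- derived column
    FROM raw_orders
    WHERE amount GLOB '[0-9]*.[0-9]*'  -- filter out non-numeric values
""")
rows = con.execute("SELECT order_id, amount_gross FROM orders ORDER BY order_id").fetchall()
print(rows)  # the invalid row 1002 is filtered out
```

Real platforms run such statements in their own SQL dialects and materialize or virtualize the result as discussed below.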
In Panoply you use code to create data transformations. These transformations either provide virtual data results that are recomputed each time you access the data (“Views”) or are materialized to save the recompute effort for each access. If the data is materialized, it has to be manually refreshed to update the information. For instance, if new data is available in the sources, you have to hit the refresh button to execute the transformation. Depending on the size of the sources and the complexity of the transformation, you may want to store your intermediate results in a dedicated results table.
Companies often need to load new data on top of existing data in a table. Key here is the correct incremental load of the new data. Depending on the requirements, this effort can involve very complex transformations (see the section on Historization below). Panoply does not offer specific support for such historization processes.
Snowflake has a similar approach to data transformations. You can use the platform's SQL dialect to implement data transformations in the form of SQL queries and then materialize them in new tables as you see fit. In Snowflake, you get low-level control over your data objects, comparable to working with a plain database. You can create tables, indexes, views, queries, or partitions much as in other traditional database systems such as PostgreSQL, MySQL, Oracle, or DB2. On the one hand, you can create powerful queries, update and insert statements, and get fine-grained control over every aspect of your transformations.
On the other hand, however, you need the know-how of a database administrator to maintain the system, as would be the case with traditional databases. Further, the platform includes the advanced time-travel feature that allows you to query a table as seen at a specific point in time (up to 90 days back). In the Postgres database, this feature was discontinued due to its high performance overhead and complexity. All this being said, however, this cloud data platform does not come with out-of-the-box support for automated historization of data.
In Repods, you create data transformations with the help of so-called Data Pipes that transform raw data into a specific data warehouse table. Transformations are also created using PostgreSQL queries. In these queries, you do not have to re-implement data insert strategies for each transformation because the actual insert into the target table is always handled by an integrated post-process that applies complex processes such as deduplication, key generation, historization, and versioning. This allows you to focus on the business content of your transformations and not worry about database management tasks.
4. Process Automation
When you have many sources, targets, and multiple data transformation processes in between, you also have numerous dependencies, and these come with a certain run-schedule logic. The automation of processes is part of every data platform and involves a variety of processes of increasing complexity. For the scheduling of processes alone, a variety of dedicated tools such as Apache Airflow, Automate, or Control-M are available.
Process automation also requires you to manage the selection of data chunks that are to be processed. For instance, in an incremental load scenario, every process execution needs to incrementally pick specific chunks of source data to pass on to the target. Data Scope Management is usually implemented by a metadata-driven approach. There are dedicated metadata tables that keep track of the process state of each chunk and can be queried to coordinate the processing of all chunks.
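The metadata-driven approach described above can be sketched in a few lines; the chunk states and function names here are hypothetical, not any platform's actual mechanism:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """One row of the metadata table: the process state of one data chunk."""
    chunk_id: int
    state: str  # "new" | "processing" | "done"

# Metadata table tracking which source chunks have been processed so far.
scope = [Chunk(1, "done"), Chunk(2, "done"), Chunk(3, "new"), Chunk(4, "new")]

def next_chunk(scope):
    """Pick the oldest unprocessed chunk for the next incremental load."""
    for chunk in sorted(scope, key=lambda c: c.chunk_id):
        if chunk.state == "new":
            chunk.state = "processing"  # claim the chunk so no one else takes it
            return chunk
    return None  # nothing left to process

c = next_chunk(scope)
print(c.chunk_id)  # the next chunk to pass on to the target
```

In a real system this metadata lives in database tables and is queried transactionally, but the coordination logic follows the same pattern.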
Panoply and Snowflake do not provide any support for automation out of the box and require you to use other tools for this task.
Repods uses a data-driven approach to support process automation. The system keeps track of all dependencies and resolves them automatically. Repods even looks into the transformation results to determine the further course of action. Each Pipe in Repods can be automated to run whenever new data is available in the sources and the target is not currently blocked for the insert. Repods also comes with a built-in Scope Management process. For each combination of source and Pipe, a dedicated scope view is created to manage the next chunk to present to the pipe for processing. Scope Management can be turned off for each Source and Pipe individually. Scopes can be created based on chunk number or based on time ranges.
5. Process Monitoring
Larger data warehouse systems can easily contain hundreds of tables with hundreds of automated ETL processes managing the data flow. Errors appearing at runtime are almost unavoidable, and many of them have to be handled manually. With this amount of complexity, you need a way to monitor the processes on the platform.
In Panoply, you can view a history of executed queries and executed Jobs. Also, you get an alert view listing problems in your sources, services, and other assets. The query log polls the server for updates every few seconds (no real-time refresh).
Snowflake comes with monitoring reports that provide aggregated information on the number of active queries in a certain time range and their resource consumption. In Snowflake, you can access the internal metadata tables with SQL to extract information on any query activity in the system. Snowflake also shows historic query execution in the web interface.
Repods offers a real-time graphical overview showing details on every Pipe that is currently executed or has been executed in the past. This overview is updated in real time and allows you to drill through thousands of (historical) Pipe executions. You can also search for specific Pipe executions in a search drawer. Further, Repods allows you to analyze the system log with SQL to extract information about a specific Pipe execution.
6. Usability
The usability of the tools depends on the targeted audience. The main concern here is how easy it is to create and manage objects (such as users, data warehouses, tables, transformations, reports, etc.) in the platform. There is often a trade-off between the level of control a user gets and the level of targeted simplicity: increased simplicity entails less control. All three platforms, Panoply, Snowflake, and Repods, make use of a cloud environment and provide user access through a browser-based web application. Also, all three platforms follow a hybrid approach using code alongside a point-and-click interface.
Snowflake provides the user with a level of control similar to what you would expect from a bare-bones database system. Most of this control is handled by writing code in a code panel (imports, tables, views, indexes, etc.). Major tasks like the creation and deletion of whole data warehouses and users can be handled through web forms. The web interface does not respond to activities on the platform and handles updates by refreshing its views in regular time intervals (the query history refreshes every 10 seconds). Snowflake is only available in English.
In Panoply, the creation of objects is handled via web forms. The setup of imports can be handled through web forms as well, while transforms and analytics are entered through code panels. Panoply is only available in English.
In Repods, custom transformation logic is handled in code panels, while the creation of objects is handled through web forms. For analytics in Repods, you can use a workbook approach to create custom analytical queries (comparable to Jupyter data science notebooks). Alternatively, you can use a built-in OLAP tool to drill through the data model with point-and-click functions for quick insight. You can also use code to create custom web visualizations. The Repods web application is a real-time app based on a bidirectional messaging system: every user interaction throughout the platform and every backend system activity is immediately visible to all users without refreshes in the front end. The Repods platform is available in English and German, with other languages following soon (Spanish, Portuguese, etc.).
7. Multi-User Workflow
This category evaluates the support for user interactions and the sharing of work and data. All three platforms allow multiple users to work together in a data environment.
Panoply offers one infrastructure environment for all users. However, users can create multiple Amazon Redshift databases on this infrastructure. Panoply has no features to facilitate communication between users on the platform, and there are only limited options for documenting data objects and transformations in the platform. You can manage teams if you are assigned the role Admin or Editor.
In Snowflake, multiple data warehouses with fine-grained access control options can be managed per account. Similar to the access control you have on database architectures, you can create custom roles with custom privileges on all objects in the data warehouses.
In Repods, you can manage multiple data warehouses (called Data Pods) per user and provide access for other users (similar to the way the GitHub platform is organized). You can allow users to ‘follow’ and/or self-register to a Data Pod. User interactions are instantly visible to all other users. The owner of a Data Pod can assign the access roles Viewer, Reporter, Developer, Admin, and Owner. Each user can be assigned one of these roles in each Data Pod. Inside the Pod, tables can be grouped using Labels, and each user can receive individual authorizations based on Labels.
8. Data Historization
The need to manage longer data histories is at the core of each data platform effort. The data warehousing task itself could be summarized as the task of merging separate chunks of data into a homogeneous data history. As data is naturally generated over time, there arises the need to supplement an existing data stock with new data. Technically speaking, time ranges in tables are tracked using dedicated time range columns, and data historization is the efficient management of these time ranges when new data arrives. It is the most common approach for managing such emerging data histories. Data historization differs from data versioning in that historization is concerned with real-life timestamps, whereas versioning is usually concerned with technical insert timestamps (see the section below).
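The time-range management described here can be sketched roughly as follows (an SCD2-style merge). The field names and the skip-on-no-change rule illustrate the general technique, not any specific platform's implementation:

```python
import datetime as dt

# An open-ended validity range is marked with a far-future date.
OPEN_END = dt.date(9999, 12, 31)

history = [
    # key, attribute value, valid-from, valid-to
    {"id": 1, "city": "Paris",  "from": dt.date(2024, 1, 1), "to": OPEN_END},
    {"id": 2, "city": "Berlin", "from": dt.date(2024, 1, 1), "to": OPEN_END},
]

def historize(history, snapshot, as_of):
    """Merge a new snapshot into the history, writing records only on change."""
    current = {r["id"]: r for r in history if r["to"] == OPEN_END}
    for rec_id, city in snapshot.items():
        cur = current.get(rec_id)
        if cur is not None and cur["city"] == city:
            continue  # unchanged: no new record, keeping the history small
        if cur is not None:
            cur["to"] = as_of  # close the old validity range
        history.append({"id": rec_id, "city": city, "from": as_of, "to": OPEN_END})

# New data arrives: id 1 is unchanged, id 2 has moved.
historize(history, {1: "Paris", 2: "Hamburg"}, as_of=dt.date(2024, 6, 1))
print(len(history))  # 3 records: id 1 untouched, id 2 now has two ranges
```

The space efficiency mentioned below comes from the `continue` branch: unchanged data never produces new records.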
Repods provides support for the historization of data out of the box. Data historization is applied after the data transformation and before the insert into the data warehouse table. The algorithm is maximally space-efficient in the sense that it minimizes the number of records to represent the histories. For data that does not change significantly over time, this approach can greatly reduce the size of the tables and therefore have a positive impact on overall query performance.
The remaining two platforms allow for the historization of time ranges but these have to be managed by user-provided transformation logic.
Panoply provides a feature called history tables which will be discussed in the upcoming section Data Versioning.
9. Data Versioning
By versioning data, you can track data corrections over time for the later recovery of old analyses. Versioning allows you to apply non-destructive corrections to existing data. When comparing versioning capabilities, you have to consider the ease of creating versions and the ease of recovering or querying versions. Versioning can be handled on different system levels:
a) Create version snapshots on the storage subsystem (similar to backups).
b) The underlying database system might come with support for version tracking.
c) Versioning might be handled by the data warehouse system.
d) Versioning can be implemented as a custom transformation logic in userspace.
Variant d) is a technical and complex process, but it can be implemented in all three systems.
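Variant d) can be sketched in a few lines; this is a generic insert-only pattern with technical insert timestamps, not any platform's built-in mechanism:

```python
import datetime as dt

versions = []  # insert-only version store: (key, value, inserted_at)

def upsert(key, value, now):
    """A non-destructive correction: insert a new version, never update in place."""
    versions.append((key, value, now))

def as_of(ts):
    """Reconstruct the data as it looked at technical time ts."""
    latest = {}
    for key, value, inserted_at in versions:
        if inserted_at <= ts:
            latest[key] = value  # rows are in insert order, so later wins
    return latest

upsert("revenue_q1", 100, dt.datetime(2024, 4, 1))
upsert("revenue_q1", 120, dt.datetime(2024, 5, 1))  # correction, old row kept

print(as_of(dt.datetime(2024, 4, 15)))  # the value before the correction
print(as_of(dt.datetime(2024, 5, 15)))  # the corrected value
```

This is what makes old analyses recoverable: querying with an earlier timestamp reproduces exactly the data those analyses saw.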
Panoply provides continuous backups as a built-in versioning option. You can recover a point-in-time snapshot at any time within the backup time frame. However, you cannot query versions in the active system using this method. In Panoply, you can also create history tables: insert-only companion tables to the original tables. When updates to the original table occur, the system automatically inserts the existing records into the companion table, enriched with time ranges representing the changes. This can be understood as a versioning process because the managed time ranges are based on technical insert timestamps and not on business validity timestamps. These history tables allow you to query different versions of the same data with SQL.
With Snowflake you can a) easily create snapshots of all data and b) get the time-travel feature provided by the database system. The latter allows you to flexibly query data with SQL as seen at a certain point in time. This feature is available only in the enterprise edition and covers only 90 days of history. Long-term versioning logic has to be implemented by the users.
The Repods platform offers versioning support that is specifically designed for the data warehouse use case. You can use freeze timestamps to ensure the recoverability of your data as seen at the time of the freeze. Also, using simple SQL, you can flexibly query the data as seen at old freeze times (infinite number of days back). Versioning often comes with the creation of multiple version records during load times. This leads to increased table size which hurts query performance. To avoid this, you can set a second date to prevent the versioning of any data newer than this second date. For instance, this feature allows you to freeze all data before October while loading October data.
10. Surrogate Key Management
Data platforms are used to consolidate data from many sources with different identifiers for the respective objects. This creates the need for new key ranges for the imported objects and the need to maintain them throughout consecutive imports. These new keys are called surrogate keys. Creating and maintaining these keys efficiently is no simple task.
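The basic idea can be sketched as follows; the key-generation scheme shown is a generic illustration, not any platform's actual mechanism:

```python
key_map = {}    # alternate-key tuple -> surrogate key
next_key = [1]  # simple sequence generator

def surrogate_key(*alternate_key):
    """Return the existing surrogate key for this alternate key,
    or generate a new one on first sight."""
    if alternate_key not in key_map:
        key_map[alternate_key] = next_key[0]
        next_key[0] += 1
    return key_map[alternate_key]

# Consecutive imports referencing the same customer yield the same key,
# while the same identifier from a different source gets its own key.
first  = surrogate_key("source_a", "customer_42")
second = surrogate_key("source_a", "customer_42")
other  = surrogate_key("source_b", "customer_42")
print(first, second, other)  # 1 1 2
```

The hard part in practice is doing this efficiently and atomically for millions of rows per load, which is why built-in support matters.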
Repods is the only tool providing extensive support for internal surrogate key creation and management. You can select a few columns as the alternate keys and the system will automatically generate a surrogate key from them. The alternate keys will then be mapped to the surrogate keys in every load into a Core Table.
11. Analysis / Reporting
The purpose of a data platform is to prepare raw data for analysis and store this data in longer data histories. Analyses can be conducted in a variety of ways.
A variety of Business Intelligence Tools (BI Tools) are concerned solely with the task of creating analytical and human-readable data extracts. To prepare data chunks for presentation, a data platform provides features to create data extracts and aggregates from the larger data stock.
Panoply and Snowflake
Panoply and Snowflake provide you with tools to create analytical aggregates that are similar to the tools you use for creating data transformations. This involves SQL code editors. Snowflake only saves query results for 24 hours. Furthermore, both Panoply and Snowflake allow you to access the platform with other tools via ODBC (and similar) – secured by username and password. You can use all kinds of dedicated analysis tools such as Jupyter workbooks or Power BI to analyze your data. However, you cannot use these platform resources to run Python or train machine learning models. Also, you cannot integrate the results of these workbooks into your data workflow inside the platform.
Repods allows access to the platform via external tools and additionally provides its own workbooks for creating data stories and analytical extracts. Workbooks are sheets consisting of many SQL and Python code cards together with markdown cards. Besides workbooks, Repods also contains an OLAP (online analytical processing) interface that allows you to drill through the data model using a point-and-click approach to create report results. The tight integration of workbooks and reports into the platform allows you to include your OLAP reports and workbook cards in your automation to create near real-time reports and statistics.
Repods enables you to create custom web assets called Infographics right on the platform. This allows you to generate custom web visualizations using the state-of-the-art presentation library d3.js. The system pushes the data from the data platform right into the browser of every viewer using a bidirectional messaging technology. In this way, Repods enables you to create fully automated data pipelines, beginning with sourcing the data, going through ETL transformations, and rounding the process off with the visual presentation of the data in the user's browser. Updates take place in near real time. Infographics can be hosted on Repods as a very lightweight web request endpoint and can be embedded as an iframe into any website.
12. Data Science
Training machine learning models is a requirement that today’s data platforms have to serve. Sophisticated methods are implemented using not SQL, but Python or R together with a wide variety of specialized libraries such as NumPy, Pandas, SciKit Learn, TensorFlow, PyTorch, or even more specialized libraries for natural language processing or image recognition.
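As a minimal, self-contained illustration of the kind of model training meant here, the following fits a one-variable linear model in plain Python; real workloads would use libraries such as scikit-learn or TensorFlow:

```python
# Tiny training set: roughly y = 2x with some noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.1, 7.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares solution for slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
print(round(slope, 2), round(intercept, 2))
```

Even a trivial model like this needs compute somewhere; the question discussed next is whether that compute runs on the platform or outside of it.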
Panoply and Snowflake currently provide no way to conduct data science tasks directly on the platform. Their strategy is to access the platform with dedicated tools and conduct all data science tasks outside of the platform. While this opens up a large variety of tools to pick from, you also face the challenge of hosting and managing the compute resources backing potentially demanding machine learning jobs.
Repods features Python cards in its workbooks to allow for the creation of machine learning models directly inside the platform. This strategy enables you to leverage the platform and all its computational power for machine learning.
13. External Access / API
Here we want to compare how the three platforms present their content to external consumers. Possible channels that are considered here are:
- a) SQL Access
- b) API Access (REST Request)
- c) Notifications via Text Push or Email
- d) File Exports
Panoply provides direct ODBC (or similar) access to the platform, secured by username and password. This enables almost all standard Business Intelligence tools, like Looker or Tableau, to connect to Panoply and use the data. Panoply tracks system alerts but does not send them as notifications to your phone or email.
Snowflake also provides SQL access for other tools to consume the data inside of Snowflake. Additionally, Snowflake provides the possibility to export files into Cloud buckets (AWS S3 or MS Azure). Snowflake enables you to set up email notifications for general Snowflake service availability only. You cannot set up system push notifications, nor can you set up content-based notifications that alert you if, for example, the number of customers in France exceeds 1000.
Repods allows you to create API access keys and control access to prepared data extracts within the platform. You can then hand out these access keys to external consumers to access those resources via a standard REST request in any programming language. For example, you can connect the Business Intelligence tool Power BI to the Repods platform to create compelling dashboards. File exports of unlimited size are supported via direct downloads from the browser. No kind of push notification is currently supported.
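Such a REST request could look roughly like the following; the endpoint URL, header scheme, and access key are purely hypothetical placeholders, so consult the platform's API documentation for the real details:

```python
import urllib.request

# Hypothetical values: replace with the real endpoint and your actual key.
ACCESS_KEY = "example-access-key"
url = "https://example.com/api/extracts/monthly_revenue"

# Build an authenticated request for a prepared data extract.
req = urllib.request.Request(url, headers={"Authorization": f"Bearer {ACCESS_KEY}"})
# response = urllib.request.urlopen(req)  # would perform the actual request
# data = response.read()
print(req.get_header("Authorization"))
```

Any language with an HTTP client can issue the same request, which is the point of exposing extracts over a plain REST interface.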
14. In-Platform Documentation
A data platform is used to implement a high degree of custom complexity with a multitude of participating users over a longer period.
This requires detailed documentation of the user-provided content. Here we assess how the platforms support this task. Documentation can always be prepared outside of the platform. This, however, implies the risk of information divergence as external documentation quickly becomes outdated.
Panoply and Snowflake provide a few one-line description fields in their web forms, but no facility for full documentation has been made available yet. Other than that, you can nevertheless document code in the form of comments within the code itself.
Repods allows you to document all tables and transformation processes, each in a markdown panel closely linked to the respective object. You can even document each column of the core tables, which is considered especially important for users of the platform because the column definitions are tightly linked to the meaning of the data.
15. Online Documentation
All platforms require a certain degree of user proficiency. Proper documentation detailing the platform features is therefore required for the professional use of a platform.
Snowflake provides the most extensive online documentation, comparable to the way databases or programming languages are documented. For Snowflake, this is necessary because most of the functionalities are provided through the use of Snowflake’s own dialect of SQL. Therefore, large parts of the documentation are concerned with the Snowflake SQL capabilities. Additionally, Snowflake provides many guides and YouTube videos to get you introduced to the platform.
Panoply offers less detailed online documentation. Panoply exposes the underlying Amazon Redshift tables directly to users for SQL access, so the Amazon Redshift SQL documentation is the valid reference for the SQL dialect of Panoply. Also, Panoply offers general guidance for data warehousing and a few explainer videos on YouTube.
16. Security
Data platform security can be separated into the security of storage (data at rest), interaction (data in transport), and access control.
All three platforms encrypt data at rest and in transport. All three provide password-based user authentication. Snowflake additionally provides the more secure two-factor authentication mechanism.
Panoply and Snowflake allow you to access the underlying databases through ODBC-like interfaces with direct server connections. Exposing database ports on the internet can be considered a security risk. To mitigate this, you can (and should) use the more secure SSH tunneling method with a dedicated jump host to access these platforms.
17. Pricing
Panoply only provides flat monthly pricing packages. Snowflake scales prices depending on the infrastructure size of the used platform. Repods offers both models: flat packages as well as usage-based pricing. All three platforms require you to contact them for the pricing of larger installations.
For testing, Snowflake and Panoply provide a free, time-limited trial of their platform. Repods offers one small free data warehouse, called a Data Pod, for an unlimited time.
Cloud data platform comparison: the takeaways
To build a complete cloud data platform, both Snowflake and Panoply need to be combined with additional tools to cover the missing aspects of data warehousing. Most notably, the automation of data transformations is an important requirement of every cloud data architecture. Depending on the organizational context, users' technical proficiency also plays a key role here, as all three platforms require basic data engineering know-how and allow users to use SQL code.
Snowflake may be a good choice if you are looking for an online database architecture that is used primarily in an analytics context and you do not want to invest in system and hardware administration. The platform requires database administrators to manage it and addresses larger data environments. As stated above, Snowflake lacks many important features that distinguish a cloud data platform from a database. Since Snowflake seems quite similar to a general-purpose database in its usage patterns, a managed database on AWS may also be an option if you do not have special requirements.
Panoply is simpler than Snowflake. No system, hardware, or database administration know-how is required to use it. Like Snowflake, Panoply lacks many features that distinguish a cloud data platform from a database. Panoply appears to be best suited for less complex data environments.
Repods is also simpler than Snowflake because it does not require administrative know-how for hardware, systems, or databases. It offers a more comprehensive cloud data platform experience, since it additionally covers automation, historization, surrogate key maintenance, analytics, machine learning, and data science. Repods has a much stronger focus on building a multi-user, multi-warehouse cloud data platform environment than the other platforms. The Repods management interface is ideal for a project-oriented approach to cloud data platforms, as opposed to one big central data warehouse architecture. In sum, Repods could be described as a data catalog consisting of data warehouses.
Disclaimer: The author of this article is from the Repods platform.