The Path to Becoming a Data Scientist

What Skillsets Do You Need?

Data science surfaced as a term in the 1960s to denote a line of work that deals specifically with making sense of large volumes of data. In 1962, John Tukey published “The Future of Data Analysis” to call for a rethinking of academic statistics. Tukey, who worked at the intersection of industry and academia (at both Bell Telephone Laboratories and the Department of Statistics at Princeton University), alluded to the existence of an as yet unrecognized science that studies ‘learning from data’, or ‘data analysis’. He may have been the first data scientist.

But it wasn’t until the early 2000s that the term became more specialized, not least because of William S. Cleveland’s paper “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.” John Chambers and Jeff Wu have also been credited with expanding academic statistics towards data science, increasing the emphasis on data preparation, presentation, and prediction rather than statistical modeling and inference. The first peer-reviewed periodical in the field, the CODATA Data Science Journal, came out in 2002 under the International Council for Science’s Committee on Data for Science and Technology (CODATA). In 2003, Columbia University followed up with The Journal of Data Science.

Fast forward to 2010, when the term gained true popularity outside of academic and industry circles. That year, Mike Loukides, the Vice President of Content Strategy at O’Reilly Media, published the article “What is data science?”, declaring that “the future belongs to companies and people that turn data into product”. Some ten years later, there is still no single agreed definition of what data science is.

Over the years, data analysts, statisticians, and computer scientists have built computational environments for data analysis and data applications that help businesses use data to create something new. Definitions of data science keep evolving along with the changing landscape of technology and data handling approaches.

The Journal of Data Science continues to sport a maximally broad definition: “By ‘Data Science’ we mean almost everything that has something to do with data: Collecting, analyzing, modeling … yet the most important part is its applications—all sorts of applications.”

In a recent paper, David Donoho called data science “the really important intellectual event of the next 50 years,” predicting that “because all of science itself will soon become data that can be mined, the imminent revolution in data science is not about mere ‘scaling up,’ but instead the emergence of scientific studies of data analysis science-wide.”

The Data Scientist Today

Three forces shape our current innovation landscape: ever more analytical and compute power, a far greater exchange and use of accessible data, and enhanced decision-making capabilities based on data science approaches. The confluence of these three forces is what drives the continued evolution of the interdisciplinary data scientist. Data science is particularly receptive to movement across and between sectors. This has given rise to so-called “braided careers”: the pursuit of dual or multiple careers in academia and industry, in academia and the public sector, or in industry and the public sector.

The Data Scientist in Skill Clusters

All of this calls for new and perhaps unprecedented combinations of skills across the board. But what are the basics?

Hard Skills

  • Years of Rigorous (Formal) Training. To develop the depth and breadth of knowledge associated with these roles, most data scientists need a solid background in the sciences. According to recent data, the most common academic fields for data scientists are Mathematics and Statistics (25%), Computer Science (20%), Natural Sciences such as Physics (20%), and Engineering (18%). Over 90% of data scientists have at least a Master’s degree and 48% have PhDs.
  • Python & Co. A recent study on the dynamics of data science skills in the UK reveals that Python is the most popular coding language, with some 43% of data science job ads posted in the period 2013-2018 listing it as a skill requirement. Other top skills include SQL (27%), Machine Learning (26%), Big Data (25%), Hadoop (19%), and a strong background in research (20%). The top skills clusters are scripting languages (87%), Big Data (63%), SQL Databases (57%), Data Analysis (39%), Statistics (27%), Statistical Software (33%), and software development principles (26%).

Soft Skills

  • Intellectual curiosity. Data science is driven by discovery with a healthy amount of invention. This is not simply about providing answers to existing questions and looking into that which already exists. It is about the ability to generate new questions and delve into data in new, imaginative ways. The intellectually curious remain open-minded, inquisitive, and genuinely intrigued.
  • Communication. A data scientist must be an outstanding communicator who can package business insights in impactful language. Apart from having the scientific mindset to tackle a problem, data scientists have to deliver the outcomes to their businesses and to the public. They collaborate with many functions across the organization in order to fully understand the scope of the problems to be addressed. The ability to articulate one’s findings to a non-technical audience is crucial.
  • Business acumen. This is about understanding the impact of findings on business goals. Data scientists need a deep understanding of the industry and of the business problems to be solved. Added to this is the ability to come up with new ways of dealing with problems and to identify which problems are actually worth solving.

The Tasks of a Data Scientist

What are the tasks of a data scientist?

  • Harvesting, Preparation, Exploration, Transformation. Collecting data that arrives in a wealth of formats, drilling through the data, carrying out data transformations, and preparing the data for tasks such as reporting.
  • Data Representation. Presenting the data in a single format that lends itself to analysis.
  • Data Modeling. Representing the complexity of a software system design in accessible diagrams.
  • Data Visualization. Using visualization tools to represent the data in an accessible way that allows for the detection of trends, patterns, and outliers in the data.
  • Going meta. Data scientists can also do science about data science. They locate recurring analysis workflows, measure the effectiveness of established workflows in terms of various performance metrics, and look for emergent phenomena in their data analyses.
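
The first tasks in this list can be illustrated with a minimal Python sketch that uses only the standard library. Everything in it is an assumption made for illustration: the sample data, the field names `sensor` and `reading`, the function names, and the choice of the modified z-score (based on the median absolute deviation) as one possible way to flag outliers.

```python
import csv
import io
import json
from statistics import median

# Hypothetical sample data in two formats, standing in for real sources.
CSV_SOURCE = "sensor,reading\na,10.1\nb,9.8\nc,\nd,10.3\n"
JSON_SOURCE = '[{"sensor": "e", "reading": 10.0}, {"sensor": "f", "reading": 55.0}]'


def harvest():
    """Harvesting: collect records from both formats into one list of dicts."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(json.loads(JSON_SOURCE))
    return rows


def prepare(rows):
    """Preparation: drop records without a reading and convert readings to float."""
    cleaned = []
    for row in rows:
        value = row.get("reading")
        if value in (None, ""):
            continue
        cleaned.append({"sensor": row["sensor"], "reading": float(value)})
    return cleaned


def find_outliers(records, threshold=3.5):
    """Exploration: flag readings whose modified z-score exceeds the threshold."""
    values = [r["reading"] for r in records]
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread in the data, so nothing can be flagged
    return [r for r in records if 0.6745 * abs(r["reading"] - med) / mad > threshold]


records = prepare(harvest())
outliers = find_outliers(records)
```

In this sketch, both sources end up in a single list-of-dicts representation, the record with the empty reading is dropped during preparation, and the reading of 55.0 is the only one flagged. A real pipeline would more likely rely on dedicated data analysis and visualization libraries.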


How about the areas of work?

  • Data infrastructure. This involves data ingestion, availability, access, and running environments to enhance the workflows of data scientists.
  • Data engineering. This involves the determination of data schemas to support data aggregation, data cleansing, and extract, transform, load (ETL) tasks, as well as dataset management.
  • Data quality and data governance. This involves the processes and guidelines that are implemented to make sure the data is standardized, correct, monitored, documented, and secured.
  • Data analytics engineering. This involves scaling analytics through applications built for internal use.
  • Data-product product manager. Here you create products that internal customers can use within their workflows and that allow them to incorporate measurements created by data scientists.
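
To make the data engineering and data quality items above more concrete, here is a small Python sketch of an ETL step with a schema check, written against the standard library's `sqlite3` module. The schema, the table name `readings`, the column names, and the functions `validate` and `run_etl` are all hypothetical and only serve as an illustration.

```python
import sqlite3

# Hypothetical schema: column name -> (expected Python type, required?)
SCHEMA = {
    "device_id": (str, True),
    "temperature_c": (float, True),
    "site": (str, False),
}


def validate(record):
    """Return a list of data-quality problems for one record (empty if clean)."""
    problems = []
    for column, (expected_type, required) in SCHEMA.items():
        value = record.get(column)
        if value is None:
            if required:
                problems.append(f"missing {column}")
        elif not isinstance(value, expected_type):
            problems.append(f"{column} is not {expected_type.__name__}")
    return problems


def run_etl(raw_records):
    """Extract raw records, transform them, and load the valid ones into SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings "
        "(device_id TEXT NOT NULL, temperature_c REAL NOT NULL, site TEXT)"
    )
    rejected = []
    for record in raw_records:
        # Transform: coerce the temperature to float so it matches the schema.
        transformed = dict(record)
        try:
            transformed["temperature_c"] = float(transformed["temperature_c"])
        except (KeyError, TypeError, ValueError):
            transformed["temperature_c"] = None
        problems = validate(transformed)
        if problems:
            rejected.append((record, problems))  # keep rejects for monitoring
            continue
        conn.execute(
            "INSERT INTO readings VALUES (?, ?, ?)",
            (transformed["device_id"], transformed["temperature_c"], transformed.get("site")),
        )
    conn.commit()
    return conn, rejected
```

A call such as `run_etl([{"device_id": "d1", "temperature_c": "21.5", "site": "lab"}, {"device_id": "d2", "temperature_c": "n/a"}])` would load the first record and return the second in the rejected list together with the reason. Keeping the rejected records, instead of silently dropping them, is one simple way to support the monitoring and documentation that data governance asks for.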


But given this demanding skillset, what drives a data scientist? Here is our list:

  • Intellectual freedom
  • Space for creative invention
  • Capacity to co-shape the innovation landscape
  • High-level open research with real-world applications
  • Recognition
  • Industry and/or academic impact that benefits the public

Curious? Find out more about our available roles and on-the-job training opportunities.

About Record Evolution

We are a data science and IoT team based in Frankfurt, Germany, that helps companies of all sizes innovate at scale. That’s why we’ve developed an easy-to-use industrial IoT platform that enables fast development cycles and allows everyone to benefit from the possibilities of IoT and AI.