
FIFA World Cup Statistics: A Data Science Perspective. Data Modeling with Repods, Part I

Data-driven historical stats from the FIFA World Cup between 1930 and 2014:

Following our tutorial series on Repods, we can now introduce this slightly more complex example. If you are just starting to work with the platform, we suggest you begin with our introductory tutorial. If you already feel confident with data modeling, enjoy the following analysis of the most widely viewed sports event in the world: the FIFA World Cup.

The dataset provided by the Fédération Internationale de Football Association (FIFA) is divided into three files containing detailed information about matches, players and the World Cups in general. The goal of this article is to create an iterative and incremental data analysis starting from a model and readapting it according to features observed in the original dataset.

0. Planning the model

Let us prepare for our data modeling task first. Before starting an analysis, as mentioned in our introductory article, it is essential to be aware of the scope of our information and to keep in mind the relations between real-world entities.

Let’s get started by breaking down the World Cup into 9 entities as displayed in Figure 0. Now we can create a quick draft by sketching the relations using a simplified Entity-Relationship model.

planning the data model before getting started
Figure 0. Planning the data model

As we don’t yet know the attributes of each entity, we simply describe each one with an ID and a Name. We then add the attributes we consider necessary for the interconnections.

We have arranged all relations around the entity Match. After all, this is what the World Cup is about. Now that we have created this overview, we can go to Repods and get to work.

Go to Repods, log in to your account (or sign up if you don’t have an account yet), and create a free Data Pod. You can call your Pod FIFAWorldCupAnalysis.

1. Importing files

Once this is done, you can download the datasets World Cup Players, World Cup Matches, World Cups from here (provided as a courtesy of the FIFA World Cup Archive to Kaggle) and import them to Repods. If you’re not sure how to do that, take a quick look at our introductory article first.

After uploading the files, you can select each of them, and click to take a look at the Raw Table, File Structure and Column Datatypes. Take a quick look at how the data is structured; carefully observe the datatypes and the possible formatting/transformations.

importing new files for data modeling

At this point, it is important to apply small transformations that help specify the appropriate data types for what we have imported. For example, the WorldCupMatches.csv file contains a “Datetime” column with values such as “12 Jun 1930 15:40”; you can convert it to a date or keep it as text.
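If you do convert it, the Postgres-style formatting functions used throughout this tutorial make this a one-liner; note that to_date only consumes the leading date portion of the text and ignores any trailing characters:

```sql
-- Parse the date portion of the text value; characters beyond
-- the format pattern ('DD Mon YYYY') are ignored by to_date.
select to_date('12 Jun 1930 15:40', 'DD Mon YYYY')
```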

But here comes the first tricky part: All dates are displayed in the current user time. However, the information about the match time is only relevant in the local time of the match (since it might indicate the weather conditions and players’ physical/psychological conditions). So how do we deal with this?

There are many options. The simplest is to accept the time modification and tell whoever uses this data that the time presented is the user time. Another option is to ignore this fact completely, which may lead to misinterpretation. As we want to keep this information, we can simply postpone the data type formatting to the Pipes Transformation and leave this field as text. After the import is finalized, we can open the raw table by clicking on the Raw Table icon and then add a “Time” field with the transformation shown in Code Snippet 1.1:

to_timestamp(substring("Datetime", 15, 5), 'HH24:MI')::time
Code Snippet 1.1 Extracting Time from Datetime Column

This way, we keep the time in its original format and can still use the date later.
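Since the parsed date and the text-derived time live in separate fields, a later pipe or report can recombine them into a full local timestamp. A minimal sketch, assuming the raw matches table is named S_WORLDCUPMATCHES (replace by your table name) and the added field is called “Time”:

```sql
-- Recombine the parsed match date with the extracted local
-- kick-off time; adding a time to a date yields a timestamp.
select
  to_date(r1."Datetime", 'DD Mon yyyy') + r1."Time" as local_kickoff
from S_WORLDCUPMATCHES r1
where r1."Datetime" is not null
```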

2. Start modeling your data

We start by creating the tables based on the draft we drew at the beginning of this article; later we will increment them with more data and details. Once you have created your Data Pod, click on Core Data (shortcut Alt+3) and then click on Add Table.

Let’s start by creating the World Cup table. Here you can choose between an Event Table and a State Table. Even though the World Cup is an event per se, we want to track the evolution and the events that happen within it, so we opt for a State Table. Matches, on the other hand, are entities well defined by a specific point in time, which is why we choose the Event Table type.

columns in the world cup table

Event tables have a default field called EVENT_DATE that we can fill in with the data from the raw sources later on.

The center of our model is the Match entity. It is the entity with the largest number of relations. We can observe this when we create it with columns that connect to Round, Team and Venue entities.

So let’s create all tables as described by the draft we have shown at the beginning of this article (Figure 0).

Since we are creating all tables at once, we can’t really establish relations until the draft is ready. The tables without relations would be shown in Repods as displayed below:

image of tables without relations

Once we’ve created the entities, we can start interconnecting them using the draft fields. Let’s start with the Match table. Go to Relations to Other Core Tables and select the columns and the tables they refer to, as shown below.

relations to other core tables

We can see that Repods creates intuitive reverse names for the relations. This is helpful when creating reports later, or when checking whether the created connections make sense semantically.

We can repeat the procedure for all entities following the model draft until reaching the model overview shown above. In this view, it is possible to see the relation names. You can observe the relations from the point of view of any entity.

tables with relations between them

Once you have completed this step, let’s start populating our entities with some data.

3. Pipes transformation

Pipes are where data modeling happens in Repods. Here we can create SQL queries that combine different sources and modify the original data to fit into our model.

To get to the core of our data modeling task, let’s go to Pipes (Alt+2) to see all created entities and imported files. You will see something like this:

image of empty data pipes

Now we can simply click on Add Pipe to populate our model. Since we have created the tables from the outset, we can toggle the button to Select From Existing Tables and select the table we need; Repods will fill in the appropriate Pipe Name automatically. After that, you can also select the correct source.

Let’s start by loading the Matches:

panel to create new data pipe

After we press Create, we are redirected to the Pipe Transformation. In this panel, we can use the Datetime field mentioned in Importing Files as the event date. You can simply copy and paste Code Snippet 2.1:

SELECT
 to_date(r1."Datetime", 'DD Mon yyyy') as EVENT_DATE,
 r1."MatchID" as "ID",                                  -- bigint
 r1."RoundID" as "RoundID",                             -- bigint
 r1."Home Team Initials" as "HomeTeam",                 -- text
 r1."Away Team Initials" as "AwayTeam",                 -- text
 null as "VenueID"
FROM S_WORLDCUPMATCHES r1                -- replace by your table name
WHERE r1."Datetime" is not null and r1."MatchID" is not null
Code Snippet 2.1 — Draft of Pipe Transformation for Matches

As you can see, we convert the day, month, and year to a database date type and use it as the EVENT_DATE. Also, at this point we don’t have anything to use as the “Venue ID”, so we explicitly insert null values here as a reminder to populate them later.

We can save this pipe transformation and flow the data into our table by unblocking the pipe. Press the button shown below:

image of pipe button

Now we can proceed by creating other pipes for the entities, following a sequence of snippets for each of them:

SELECT
 to_date(r1."Year" || '-1-1', 'yyyy-mm-dd') as VALID_FROM,
 to_date(r1."Year" || '-12-31', 'yyyy-mm-dd') as VALID_TO,
 r1."Year" as "ID",                                   -- bigint
 r1."Country" as "Name"                               -- text
FROM S_WORLDCUPS r1                -- replace by your table name
Code Snippet 2.2. World Cup Pipe Transformation

To simplify, we use the year of the event as the validity interval of the World Cup. It is also possible to derive this interval from our data, choosing the days of the first and last matches as VALID_FROM and VALID_TO, respectively.
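That derived variant can be sketched with an aggregation over the matches file; here we assume the raw table name S_WORLDCUPMATCHES from Code Snippet 2.1, and the host country “Name” would still have to come from the World Cups file (e.g. via a join on the year):

```sql
-- Derive each World Cup's validity interval from the dates of
-- its first and last matches.
select
  min(to_date(r1."Datetime", 'DD Mon yyyy')) as VALID_FROM,
  max(to_date(r1."Datetime", 'DD Mon yyyy')) as VALID_TO,
  r1."Year" as "ID"
from S_WORLDCUPMATCHES r1
where r1."Datetime" is not null
group by r1."Year"
```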

For the remaining pipes, you can copy and paste the code snippets from this file. Use the code snippets numbered from 3.1 to 3.6:

After these basic pipes, we can see how the data flows and interconnects entities within Repods. If we want to run some analyses, we need to include more fields and more data in the reports. Check out the further transformations in Code Snippets 3.7 to 3.10:

These snippets together are enough to load all the data we have from FIFA. With that, we can create Reports and Workbooks. Once the process is finalized, the pipes will look like what we have in the screenshot below:

image of data pipes flowing


Feel free to play with the Analysis section to create Reports and Workbooks; they are intuitive and easy to adjust. Then follow Part II of our tutorial, Data-driven Historical Stats from the FIFA World Cup between 1930 and 2014, where you will learn how to insert more data and create compelling reports.

After you have mastered data modeling, you can also take a look at the Infographics section of the Analysis panel. This is where you can prepare for our next tutorial where we show how to Create Your First Custom Infographics — Data Visualization with Repods.

Repods is a data platform where you create and manage Data Pods. Data Pods are compact data warehouses with flexible storage, vCores, and memory plus all the tooling needed for complex data analyses.

You can check out the Pod used in this tutorial here.
