avatar

Arseny Egorov

Data Engineering Blog

Loop: life DWH, logical and physical models. Part 1

#loop

Loop is my pet project. Pet projects are sandboxes where any sandcastles can be built with the fastest iteration cycle. To learn new material, I always need real-life examples and practice. If I read a book explaining what 3NF is, then the best way for me to understand it is to check some real, physical tables. In general, knowledge is really valuable when applied to real work. However, the pet project, as I said above, is my laboratory where I conduct experiments and put new ideas to initial tests.

I’ve needed my personal data warehouse dedicated to learning for a long time, and decided to start with the following idea:

  • I have Garmin watches and I’m tracking time spent on work, pet projects, learning new things, and gaming
  • So let’s collect, model, and transform this data in a data warehouse built from scratch
  • Then let’s build some dashboards on top of this data

These are examples of questions on which I want to get the answers from time to time:

  1. How much time did I spend by day/week/month working, learning, or gaming?
  2. How many workouts did I have by day/week/month?
  3. When did I first start to workout? How consistent am I?
  4. What are my median/average bedtime and wake-up times by week/month?

I’m a Data Engineer, so my hammer will be used for the nails:

  • S3 and Postgres for the raw and detailed layers
  • Apache Spark for data transformations
  • delta.io for the staging layer
  • Dagster for the orchestration
  • Metabase to build dashboards

In this and the following posts I’m going to tell how the thing was built. Here’s the final result – a dashboard which looks like this (screenshots from my Metabase):

You may see some flaky graphs like “Sleep Fall Asleep”, I decided to leave it as is, otherwise I will never publish this post. There is no limit to perfection.

Why “Loop”?

Don’t confuse with Polar’s Loop device! I didn’t have time to trademark the project name, so I’m sticking with “Loop” which sounds like “Whoop”. I used a Whoop band for nearly 9 months and loved it. The device is impressive: it works more than 1-2 days(hey Apple Watch), doesn’t have screen, and is solely focused on some magical “insights”.

Anyway, now I have Garmin and I’m absolutely happy. One day I’ll implement something what Whoop does, like analyzing correlations between Garmin metrics and working hours. I also plan to add Google Forms for daily mood and habits tracking in the future.

Sources

Let’s start by investigating what data we can collect.

Time tracking – EARLY

I’m tracking my time. What I’m tracking: working time, time spent on pet projects, on reading technical books, on learning languages, playing video games.

I track only deep and focused work, fully commiting to what I’m doing in the moment. If spend 8 hours in the office then 5-6 hours will be tracked. It’s just a metric. Maybe a large amount of hours can lead to better results, maybe not, there are no guarantees. It’s the same as possession in football, number of git commits made, HRV, or the “stress level”.

Anyway, with this data, trends can be spotted:

  • How much time I spent playing video games this week
  • Time spent on a meetings or in the focusing mode
  • I read a new book about databases, how much time it took

EARLY(formerly Timeular) is the time tracking app I use, paying for a subscription. It’s simple, minimal, and does 95% of what I expect. One of the selling points for EARLY is an API. They also sell Time Tracking Cube – physical cube (I have one), which you can use if you want physical feedback on tracking.

This is my current folders and activities layout. The hierarchy is rather simple and works perfectly for me, covering everything I’m doing on a daily basis.

Early categories

To get data from the API I will implement simple HTTP client with pydantic validation. What we will download:

  1. Folders
  2. Activities
  3. Tags
  4. Time entries

Workouts, sleep, steps and many more – Garmin watches

I’m a proud owner of a Forerunner 265. These watches are the best I’ve ever had. The top feature for me is the battery life – 5-7 days. Garmin collects a lot of metrics, but as a starting point we will focus on:

  1. Sleep
  2. Workouts

To download Garmin data I’ll use the garminconnect Python library.

Logical model

This section heavily uses concepts from the Database Design Book written by Alexey Makhotkin, which I read with pleasure before creating the logical model for the project. The book is highly recommended reading. You might also be interested in the author’s articles and author’s Substack Minimal Modeling. Also, I’m writing a review of the book. You can notice I use lots of links to Alexey’s articles throughout this post because they’re quite useful and the author covers many data modeling concepts.

Quote from the book: “Logical model exists even if you refuse to acknowledge it. When not written explicitly it’s scattered in 3 places: 1) physical db schema 2) system code 3) people’s memory.”

The Logical Model should become the single source of truth for my data warehouse. If I want to build a new feature, I’ll start with the Logical Model first and then move to the Physical Implementation. You may think this is redundant bureaucracy for the pet project, but I try not to be “tactical tornado” here.

After checking the EARLY API and garminconnect library for what data they provide, we can extract anchors, attributes, and links for the logical model.

Anchors

We start from extracting entities or anchors (nouns). I also added the “source” column so it’s clear where the data comes from.

Name ID examples Table name (physical) Sources
Time Folder “280326” time_folder EARLY api
Time Activity “2052354” time_activity EARLY api
Time Tag 14831771 time_tag EARLY api
Time Entry “106227041” time_entry EARLY api
Workout Type 21 workout_type Garmin Connect
Workout 20150174780 workout Garmin Connect
Sleep 1752277343000 sleep Garmin Connect

Attributes

I extracted this minimal set of attributes by answering questions.

Anchor Question Logical Data Type Example Column name and type (physical)
Time Folder What is the name of the folder? string “0. Vio” time_folder.name TEXT NOT NULL
Time Folder What is the description of the folder? string “Work” time_folder.description TEXT NOT NULL
Time Activity What is the name of the activity? string “Gaming” time_activity.name TEXT NOT NULL
Time Activity What is the status of the activity? enum “active”, “inactive”, “archived” time_activity.status ENUM NOT NULL
Time Activity What is the description of the activity? string “Playing video games” time_activity.description TEXT NOT NULL
Time Tag What is the label of the tag? string “crafting_interpreters” time_tag.label TEXT NOT NULL
Time Entry When the entry was started? timestamp in UTC “2025-08-16 06:08:43+00:00” time_entry.start_at TIMESTAMPTZ NOT NULL
Time Entry When the entry was ended? timestamp in UTC “2025-08-16 07:22:27+00:00” time_entry.end_at TIMESTAMPTZ NOT NULL
Workout Type What is the name of the type? string “yoga” workout_type.name TEXT NOT NULL
Workout What is the name of the workout? string “Yoga” workout.name TEXT NOT NULL
Workout When the workout started? timestamptz “2025-08-23 10:19:05+02:00” workout.start_at TIMESTAMPTZ NOT NULL
Workout When the workout ended? timestamptz “2025-08-23 10:31:56+02:00” workout.end_at TIMESTAMPTZ NOT NULL
Sleep When the sleep started? timestamptz “2025-08-22 21:49:46+02:00” sleep.start_at TIMESTAMPTZ NOT NULL
Sleep When the sleep ended? timestamptz “2025-08-23 05:57:46+02:00” sleep.end_at TIMESTAMPTZ NOT NULL

Let’s define relations between anchors.

Anchor1:Anchor2 Cardinality Sentences Table or column name (physical)
Time Folder : Time Activity 1:M Time Folder contains multiple Time Activities. Time Activity is part of only one Time Folder time_activity.time_folder_id
Time Folder : Time Tag 1:M Time Folder contains multiple Time Tags. Time Tag is part of only one Time Folder time_tag.time_folder_id
Time Activity : Time Entry 1:M Time Activity contains multiple Time Entries. Time Entry is part of only one Time Activity time_entry.time_activity_id
Time Entry : Time Tag M:N Time Entry is labeled with multiple Time Tags. Time Tag can label multiple Time Entries link_time_entry__time_tag
Workout Type : Workout 1:M Workout Type contains multiple Workouts. Workout can be only one of Workout Type workout.workout_type_id

Physical ERD diagram

I use d2 to visualize the ERD diagram. It’s a diagramming language that supports SQL tables and Crow’s foot notation.

I recommend reading these articles written by Alexey:

I didn’t add ERD notation column to the logical design though.

This is a d2 code snippet showing how this diagram is built:

time_folder: {
  shape: sql_table
  id: text {constraint: primary_key}
  name: text not null
  description: text not null
}

time_activity: {
  shape: sql_table
  id: text {constraint: primary_key}
  name: text not null
  status: enum not null
  description: text not null
  time_folder_id: text {constraint: foreign_key}
}

time_activity.time_folder_id <-> time_folder.id: {
  source-arrowhead.shape: cf-many
  target-arrowhead.shape: cf-one-required
}

# ...

Note that we use Foreign Keys in the design. Later we will relax this requirement. Why this should be relaxed is highlighted in the Foreign Keys @ Alexey Makhotkin article. The problem is that we need to load two connected entities in a single transaction. If we reload folders by simply overwriting the table, we truncate it, meaning all related articles and tags that depend on folders will be deleted. If there is a failure loading the folders, then the activities and tags tables will be empty.

Summary

What do we have so far:

  • Data sources were explored (EARLY and Garmin)
  • Logical model was built with anchors, attributes, and links
  • Physical design was created and ERD diagram was drawn

In the next post I will tell about DWH layers and raw data ingestion. Stay tuned and subscribe!

(c) Arseny Egorov
2025