BLOG 4 | Summary of weeks 3 and 4

 

Week 3 | Lecture 6: Advanced Star Schema Design

In this lecture we learned some advanced concepts of the star schema design. Let’s take a look at them.

Advanced Dimension Concepts

Surrogate Keys

Surrogate keys play a very important role in dimensional modeling and in data warehousing.
They are keys generated by the data warehouse and used for joins between the dimensions and the fact table.
This could be achieved by using the primary keys of the dimensions, but that would make our fact table vulnerable to changes in primary key formats that can occur over time.
So we use surrogate keys, which are essentially non-intelligent sequential numbers used instead of primary keys for joining to the fact table.
Surrogate keys minimize the impact of changes in primary keys and increase operational efficiency.
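A minimal sketch of the idea in SQL, using a hypothetical customer dimension and sales fact (table and column names are my own, not from the lecture):

```sql
-- customer_key is the warehouse-generated surrogate key;
-- customer_id is the natural/primary key from the source system.
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,   -- non-intelligent sequential number
    customer_id   VARCHAR(20),           -- source system's primary key
    customer_name VARCHAR(100)
);

CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    sales_amount DECIMAL(10,2)
);

-- Joins always go through the surrogate key, so a change in the source
-- system's id format never touches the fact table.
SELECT c.customer_name, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_customer c ON c.customer_key = f.customer_key
GROUP BY c.customer_name;
```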

Degenerate Dimensions

Degenerate dimensions are like primary keys for a fact table. They are added as attributes of the fact table with no dimension table associated with them, a typical example being an order or invoice number. Degenerate dimensions are used for grouping and enable market basket analysis.
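For example, a hypothetical sales fact with an order number as its degenerate dimension (my own sketch, not from the lecture):

```sql
-- order_number is a degenerate dimension: it sits in the fact table
-- with no dim_order table behind it.
CREATE TABLE fact_sales_line (
    date_key      INTEGER,
    product_key   INTEGER,
    order_number  VARCHAR(20),   -- degenerate dimension
    quantity_sold INTEGER
);

-- Grouping by the degenerate dimension supports market-basket style questions,
-- e.g. how many distinct products appear on each order.
SELECT order_number, COUNT(DISTINCT product_key) AS products_on_order
FROM fact_sales_line
GROUP BY order_number;
```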

Role Playing Dimension

A role-playing dimension is basically a table with multiple relationships between itself and another table. Another way to understand it: if the same dimension is used more than once under different names in the cube, it is a role-playing dimension. We commonly see this with dimensions like time. For example, the time dimension can be split into an order date dimension and a ship date dimension to track the two different time values for when an order was placed and when it was shipped. The same time dimension is used for different purposes, or roles, and hence these are called role-playing dimensions.
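In SQL this usually means one physical date dimension joined twice under different aliases; a hypothetical sketch:

```sql
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date DATE
);

CREATE TABLE fact_orders (
    order_date_key INTEGER REFERENCES dim_date (date_key),  -- "order date" role
    ship_date_key  INTEGER REFERENCES dim_date (date_key),  -- "ship date" role
    order_amount   DECIMAL(10,2)
);

-- The same dimension table plays two roles via two aliases.
SELECT od.full_date AS order_date, sd.full_date AS ship_date, f.order_amount
FROM fact_orders f
JOIN dim_date od ON od.date_key = f.order_date_key
JOIN dim_date sd ON sd.date_key = f.ship_date_key;
```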

Junk dimensions

These dimensions pertain to indicators with low-cardinality values, that is, indicators like “yes/no”, “true/false”, or 1s and 0s that flag information like discount applied, credit card used, etc. These multiple low-cardinality flags and indicators are collectively stored in a single dimension called the junk dimension. Junk dimensions are used to reduce the number of dimensions in the dimensional model (by collectively storing the flags in one place) and they also reduce the number of columns in the fact table. They make a data warehouse more manageable and increase query performance.
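A hypothetical junk dimension might look like this, with the fact table carrying a single key instead of several flag columns:

```sql
CREATE TABLE dim_order_flags (
    order_flags_key  INTEGER PRIMARY KEY,
    discount_applied CHAR(1),   -- 'Y' / 'N'
    credit_card_used CHAR(1),   -- 'Y' / 'N'
    gift_wrapped     CHAR(1)    -- 'Y' / 'N'
);

-- One row per combination of flag values; the fact table stores only the key.
CREATE TABLE fact_order_header (
    order_flags_key INTEGER REFERENCES dim_order_flags (order_flags_key),
    order_amount    DECIMAL(10,2)
);
```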

Slowly changing dimensions (SCD)

Usually, dimensions don’t change, but there are instances where a certain value may change over time. These include entities such as names of geographical locations, customers, or products. For example, a store’s location can change over time.

There are multiple ways to deal with slowly changing dimensions when they change values:

Type I

Method: Simply update the changed field, overwriting the old value with the new one.

Drawback: you lose all the history associated with the previous value.
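A minimal sketch in SQL, using a hypothetical store dimension (also reused in the Type II and III sketches below):

```sql
-- Hypothetical store dimension for the SCD examples.
CREATE TABLE dim_store (
    store_key INTEGER PRIMARY KEY,   -- surrogate key
    store_id  VARCHAR(10),           -- natural key from the source system
    city      VARCHAR(50)
);

INSERT INTO dim_store VALUES (17, 'S-101', 'Cambridge');

-- Type I: overwrite in place; the old city value is simply lost.
UPDATE dim_store
SET    city = 'Boston'
WHERE  store_id = 'S-101';
```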

Type II

Method: Create a new record with the same primary (natural) key, a new surrogate key, and an indicator.

The primary key stays the same, the new surrogate key identifies the new version of the record, and the indicator (a version number, from and to dates, etc.) helps us identify the latest version of the information while preserving its history.

Advantages: it accurately preserves unlimited history about our data.

Disadvantages: This method increases the size of the table quickly, which could lead to storage and performance issues over time.
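A sketch of how the change might be applied (hypothetical column names; the validity dates and current-row flag play the role of the indicator):

```sql
-- Type II variant of the hypothetical store dimension.
CREATE TABLE dim_store_scd2 (
    store_key    INTEGER PRIMARY KEY,  -- new surrogate key for every version
    store_id     VARCHAR(10),          -- natural key stays the same
    city         VARCHAR(50),
    from_date    DATE,
    to_date      DATE,
    current_flag CHAR(1)
);

INSERT INTO dim_store_scd2
VALUES (17, 'S-101', 'Cambridge', DATE '2020-01-01', DATE '9999-12-31', 'Y');

-- When the city changes: expire the old version ...
UPDATE dim_store_scd2
SET    to_date = DATE '2024-01-31', current_flag = 'N'
WHERE  store_id = 'S-101' AND current_flag = 'Y';

-- ... and insert a new version with a new surrogate key.
INSERT INTO dim_store_scd2
VALUES (207, 'S-101', 'Boston', DATE '2024-02-01', DATE '9999-12-31', 'Y');
```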

Type III

Method: Create a new column that stores the previous value of our information that’s changed.

Under this method a new column is added which helps us preserve the history of our data (previous value) and have its updated value (current value) in the same row.

Advantages: it does not increase the size of the table, and it allows us to keep some part of our data history.

Disadvantages: This method preserves only limited history for our record, as it keeps just the previous and current values, so it is not suitable in cases where we want to preserve all our historical data.
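Continuing the hypothetical dim_store sketch from the Type I example, Type III just adds a column for the prior value:

```sql
-- Type III: keep only the previous value in an extra column of the same row.
ALTER TABLE dim_store ADD COLUMN previous_city VARCHAR(50);

UPDATE dim_store
SET    previous_city = city,
       city          = 'Chicago'
WHERE  store_id = 'S-101';
```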

Hybrid Solution

Method: It is a combination of Types 1, 2, and 3. In this method we have a current record that stores the current data, historical records that store the historical or changed data, from and to dates to keep track of expired records, and finally an indicator flag that marks which record is the current one.

Advantages: This method preserves all historical data accurately and provides an easy way to compare current and historical data.

Disadvantages: This method adds a lot of complexity to our data warehouse, and it stores a lot of redundant data.
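As a structural sketch only (hypothetical columns), a hybrid dimension combines the Type 2 version rows and dates with a Type 3 style current-value column that is overwritten Type 1 style on every change:

```sql
CREATE TABLE dim_store_hybrid (
    store_key     INTEGER PRIMARY KEY,
    store_id      VARCHAR(10),
    historic_city VARCHAR(50),   -- city as of this version (history, Type 2)
    current_city  VARCHAR(50),   -- overwritten in every row on change (Types 1/3)
    from_date     DATE,
    to_date       DATE,
    current_flag  CHAR(1)        -- marks the current record
);
```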

Now we move on to facts. The different types of facts are:

Additive Facts

These facts are numeric, and they can be added across all dimensions. For example, number of items sold is an additive fact as it can be added across all dimensions like product, time and stores.

Semi Additive Facts

These facts are also numeric, but they cannot be added across all dimensions. They are measures that only make sense if aggregated across certain dimensions. For example, inventory or balance amounts are semi-additive, as they can be aggregated across various dimensions but not across time.
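For instance, with a hypothetical daily inventory snapshot fact (my own sketch), summing across products for one day is fine, but across time you would average (or take the latest value) instead:

```sql
CREATE TABLE fact_inventory_snapshot (
    date_key         INTEGER,
    product_key      INTEGER,
    quantity_on_hand INTEGER   -- semi-additive measure
);

-- OK: adding across products for each day.
SELECT date_key, SUM(quantity_on_hand) AS total_on_hand
FROM   fact_inventory_snapshot
GROUP  BY date_key;

-- Across time: average rather than sum, otherwise the result is meaningless.
SELECT product_key, AVG(quantity_on_hand) AS avg_daily_on_hand
FROM   fact_inventory_snapshot
GROUP  BY product_key;
```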

Non-Additive Facts

These facts are numeric; they can be a ratio, an average, or a percentage. These facts cannot be added across any dimension. For example, profit margins, averages, and discounts are values that cannot be added across any dimension, as doing so would give us incorrect results. However, they are still legitimate facts, since facts can be about averages, like the average number of products sold per week.

Factless Fact Tables

These fact tables contain no numeric measures, and hence we can’t do any additions across any dimensions for these values. An example of a factless fact table would be a fact table containing just the surrogate keys of the dimensions it is surrounded by.

These tables are used to record events and perform analysis on them. For example, a factless fact table can help us track products that were on promotion but did not do so well in the market.
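A hypothetical promotion coverage table illustrates this: it records the event (product on promotion in a store on a day) with keys only, and the analysis comes from comparing it against a sales fact (here the fact_sales_line sketched earlier):

```sql
-- Factless fact table: only dimension keys, no measures.
CREATE TABLE fact_promotion_coverage (
    date_key      INTEGER,
    product_key   INTEGER,
    store_key     INTEGER,
    promotion_key INTEGER
);

-- Products that were on promotion on a given day but recorded no sales that day.
SELECT pc.date_key, pc.product_key
FROM   fact_promotion_coverage pc
LEFT JOIN fact_sales_line s
       ON  s.date_key    = pc.date_key
       AND s.product_key = pc.product_key
WHERE  s.product_key IS NULL;
```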

Snowflake Schema

The snowflake schema is a variant of the star schema. While a star schema contains just the fact and dimension tables, a snowflake schema contains fact, dimension, and sub-dimension tables. It is a kind of schema where we normalize the dimension tables, and the inclusion of the sub-dimensions gives it a snowflake-like structure, hence the name snowflake schema.
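For example, a snowflaked product dimension might pull its category attribute out into a sub-dimension (hypothetical tables):

```sql
-- Sub-dimension: category is normalized out of the product dimension.
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50)
);

CREATE TABLE dim_product_snowflaked (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INTEGER REFERENCES dim_category (category_key)
);
```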

Figure 1: Differences between Star Schema and Snowflake Schema 
Source https://www.guru99.com/star-snowflake-data-warehousing.html

Fact Constellation Schema

The fact constellation schema is a type of schema where there are multiple fact tables that share dimension tables. These can be viewed as a collection of several star schemas and hence are also known as galaxy schemas.
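A tiny hypothetical galaxy: two fact tables built around the same shared dimensions (e.g. the dim_date and dim_product_snowflaked tables sketched above):

```sql
-- Both fact tables reference the same date and product dimension keys,
-- so their results can be compared across the two stars.
CREATE TABLE fact_sales_daily (
    date_key     INTEGER,
    product_key  INTEGER,
    sales_amount DECIMAL(10,2)
);

CREATE TABLE fact_shipments (
    date_key      INTEGER,
    product_key   INTEGER,
    units_shipped INTEGER
);
```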

Figure 2: Differences between Star Schema and Fact Constellation Schema 
Source https://www.geeksforgeeks.org/difference-between-star-schema-and-fact-constellation-schema/


Week 3 | Lecture 7: Data Quality Analysis

In this lecture we learned about data quality, data quality analysis, data profiling and cleansing techniques, and master data management.

Data profiling is the use of analytical techniques on a data set to understand and assess its content, structure, and quality.

We use data profiling for various purposes, a few of them being:

to determine if the existing data can be repurposed, to check if the data conforms to required standards, to assess the risks in integrating data, to track data quality, to measure the accuracy of metadata descriptions, to understand potential data challenges, and to enable data governance and master data management.

The data profiling process includes various types of analysis, such as:

verifying the accuracy of metadata descriptions for our data values, identifying representations of missing data, identifying default and invalid values, checking for data format inconsistencies, assessing the various integrities (domain, structural, and referential), and discovering the business rules residing in our dataset.

There are various data profiling software tools that automate a lot of the data profiling process and can be used by data scientists to save time and effort. They essentially automate the work of understanding the dataset by generating statistics (summaries, frequencies, variance, etc.) for the values and formats in our dataset.

Steps of Data Profiling

Create a plan of action

In this step we decide how we are going to conduct our profiling, based on our needs and the nature of our dataset. It can be table by table, column by column, by referential integrity constraints, or by business rules.

Interpret the results of profiling

After following our plan and conducting data profiling, we now have to interpret our results. For instance, these results could tell us about our data quality (whether there are missing values, inaccuracies, etc.). We use these interpretations to plan our next step.

Cleaning the data in preparation for ETL

In this step we plan and clean the data to improve its quality. This is done to prepare the data before it goes through the ETL processes or is loaded into a data mart or a warehouse.

We then looked over an example of data profiling where we checked for nulls, missing values, actual values, and cardinality in order to understand the completeness, uniqueness, and distinctness of our dataset.
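A few of these checks can be hand-rolled in SQL; a minimal sketch against a hypothetical staging table (not the dataset from the lecture):

```sql
-- Hypothetical staging table being profiled.
CREATE TABLE stg_customer (
    customer_id VARCHAR(20),
    email       VARCHAR(100)
);

-- Completeness, missing values, and cardinality of the email column.
SELECT COUNT(*)                AS total_rows,
       COUNT(email)            AS non_null_emails,
       COUNT(*) - COUNT(email) AS missing_emails,
       COUNT(DISTINCT email)   AS distinct_emails   -- cardinality
FROM   stg_customer;
```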

We then performed integrity checks, where we checked all primary keys and foreign keys to make sure that they are unique, not null and pass the referential integrity test.
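In SQL terms these integrity checks look roughly like the following hand-rolled sketch (reusing the hypothetical tables from earlier in this post, not the tool from the lecture):

```sql
-- Primary key check: natural keys that are duplicated or null.
SELECT customer_id, COUNT(*) AS occurrences
FROM   stg_customer
GROUP  BY customer_id
HAVING COUNT(*) > 1 OR customer_id IS NULL;

-- Referential integrity check: fact rows whose product_key has no match
-- in the product dimension (orphaned foreign keys).
SELECT f.product_key, COUNT(*) AS orphan_rows
FROM   fact_sales_line f
LEFT JOIN dim_product_snowflaked d ON d.product_key = f.product_key
WHERE  d.product_key IS NULL
GROUP  BY f.product_key;
```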

Once we did our integrity tests, we performed business rules check. Under this we gather all the business rules from our data set and perform checks on our data to see if it conforms to the defined business rules or not.

Then we discussed the software tools that can be used to conduct data quality analysis, including the vendors covered in Gartner’s “Magic Quadrant” and the Ataccama DQ Analyzer tool (I have summarized what we learned in these tutorials under the Readings section).

Master Data Management

Master Data Management is the process of standardizing the naming and formats of attributes in the dimension and fact tables; it verifies that no business rules are being violated, and it establishes controls and governance processes for facts and dimensions. It involves checking and maintaining the data quality and integrity of the tables as they evolve and grow over time.

Data quality analysis is a subset of a larger process called Master Data Management.

Master Data Management, if performed properly, streamlines and facilitates the ETL process. It is an important step in data warehouse development, especially for facts and dimensions. Therefore, it is important that the master data management process is completed properly before we feed our data to ETL processes, a mart, or a warehouse.

Week 4 | Lecture 8: Dashboard Design and Its Use for Analytics

In this lecture we looked at how to design dashboards and use them for analysis.

A dashboard is a graphical representation of the current status and historical trends of an organization’s key performance indicators. Typically, it is a single-screen summary that enables viewing the performance of an organization at a glance.

A dashboard is a component of the enterprise information system.

Components of a dashboard

A dashboard contains quantitative measures of performance and baselines against which to compare performance, or to show how close you are to a target. A dashboard’s measures are compared and analyzed over time and geography, making those two important components of the dashboard.

There are various types of graphical and visual representations that we use to display our quantitative measures in a dashboard; these are called widgets. A widget can be many things, from gauges and maps to bar graphs, text labels, etc.

Important characteristics of a dashboard are that it provides a big-picture summary of the data over time, it focuses on items needing more attention, and it lets us quickly dig deeper and fetch the additional information we want to see for our analyses.

We discussed various guidelines for creating a dashboard: it should be simple and comprehensible, it should summarize important data as well as exceptions, it should not distract from our objective, it should enable quick diagnosis, and it should use appropriate widgets to represent our measures.

We also looked at the common pitfalls of dashboard design. Although they are pretty simple and straightforward, we often make these mistakes, so let’s have a quick look at them.

Our dashboard should not exceed the single-page boundary, it should not present excessive detail, it should not express measures in an indirect way, it should not have variety that deviates from our objective, and it should not clutter the screen with unnecessary aesthetics and widgets.

Then, we looked at the types of dashboards

Geographic: these display information we want to present in a map by layering the contents on a map. These can be used by everyone for various purposes.

Real Time: these dashboards display real time data, supporting immediate decisions. These can be very useful for industries that have to make time critical decisions like stock market brokers or hospitals.

Performance management: these dashboards display key performance indicators. An example of this type of visualization would be using a balanced scorecard to represent the metrics. Typically, this is used by management to monitor and analyze the performance of an organization.

Analytical: these dashboards are used to find anomalies in large data sets. These visualizations use colors and isolation to offer more detail regarding the findings. Analytical dashboards are often used by data analysts and back offices.

Custom: Custom dashboards are used when we have multiple sources of data which need to be visualized in a variety of ways according to the goals and needs of the users. These use different kinds of widgets and visualizations to deliver a screen that shows the required features at a glance.

To understand how to use dashboards for analysis, we looked at a great TED talk by Hans Rosling (https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen?subtitle=en).

Learnings from this video:

To create meaningful visualizations for your data, you must have a clear understanding of the data and the key performance indicators required for your objective.

To understand your data you must see the big picture, focus just on the items that need attention, and be able to quickly drill down into the additional information that is needed to take action, which he did throughout his presentation.

Statistically meaningful representations of your data can help you find trends and exceptions that might not be clearly visible to everyone. He showed this by debunking popular beliefs about third world countries and their trends over time.

He included just the measures that were required to support his objective and nothing more. This helped him show a clear big-picture analysis that was focused on his objective and easy to understand.

Lastly, his visualizations were simple, clear, interactive, and animated. This kept the audience engaged and helped them understand the trends he was talking about with ease.

Readings

DQ Analyzer Tutorial

This is Ataccama’s DQ Analyzer tutorial site, where we learned:

how to create a profile

how to understand and interpret the profiling results
This was explained by a video in which we took an example dataset and viewed the profiling results, which included broad-level statistics on the left and detailed information about the selected columns on the right. This essentially goes through the checks that we discussed in our lecture.

Advanced profiling
These are functionalities that can be found and set up in the properties of the profiling dialog boxes; together they amount to a plan that we run, which gives us the analysis of our columns, primary keys, and business rules. These are then used to predict data quality, make a plan, and/or perform cleansing of our data.
 
Last, we look at email analysis.
In this tutorial we learn how to standardize email addresses that are extracted from our data source. This can be achieved by using regex matching functions and functions to conditionally separate and merge data flows.
This demonstrates the advanced profiling functions and adds the analyses to the example profile. The analysis includes additional mask analysis, drill-through, business rule, and primary key analysis.
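Just to make the idea concrete, here is a rough plain-SQL equivalent of the email format check (this is not how Ataccama’s plan is built; it assumes a PostgreSQL-style regex operator and reuses the hypothetical stg_customer table from above):

```sql
-- Split addresses into valid-looking and cleansing candidates with a regex.
SELECT email,
       CASE WHEN email ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
            THEN 'valid format'
            ELSE 'needs cleansing'
       END AS email_status
FROM   stg_customer;
```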

OCDQ Blog by Jim Harris

Adventures in Data Profiling is a multi-part blog series on data profiling in which Jim Harris covers the various features of a data profiling tool. The first part talks about the checks (aligning with our lecture): null, missing, and actual values, completeness, cardinality, uniqueness, and distinctiveness.
Part two talks about the drill-down functionality for a field summary provided by the tool, where we try to answer whether customer_ID is a primary key or not.
In part three we go over the analysis of combinations of field values and their formats (email address, telephone number, and birth date).
In part four, we analyze the postal addresses and formats present in our dataset, where we find how forcing international addresses into our predefined formats was causing data quality issues.
In part five, we further analyze this situation and perform a postal address validation.
In part six, we explore the account number and tax ID fields and their formats. We also look for possible data redundancy in them.
In part seven, we take a look at the customer name fields, which have high cardinality and distinctiveness. We also talk about the challenges that come with customer names and their validation, since there are no standards for a person’s name.
Lastly, in part eight we conclude our analysis by listing all the analyses we performed on our data, categorized into counts and percentages, values and formats, and drill-down analysis of our data.

Dashboard Design - Beyond Meters, Gauges and Traffic Lights

In this article we learn the fundamental challenges of dashboard design and the characteristics required for dashboards to meet those challenges. They include exceptional organization; clear, concise, and small display widgets; an emphasis on summaries and exceptions; and information that is finely customized for the task. Then we explore the common types of dashboard information: measures over time, how they are being measured, and the relationships and evaluative contexts that enhance their meaning. These also include relationships and predictions over multiple measures or temporal instances of a measure. Moving on, we learn about effective dashboard display widgets using the automobile dashboard metaphor. Here we explore the various types of widgets and the information they can help visualize, so that we get a better understanding of which widget to use for a particular set of data. Finally, we put it all together on a single page for a dashboard that lets us view our key performance indicators, statistics, and summaries at a glance.

Common Pitfalls in Dashboard Design

This article talks about the common dashboard pitfalls, extending the ones that we already looked at in our lecture. It lists thirteen common pitfalls in dashboard design and discusses them in detail. The pitfalls are: Exceeding the boundaries of a single screen, Supplying inadequate context for the data, Displaying excessive detail or precision, Expressing measures indirectly, Choosing inappropriate media of display, Introducing meaningless variety, Using poorly designed display media, Encoding quantitative data inaccurately, Arranging the data poorly, Ineffectively highlighting what’s important, Cluttering the screen with useless decoration, Misusing or overusing color, and Designing an unappealing visual display. These resonate closely with what we already explored and understood from our lecture. The article explains to its readers the pitfalls to avoid and the desired characteristics of a dashboard, which are seeing the big picture, focusing just on the items that need attention, and being able to quickly drill down into the additional information needed to take action.

The 5 Most Influential Data Visualizations of All Time

This is a Tableau presentation that talks about data visualizations and the most influential ones of all time.
These include the London cholera map by John Snow, who plotted every death on a map and realized that the water supply was the source of the deadly disease.
Next, we have Gapminder by Hans Rosling (we also saw his TED talk that used this visualization software), who showed us the relationship between income and life expectancy.
Third, we have the March on Moscow by Charles Minard, who used a simple temperature line to depict the horrors of the war and how the cold ultimately defeated Napoleon’s army.
At number two, we have the war mortality diagram by Florence Nightingale, who used this visualization to convince the British army to invest in hospitals instead of guns and ammunition.
Last, we have the Chart of Biography by Joseph Priestley, who used his visualizations to explain and document both the rise and fall of empires and the unique thinkers that defined those nations.

 Articles 

Week 3 | Star Schema Examples

https://www.sciencedirect.com/topics/computer-science/star-schema?fbclid=IwAR2_TTcmJKBvOhDVBeT4hX4ZjbyA_WLWuDEL-rwk71njJCNLuUU5_RUZP38

This week we focused heavily on star schemas. During my research for various examples of star schemas for our assignment case study (from our book, lectures, and the web), I stumbled upon this gem! It's a collective list of star schema examples from various books. I particularly enjoyed studying the one with corporate groups in the airline industry (in Building a Scalable Data Warehouse with Data Vault 2.0, 2016).
This example talks about the various dimensions involved and their relationships, as expected.
Then it talks about the indirect relationship between two dimensions that were referenced independently in the fact table, and how the star schema misses out on the business rules that come from these inter-dimension relationships, which was depicted in a SQL query example.
Then it shows how the snowflake schema tackles this issue, as it allows indirect dimensions to be included, and backs this up with a SQL query example.
This example really helped strengthen my understanding of both star and snowflake schemas.

Week 4 | BI Dashboard Best Practices 

https://www.datapine.com/articles/bi-dashboard-best-practices

Since in week 4 we primarily focused on dashboards, this article seemed like the perfect choice. It defines a BI dashboard and the various benefits that come with it. The primary benefits it talks about are Trend identification, Increased efficiency, Increased accuracy, Interactive data visualization, Self-service features, Enhanced communication, Accurate forecasting, Real-time insights, and Freedom & flexibility. Moving on, the article discusses the best practices for BI dashboard design:

1. IDENTIFY YOUR REPORTING REQUIREMENTS
The first thing you need to clarify is the “Ws”: why do you need to report, and to whom? List all the stakeholders involved, as well as the decision-makers and end-users.

2. KNOW YOUR DASHBOARDS
Knowing what type of dashboard serves your objective best. Types of dashboards: Strategic, Analytical, Operational, and Tactical.

3. DEFINE THE NEEDED DATA SOURCES
A good practice here is to use data sources that will bring the most value to your specific goal tracking process

4. DESIGN DATA TO AVOID CLUTTER
You should focus on concise and effective design. Effective communication should always be your number one priority.
This point discusses the six principles defined by the Gestalt school of psychology as the main principles outlining basic human interaction and order creation within the context of visual stimulation (I personally loved this section).

5. DESIGN DATA WITH A CLEAR VISUAL ORDER
Here we look at the elements of a well-ordered dashboard, engineered for visual success. These are Consistent Alignment, White Space, and Use of Contrast.

6. CHOOSE THE RIGHT DATA VISUALIZATIONS
Display data in a way that resonates with your audience on a deeper, more meaningful level. This will ensure that the data is transformed into actionable insights that help drive the business forward.

7. USE INTERACTIVE FILTERS
This can be achieved through various filters like drill-downs, drill-throughs, and click-to-filter.

8. PROVIDE DATA CONTEXT
Data analyses and visual representations lacking context will be limited in their utility and ROI

9. WEB-BASED ACCESS
Web-based data sharing and analyses reach a wider audience and allow for greater collaboration with other solutions

10. CONSIDER YOUR TARGET AUDIENCE
Considering your target audience is one of the most integral practices for BI dashboarding

11. TELL A TALE WITH YOUR METRICS
Storytelling is a powerful best practice for any business, regardless of industry or sector. A compelling narrative will capture the attention of your audience and will also break down your findings in an easy to follow form.

12. REFINE - PERFECT - REPEAT
The digital world is in a constant state of flux, and to remain adaptable to change, it’s vital to make continual improvements.

This is followed by neat BI dashboard examples to help us understand the points discussed above. 
 
Do let me know if my summary helped you better understand a concept, or if you have any feedback, in the comments below!

Comments

  1. Hi Yashree, You have summarized very well. I like your BI dashboard best practice article and explanation about all the 12 points which we should while building the dashboard. Good comparison between Star and Snowflake schemas.

    Replies
    1. Thank you for taking the time to read and comment on my blog Pushpa!
      I always learn better when I have clear differentiations between concepts and real life examples to link to the learnings and so I thought including them in my summary could turn out to be helpful for others as well :)

  2. Yashree, I agree with Pushpa that your summary is very detailed. I like how you broke down the advantages and disadvantages of the dimension types, and the details of your dashboard design. Very well done!

    Replies
    1. Hi Adrienne, Thank you for taking the time to read and commenting on my blog!
      My nerd self does better with additional information like differences, advantages and disadvantages as they give a context on the uses and the needs for newer models and types; That's why I included them in my summary, hoping they would help others to understand and tie the topics together in their mind better :)

  3. Hi Yashree, I really liked the way you laid out your summary of the past few week's material. You were extremely thorough, great job! I couldn't agree more with the BI best practices you outlined. Stakeholder management is such an important step in this process. It is invaluable to get in front of your stakeholders early and often to ensure that your development lines up with what they are trying to achieve. There is nothing worse than being weeks into development and having to scrap views! One other point I would add is a final QA step. There are a lot of moving parts to developing dashboards, which opens up opportunity for error. It is critical to QA the final output of your dashboards to ensure they line up with raw data/queries.

  4. Hello Yashree, I really enjoyed your summary of Data Warehouse Design. I definitely need to start referring to your blog for reviewing. As someone who has typically uses PowerBI for work, I enjoyed the BI Dashboard for Best Practices article. I definitely try to follow most of these points but the one I find more prolific is number 11, "TELL A TALE WITH YOUR METRICS". Overall data is really boring, but I do find it enjoyable when my managers make connections from charts that I made. Besides the Tableau assignment, do you have any personal experience with BI Dashboards?

  5. Wow, very thorough summary of the topic! I like the list of dashboard best-practices at the end. I feel like deciding the right visual to use is often the most difficult and ties in with other best-practices like knowing your audience and telling a story. I feel there are often multiple visuals that work for a situation and a lot of it depends on the story that the creator wants to tell. But I always find myself second-guessing my choices of visuals and wondering if a different visual type would have more clearly articulated the finding that I wanted to show.

