BLOG 4 | Summary of week 3 and 4
Week 3 | Lecture 6: Advanced Star Schema Design
In this lecture we learned some advanced concepts of star schema design. Let's take a look at them.
Advanced Dimension Concepts
Surrogate Keys
Surrogate keys are keys generated by the data warehouse itself, used for joins between the dimension tables and the fact table. We could instead join on the primary keys of the dimensions, but that would leave the fact table vulnerable to changes in primary key formats that can occur in the source systems over time. A surrogate key is essentially a non-intelligent sequential number used in place of the primary key when joining to the fact table. Surrogate keys minimize the impact of changes in primary keys and increase operational efficiency.
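As a minimal sketch (all table and column names here are hypothetical), a dimension can carry both a warehouse-generated surrogate key and the operational key from the source system, with the fact table joining only on the surrogate:

CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,  -- surrogate key: non-intelligent sequential number
    customer_id   VARCHAR(20),          -- operational primary key carried from the source system
    customer_name VARCHAR(100)
);

CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer (customer_key),  -- join on the surrogate, not customer_id
    sales_amount DECIMAL(10,2)
);

If the source system ever reformats customer_id, only the dimension row changes; the fact table keeps joining on the stable surrogate.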
Degenerate Dimensions
A degenerate dimension acts like a primary key for the fact table: it is added as an attribute of the fact table with no dimension table associated with it. Degenerate dimensions are used for grouping related fact rows and enable market basket analysis.
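An order number is the classic example: it lives in the fact table with no dim_order table behind it, and grouping on it supports market basket analysis. A hedged sketch with hypothetical names:

CREATE TABLE fact_order_line (
    order_number  VARCHAR(20),  -- degenerate dimension: no dim_order table exists
    product_key   INTEGER,
    quantity_sold INTEGER
);

-- Market basket: which pairs of products appear in the same order?
SELECT a.product_key AS product_a,
       b.product_key AS product_b,
       COUNT(*)      AS times_bought_together
FROM fact_order_line a
JOIN fact_order_line b
  ON a.order_number = b.order_number
 AND a.product_key  < b.product_key
GROUP BY a.product_key, b.product_key;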
Role Playing Dimension
A role-playing dimension is basically a table with multiple relationships between itself and another table. Another way to understand it: if the same dimension is used more than once under different names in the cube, it is a role-playing dimension. We commonly see this with dimensions like time. For example, the time dimension is reused as an order date dimension and a ship date dimension to track the two different time values for when an order was placed and when it was shipped. The one time dimension is used for different purposes, or roles, and hence it is called a role-playing dimension.
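In SQL this is typically one physical date dimension joined twice under different aliases; a minimal sketch with hypothetical names:

CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date DATE
);

CREATE TABLE fact_orders (
    order_date_key INTEGER REFERENCES dim_date (date_key),  -- role 1: order date
    ship_date_key  INTEGER REFERENCES dim_date (date_key),  -- role 2: ship date
    order_amount   DECIMAL(10,2)
);

-- The single date dimension plays two roles via aliases:
SELECT o.calendar_date AS order_date,
       s.calendar_date AS ship_date,
       f.order_amount
FROM fact_orders f
JOIN dim_date o ON f.order_date_key = o.date_key
JOIN dim_date s ON f.ship_date_key  = s.date_key;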
Junk dimensions
These dimensions pertain to indicators with low-cardinality values: flags like "yes/no", "true/false", or 1s and 0s that record information such as whether a discount was applied or a credit card was used. These multiple low-cardinality flags and indicators are collectively stored in a single dimension called the junk dimension. Junk dimensions reduce the number of dimensions in the dimensional model (by collecting the flags in one place) and reduce the number of columns in the fact table. They make a data warehouse more manageable and increase query performance.
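One common implementation is a small dimension with one row per observed combination of flags, so the fact table carries a single key instead of one column per indicator; a hypothetical sketch:

CREATE TABLE dim_junk_flags (
    junk_key         INTEGER PRIMARY KEY,
    discount_applied CHAR(1),  -- 'Y' or 'N'
    credit_card_used CHAR(1)   -- 'Y' or 'N'
);

-- One row per combination that actually occurs:
INSERT INTO dim_junk_flags VALUES (1, 'Y', 'Y');
INSERT INTO dim_junk_flags VALUES (2, 'Y', 'N');
INSERT INTO dim_junk_flags VALUES (3, 'N', 'Y');
INSERT INTO dim_junk_flags VALUES (4, 'N', 'N');

The fact table then carries a single junk_key column instead of one column per flag.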
Slowly changing dimensions (SCD)
Usually, dimensions don't change, but there are instances where a certain value may change over time. These include entities such as names of geographical locations, customers, or products. For example, a store's location can change over time.
There are multiple ways to deal with slowly changing dimensions when their values change:
Type I
Method: Simply update the changed field, overwriting the old value with the new one.
Drawback: you lose all the history associated with the previous value.
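A minimal sketch of a Type 1 change, with a hypothetical store dimension that the SCD sketches below reuse:

-- Hypothetical store dimension used in the SCD sketches below
CREATE TABLE dim_store (
    store_key  INTEGER PRIMARY KEY,  -- surrogate key
    store_id   VARCHAR(10),          -- natural key from the source system
    store_city VARCHAR(50)
);

-- Type 1: overwrite in place; the previous city is lost
UPDATE dim_store
SET    store_city = 'Boston'
WHERE  store_id = 'S042';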
Type II
Method: Create a new record with the same primary key, a new surrogate key, and an indicator.
Because the primary key has not changed, the new record still ties back to the same entity, while the indicator (a version number, from and to dates, etc.) helps us identify the latest version of the information while preserving its history.
Advantages: it preserves unlimited history about our data accurately.
Disadvantages: this method increases the size of the table quickly, which could lead to storage and performance issues over time.
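A Type 2 change is typically expire-and-insert. This sketch assumes the hypothetical dim_store above has been extended with from_date, to_date, and current_flag columns:

-- Expire the current version of the row
UPDATE dim_store
SET    to_date      = DATE '2024-06-30',
       current_flag = 'N'
WHERE  store_id = 'S042'
  AND  current_flag = 'Y';

-- Insert a new version with a new surrogate key and the same natural key
INSERT INTO dim_store (store_key, store_id, store_city, from_date, to_date, current_flag)
VALUES (1088, 'S042', 'Boston', DATE '2024-07-01', DATE '9999-12-31', 'Y');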
Type III
Method: Create a new column that stores the previous value of the attribute that changed.
Under this method a new column is added, which lets us preserve some history of our data (the previous value) alongside its updated value (the current value) in the same row.
Advantages: it does not increase the size of the table, and it allows us to keep some part of our data history.
Disadvantages: this method preserves only limited historical data for a record. Since it keeps just the previous and current values, it is not suitable in cases where we want to preserve all our historical data.
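A Type 3 change instead adds a column to hold the prior value; continuing the hypothetical dim_store sketch:

-- Type 3: keep the previous value in an extra column on the same row
ALTER TABLE dim_store ADD COLUMN previous_city VARCHAR(50);

UPDATE dim_store
SET    previous_city = store_city,  -- remember the old value
       store_city    = 'Boston'     -- overwrite with the new one
WHERE  store_id = 'S042';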
Hybrid Solution
Method: It is a combination of Types 1, 2, and 3. In this method we have a current record that stores the current data, historical records that store the historical or changed data, from and to dates to keep track of expired records, and finally an indicator flag that marks which record is the current one.
Advantages: this method preserves all historical data accurately and provides an easy way to compare current and historical data.
Disadvantages: this method adds a lot of complexity to our data warehouse, and it stores a lot of redundant data.
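As a hedged sketch of how these pieces can sit together on one table (all names hypothetical), a hybrid store dimension might look like this:

CREATE TABLE dim_store_hybrid (
    store_key     INTEGER PRIMARY KEY,  -- new surrogate key per version (Type 2)
    store_id      VARCHAR(10),          -- unchanged natural key
    store_city    VARCHAR(50),          -- current value
    previous_city VARCHAR(50),          -- prior value kept on the row (Type 3)
    from_date     DATE,                 -- when this version became effective
    to_date       DATE,                 -- when it expired
    current_flag  CHAR(1)               -- 'Y' marks the current record
);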
Now we move on to facts. The different types of facts are:
Additive Facts
These facts are numeric, and they can be added across all dimensions. For example, the number of items sold is an additive fact, as it can be added across all dimensions like product, time, and store.
Semi-Additive Facts
These facts are also numeric, but they cannot be added across all dimensions; they are measures that only make sense when aggregated across certain dimensions. For example, inventory or balance amounts are semi-additive, as they can be aggregated across various dimensions but not across time.
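To see why, consider a hypothetical daily inventory snapshot table: summing across stores on one day is meaningful, but summing the same balance across days double-counts stock, so across time you aggregate differently (for example with an average or a period-end value):

CREATE TABLE fact_inventory (
    date_key         INTEGER,
    store_key        INTEGER,
    quantity_on_hand INTEGER
);

-- Adding across stores on one day is meaningful:
SELECT date_key, SUM(quantity_on_hand) AS total_on_hand
FROM fact_inventory
GROUP BY date_key;

-- Adding across days double-counts stock, so across time we aggregate differently:
SELECT store_key, AVG(quantity_on_hand) AS avg_daily_on_hand
FROM fact_inventory
GROUP BY store_key;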
Non-Additive Facts
These facts are numeric, and they can be a ratio, an average, or a percentage. They cannot be added across any dimension. For example, profit margins, averages, and discounts are values that cannot be added across any dimension, as doing so gives incorrect results. They are still legitimate facts, however, since facts can be about averages, like the average number of products sold per week.
Factless Fact Tables
These fact tables contain no numeric measures, so there is nothing to add across any dimension. An example of a factless fact table would be a fact table containing just the surrogate keys of the dimensions it is surrounded by.
These tables are used to record events and perform analysis on them. For example, a factless fact table can help us track products that were on promotion but did not do well in the market.
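A hedged sketch: the mere existence of a row in the hypothetical promotion-event table records the event, and joining it against a sales fact table finds promotions that produced no sales:

CREATE TABLE fact_promotion_event (
    date_key      INTEGER,
    product_key   INTEGER,
    store_key     INTEGER,
    promotion_key INTEGER
    -- no numeric measures: a row's existence records the event
);

-- Promoted products with no matching sales (assuming a fact_sales table
-- with date_key and product_key columns):
SELECT DISTINCT p.product_key
FROM fact_promotion_event p
LEFT JOIN fact_sales s
  ON s.product_key = p.product_key
 AND s.date_key    = p.date_key
WHERE s.product_key IS NULL;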
Snowflake Schema
The snowflake schema is a variant of the star schema. While a star schema contains just fact and dimension tables, a snowflake schema contains fact, dimension, and subdimension tables. It is a schema in which the dimension tables are normalized, and the structure formed by the branching subdimensions resembles a snowflake, giving the schema its name.
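For instance, a product dimension that holds category text directly (star) can be normalized so the category becomes its own subdimension (snowflake); a hypothetical sketch:

-- Snowflake: the category is split out of dim_product into a subdimension
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50)
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INTEGER REFERENCES dim_category (category_key)  -- link to the subdimension
);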
Fact Constellation Schema
The fact constellation schema is a type of schema in which multiple fact tables share dimension tables. It can be viewed as a collection of several star schemas, and hence is also known as a galaxy schema.
Week 3 | Lecture 7: Data Quality Analysis
In this lecture we learned about data quality, data quality analysis, data profiling and cleansing techniques, and master data management.
Data profiling is the use of analytical techniques on a data set to understand and assess its content, structure, and quality.
We use data profiling for various purposes, a few of which are: to determine whether existing data can be repurposed, to check whether the data conforms to required standards, to assess the risks of integrating data, to track data quality, to measure the accuracy of metadata descriptions, to understand potential data challenges, and to enable data governance and master data management.
The data profiling process includes various types of analysis, such as: verifying the accuracy of metadata descriptions for our data values, identifying representations of missing data, identifying default and invalid values, checking for data format inconsistencies, assessing the various integrities (domain, structural, and referential), and discovering the business rules residing in our dataset.
There are various data profiling software tools that automate a lot of the data profiling process and can be used by data scientists to save time and effort. They essentially automate the work of understanding a dataset by generating statistics (summaries, frequencies, variance, etc.) for the values and formats in the dataset.
Steps of Data Profiling
Create a plan of action
In this step we decide how we are going to conduct our profiling, based on our needs and the nature of our dataset. It can be table by table, column by column, by referential integrity constraints, or by business rules.
Interpret the results of profiling
After following our plan and conducting data profiling, we now have to interpret our results. For instance, these results could tell us about our data quality (whether there are missing values, inaccuracies, etc.). We then use these interpretations to plan our next step.
Cleaning the data in preparation for ETL
In this step we plan and clean the data to improve its quality. This is done to prepare the data before it goes through the ETL processes or is loaded into a data mart or a warehouse.
We then looked over an example of data profiling where we checked for nulls, missing values, actual values, and cardinality in order to understand the completeness, uniqueness, and distinctness of our dataset.
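These checks map naturally onto simple SQL; a hedged sketch against a hypothetical customer table:

-- Hypothetical table, reduced to one column for illustration
CREATE TABLE customer (email VARCHAR(255));

-- Completeness, uniqueness, and distinctness of the email column
SELECT COUNT(*)                AS total_rows,
       COUNT(email)            AS non_null_emails,
       COUNT(*) - COUNT(email) AS null_emails,
       COUNT(DISTINCT email)   AS distinct_emails
FROM customer;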
We then performed integrity checks, where we checked all
primary keys and foreign keys to make sure that they are unique, not null and
pass the referential integrity test.
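The same idea extends to the integrity tests; for example, with hypothetical fact_sales and dim_customer tables like the ones sketched earlier:

-- Referential integrity: fact rows whose customer_key has no matching dimension row
SELECT f.customer_key, COUNT(*) AS orphan_rows
FROM fact_sales f
LEFT JOIN dim_customer d ON f.customer_key = d.customer_key
WHERE d.customer_key IS NULL
GROUP BY f.customer_key;

-- Uniqueness: surrogate keys that appear more than once in the dimension
SELECT customer_key, COUNT(*) AS duplicates
FROM dim_customer
GROUP BY customer_key
HAVING COUNT(*) > 1;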
Once we did our integrity tests, we performed a business rules check. Here we gather all the business rules for our data set and check whether the data conforms to the defined business rules or not.
Then we discussed the software tools that can be used to conduct data quality analysis, comparing vendors with Gartner's "Magic Quadrant" and looking at the Ataccama tool (I have included a summary of what we learned in its tutorials under the Readings section).
Master Data Management
Master Data Management is the standardization of naming and formats for attributes in the dimension and fact tables. It verifies that no business rules are being violated, and it establishes controls and governance processes for facts and dimensions. It involves checking and maintaining the data quality and integrity of the tables as they evolve and grow over time.
Data quality analysis is a subset of a larger process called
Master Data Management.
Master Data Management, if performed properly, streamlines and facilitates the ETL process. It is an important step in data warehouse development, especially for facts and dimensions. Therefore, it is important that master data management is performed properly before we feed our data to ETL processes, a mart, or a warehouse.
Week 4 | Lecture 8 Dashboard Design and Its Use For Analytics
In this lecture we looked at how to design dashboards and use them for analysis.
A dashboard is a graphical representation of the current status and historical trends of an organization's key performance indicators. Typically, it is a single-screen summary that enables viewing the performance of an organization at a glance.
A dashboard is a component of the enterprise information
system.
Components of a dashboard
A dashboard contains quantitative measures of performance and baselines against which to compare them, showing how close you are to a target. The various measures on a dashboard are compared and analyzed over time and geography, making those two important components of the dashboard.
There are various types of graphical and visual representations that we use to display our quantitative measures in a dashboard; these are called widgets. A widget can be many things: a gauge, a map, a bar graph, a text label, etc.
Important characteristics of a dashboard are that it provides a big-picture summary of the data over time, it focuses attention on the items that need it most, and it lets us quickly dig deeper and fetch the information we want for our analyses.
We discussed various guidelines for creating a dashboard: it should be simple and comprehensible, it should summarize important data as well as exceptions, it should not distract from our objective, it should enable quick diagnosis, and it should use appropriate widgets to represent our measures.
We also looked at the common pitfalls of dashboard design. Although they are pretty simple and straightforward, we often make these mistakes; let's have a quick look at them.
Our dashboard should not exceed the boundary of a single page, it should not present excessive detail, it should not express measures in an indirect way, it should not introduce variety that deviates from our objective, and it should not clutter the screen with unnecessary aesthetics and widgets.
Then, we looked at the types of dashboards:
Geographic: these dashboards display information by layering the content we want to present onto a map. They can be used by everyone for various purposes.
Real Time: these dashboards display real-time data, supporting immediate decisions. They can be very useful in industries that have to make time-critical decisions, like stock brokerages or hospitals.
Performance management: these dashboards display key performance indicators. An example of this visualization would be using a balanced scorecard to represent metrics. Typically, this type is used by management to monitor and analyze the performance of an organization.
Analytical: these dashboards are used to find anomalies in large data sets. The visualizations use color and isolation to offer more detail about the findings. Analytical dashboards are often used by data analysts and back offices.
Custom: custom dashboards are used when we have multiple sources of data that need to be visualized in a variety of ways according to the goals and needs of the users. They use different kinds of widgets and visualizations to deliver a screen that shows the required features at a glance.
To understand how to use dashboards for analysis, we watched a great TED talk by Hans Rosling (https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen?subtitle=en).
Learnings from this video:
To create meaningful visualizations of your data, you must have a clear understanding of the data and of the key performance indicators required for your objective.
To understand your data you must see the big picture, focus just on the items that need attention, and be able to quickly drill down into the additional information needed to take action, which he did throughout his presentation.
Statistically meaningful representations of your data can help you find trends and exceptions that might not be clearly visible to everyone. He showed this by debunking popular beliefs about third-world countries and their trends over time.
He included just the measures required to support his objective and nothing more. This helped him show a clear, big-picture analysis that was focused on his objective and easy to understand.
Lastly, his visualizations were simple, clear, interactive,
and animated. This kept the audience engaged and helped them understand the
trends he was talking about with ease.
Readings
DQ Analyzer Tutorial
This was Ataccama's DQ Analyzer tutorial site, where we learned:
how to understand and interpret the profiling results
This was explained by a video in which we took an example dataset and viewed the profiling results, which included broad-level statistics on the left and detailed information about the selected columns on the right. This essentially goes through the checks that we discussed in our lecture.
Advanced profiling
These are functionalities that can be found and set up in the properties of the profiling dialog boxes, which amount to a plan that we run to get an analysis of our columns, primary keys, and business rules. The results are then used to assess data quality, make a plan, and/or perform cleansing of our data.
Lastly, we look at Email Analysis.
In this tutorial we learn how to standardize email addresses that are extracted from our data source. This can be achieved by using regex matching functions and functions that conditionally separate and merge data flows.
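DQ Analyzer has its own expression language, so as an analogous, hedged sketch only, here is the same idea in SQL using PostgreSQL's regex operator (the customer table and the pattern are hypothetical):

-- Normalize case and whitespace first
UPDATE customer SET email = LOWER(TRIM(email));

-- Then flag values that still fail a simple email pattern
SELECT email
FROM customer
WHERE email !~ '^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$';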
OCDQ Blog by Jim Harris
Part two talks about the drill-down functionality for a field summary provided by the tool, where we try to answer whether customer_ID is a primary key or not.
In part three we go over the analysis of combinations of field values and their formats (email address, telephone number, and birth date).
In part four, we analyze the postal addresses and formats present in our dataset, where we found that forcing international addresses into predefined formats was causing data quality issues.
In part five, we further analyze this situation and perform a postal address validation.
In part six, we explore the account number and tax ID fields and their formats. We also looked for possible data redundancy in the same.
In part seven, we take a look at customer name fields, which have high cardinality and distinctness. We also discuss the challenges that come with customer names and their validation, since there is no standard format for a person's name.
Lastly, in part eight we conclude by listing all the analyses we performed on our data, categorized into counts and percentages, values and formats, and drill-down analysis.
Dashboard Design - Beyond Meters, Gauges and Traffic Lights
In this article we learn the fundamental challenges of dashboard design and the characteristics required for dashboards to meet these challenges. They include exceptional organization; clear, concise, and small display widgets; emphasis on summaries and exceptions; and information finely customized for the task. Then we explore the common types of dashboard information: measures over time, how they are measured, and the relationships and evaluative contexts that enhance their meaning, including relationships and predictions over multiple measures or temporal instances of a measure. Moving on, we learn about effective dashboard display widgets using the automobile dashboard metaphor. Here we explore the various types of widgets and the information they can help visualize, so that we get a better understanding of which widget to use for a particular set of data. Finally, we put it all together in a single page for a dashboard, to view our key performance indicators, statistics, and summaries at a glance.
Common Pitfalls in Dashboard Design
This article talks about common dashboard pitfalls, extending the ones we already looked at in our lecture. It lists thirteen common pitfalls in dashboard design and discusses them in detail. The pitfalls are: exceeding the boundaries of a single screen, supplying inadequate context for the data, displaying excessive detail or precision, expressing measures indirectly, choosing inappropriate display media, introducing meaningless variety, using poorly designed display media, encoding quantitative data inaccurately, arranging the data poorly, ineffectively highlighting what's important, cluttering the screen with useless decoration, misusing or overusing color, and designing an unappealing visual display. These resonate closely with what we already explored and understood from our lecture. The article explains to its readers the pitfalls to avoid and the desired characteristics of a dashboard: seeing the big picture, focusing just on the items that need attention, and the ability to quickly drill down into the additional information needed to take action.
The 5 Most Influential Data Visualizations of All Time
These include the London cholera map by John Snow, who plotted every death on a map and realized that the water supply was the source of the deadly disease.
Next, we have Gapminder by Hans Rosling (we also saw his TED talk that used this visualization software), who showed us the relationship between income and life expectancy.
Third, we have the March on Moscow by Charles Minard, who used a simple temperature line to depict the horrors of the war and how the cold ultimately defeated Napoleon's army.
Fourth, we have the war mortality diagram by Florence Nightingale, who used this visualization to convince the British army to invest in hospitals instead of guns and ammunition.
Last, we have the Chart of Biography by Joseph Priestley, who used his visualizations to explain and document the rise and fall of empires and of the unique thinkers that defined those nations.
Articles
Week 3 | Star Schema Examples
https://www.sciencedirect.com/topics/computer-science/star-schema?fbclid=IwAR2_TTcmJKBvOhDVBeT4hX4ZjbyA_WLWuDEL-rwk71njJCNLuUU5_RUZP38
This example talks about the various dimensions involved and their expected relationships. It then discusses an indirect relationship between two dimensions that are referenced independently in the fact table, and how the star schema misses the business rules that come from such inter-dimension relationships, which is depicted in a SQL query example. Then it shows how the snowflake schema tackles this issue, since it allows indirect relationships to be included, and backs this up with another SQL query example. This example really helped strengthen my understanding of both the star and snowflake schemas.
Week 4 | BI Dashboard Best Practices
https://www.datapine.com/articles/bi-dashboard-best-practices
Since week 4 primarily focused on dashboards, this article seemed like the perfect choice. It defines a BI dashboard and the various benefits that come with it. The primary benefits it covers are Trend identification, Increased efficiency, Increased accuracy, Interactive data visualization, Self-service features, Enhanced communication, Accurate forecasting, Real-time insights, and Freedom & flexibility. Moving on, the article discusses the best practices for BI dashboard design:
1. CLARIFY THE "Ws"
The first thing you need to clarify is the "Ws": why do you need to report, and to whom? List all the stakeholders involved, as well as the decision-makers and end-users.
2. KNOW YOUR DASHBOARDS
This means knowing what type of dashboard serves your objective best. Types of dashboards: Strategic, Analytical, Operational, and Tactical.
3. DEFINE THE NEEDED DATA SOURCES
A good practice here is to use the data sources that will bring the most value to your specific goal-tracking process.
4. DESIGN DATA TO AVOID CLUTTER
You should focus on concise and effective design. Effective communication should always be your number one priority.
This point discusses the six principles defined by the Gestalt school of psychology as the main principles outlining basic human interaction and order creation within the context of visual stimulation (I personally loved this section).
5. DESIGN DATA WITH A CLEAR VISUAL ORDER
Here we look at the elements of a well-ordered dashboard, engineered for visual success. These are Consistent Alignment, White Space, and Use of Contrast.
6. CHOOSE THE RIGHT DATA VISUALIZATIONS
Display data in a way that resonates with your audience on a deeper, more meaningful level. This ensures that the data is transformed into actionable insights that help drive the business forward.
7. USE INTERACTIVE FILTERS
This can be achieved through various filters like drill down, drill through, and click-to-filter.
8. PROVIDE DATA CONTEXT
Data analyses and visual representations lacking context will be limited in their utility and ROI.
9. WEB-BASED ACCESS
Web-based data sharing and analyses reach a wider audience and allow for greater collaboration with other solutions.
10. CONSIDER YOUR TARGET AUDIENCE
Considering your target audience is one of the most integral practices in BI dashboarding.
11. TELL A TALE WITH YOUR METRICS
Storytelling is a powerful best practice for any business, regardless of industry or sector. A compelling narrative will capture the attention of your audience and will also break down your findings in an easy to follow form.
12. REFINE - PERFECT - REPEAT
The digital world is in a constant state of flux, and to remain adaptable to change, it’s vital to make continual improvements.
Hi Yashree, you have summarized very well. I like your BI dashboard best practices article and the explanation of all 12 points we should follow while building a dashboard. Good comparison between the Star and Snowflake schemas.
Thank you for taking the time to read and comment on my blog, Pushpa!
I always learn better when I have clear differentiations between concepts and real-life examples to link to the learnings, and so I thought including them in my summary could turn out to be helpful for others as well :)
Yashree, I agree with Pushpa that your summary is very detailed. I like how you broke down the advantages and disadvantages of the dimension types, and the details of your dashboard design. Very well done!
Hi Adrienne, thank you for taking the time to read and comment on my blog!
My nerd self does better with additional information like differences, advantages and disadvantages as they give a context on the uses and the needs for newer models and types; that's why I included them in my summary, hoping they would help others to understand and tie the topics together in their mind better :)
Hi Yashree, I really liked the way you laid out your summary of the past few weeks' material. You were extremely thorough, great job! I couldn't agree more with the BI best practices you outlined. Stakeholder management is such an important step in this process. It is invaluable to get in front of your stakeholders early and often to ensure that your development lines up with what they are trying to achieve. There is nothing worse than being weeks into development and having to scrap views! One other point I would add is a final QA step. There are a lot of moving parts to developing dashboards, which opens up opportunity for error. It is critical to QA the final output of your dashboards to ensure they line up with raw data/queries.
Hello Yashree, I really enjoyed your summary of Data Warehouse Design. I definitely need to start referring to your blog for reviewing. As someone who typically uses PowerBI for work, I enjoyed the BI Dashboard Best Practices article. I definitely try to follow most of these points, but the one I find most prolific is number 11, "TELL A TALE WITH YOUR METRICS". Overall, data is really boring, but I do find it enjoyable when my managers make connections from charts that I made. Besides the Tableau assignment, do you have any personal experience with BI Dashboards?
Wow, very thorough summary of the topic! I like the list of dashboard best practices at the end. I feel like deciding the right visual to use is often the most difficult part, and it ties in with other best practices like knowing your audience and telling a story. I feel there are often multiple visuals that work for a situation, and a lot of it depends on the story that the creator wants to tell. But I always find myself second-guessing my choices of visuals and wondering if a different visual type would have more clearly articulated the finding that I wanted to show.