SCD (๐’๐ฅ๐จ๐ฐ๐ฅ๐ฒ ๐‚๐ก๐š๐ง๐ ๐ข๐ง๐  ๐ƒ๐ข๐ฆ๐ž๐ง๐ฌ๐ข๐จ๐ง) in hive

ยท

2 min read

Let's #hive

๐Ÿ“Œ What is SCD and w๐ก๐š๐ญ ๐š๐ซ๐ž ๐ญ๐ก๐ž ๐ญ๐ฒ๐ฉ๐ž๐ฌ ๐จ๐Ÿ #๐’๐‚๐ƒ ๐ข๐ง ๐‡๐ข๐ฏ๐ž?

โœ” In the context of Hive (a data warehousing and SQL-like query language built on top of Hadoop), SCD stands for ๐’๐ฅ๐จ๐ฐ๐ฅ๐ฒ ๐‚๐ก๐š๐ง๐ ๐ข๐ง๐  ๐ƒ๐ข๐ฆ๐ž๐ง๐ฌ๐ข๐จ๐ง.

โœ” Slowly Changing Dimensions are used to track changes to dimensional data over time, allowing you to maintain historical information and analyze data at different points in time.

There are mainly three types of Slowly Changing Dimensions (SCD) in Hive:

โœ… ๐“๐ฒ๐ฉ๐ž 1 ๐’๐‚๐ƒ (๐‡๐ข๐ฌ๐ญ๐จ๐ซ๐ข๐œ๐š๐ฅ ๐Ž๐ฏ๐ž๐ซ๐ฐ๐ซ๐ข๐ญ๐ž):

โ–ช In Type 1 SCD, when a change occurs in a dimension record, the old record is simply updated with the new data, overwriting the existing values.

โ–ช This approach does not preserve historical data, and any previous versions of the record are lost.

โ–ช Type 1 SCD is suitable when historical data is not important, and you only need the latest version of the data.

โœ… ๐“๐ฒ๐ฉ๐ž 2 ๐’๐‚๐ƒ (๐‡๐ข๐ฌ๐ญ๐จ๐ซ๐ข๐œ๐š๐ฅ ๐“๐ซ๐š๐œ๐ค๐ข๐ง๐  - ๐€๐๐ ๐๐ž๐ฐ ๐‘๐ž๐œ๐จ๐ซ๐):

โ–ช Type 2 SCD is designed to preserve the history of dimension data by adding a new record whenever a change occurs.

โ–ช This new record contains the updated data along with an additional attribute, such as a timestamp or version number, to indicate when the change occurred.

โ–ช This approach maintains historical data but can lead to a larger dimension table over time.

โœ… ๐“๐ฒ๐ฉ๐ž 3 ๐’๐‚๐ƒ (๐‡๐ข๐ฌ๐ญ๐จ๐ซ๐ข๐œ๐š๐ฅ ๐“๐ซ๐š๐œ๐ค๐ข๐ง๐  - ๐€๐๐ ๐‚๐จ๐ฅ๐ฎ๐ฆ๐ง๐ฌ):

โ–ช Type 3 SCD involves adding new columns to the existing dimension table to track changes. For example, you might have attributes like "current_value" and "previous_value" to store the current and previous versions of the attribute.

โ–ช This approach keeps the dimension table relatively compact while still allowing some historical tracking.

โ–ช However, it is limited in the number of changes it can track compared to Type 2 SCD.

โ˜‘ Each SCD type has its advantages and use cases, and the choice of SCD type depends on the specific requirements of your data analysis and reporting needs.

ย