Saturday, September 9, 2017

Everything about Facts!

Measures are the essence of a FACT table: they are numeric, can be aggregated, and store values that are of business interest. Each row present in a fact table corresponds to a measurement event.

The data on each row is at a specific level of detail, referred to as the grain (for example, one row per product sold on a sales transaction), and all measurement rows in a fact table must be at the same grain.



The measures defined above can be of the following types:

ADDITIVE: Facts/measures that can be summed across all dimensions, e.g. a sales amount.

SEMI-ADDITIVE: Facts/measures that can be added across some but not all dimensions, e.g. an account balance, which cannot be summed across the time dimension.

NON-ADDITIVE: Typically calculated measures that cannot be summed across any dimension, e.g. unit price.
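A rough T-SQL illustration of the difference, assuming hypothetical FactSales and FactAccountBalance tables:

    -- Additive: a sales amount can be summed across any dimension (product, customer, date, ...).
    SELECT ProductKey, SUM(SalesAmount) AS TotalSales
    FROM   FactSales
    GROUP  BY ProductKey;

    -- Semi-additive: summing a balance across time double-counts it; sum across
    -- accounts for a single snapshot date (or average it over time) instead.
    SELECT DateKey, SUM(AccountBalance) AS TotalBalanceOnDate
    FROM   FactAccountBalance
    GROUP  BY DateKey;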

Fact tables store surrogate keys taken from the related dimension tables, and optionally the business keys from the table that was the primary source for the fact table. The surrogate keys can together form a composite primary key for the fact table.

Fact tables have the following column types:
  • Foreign Keys
  • Measures
  • Lineage columns (optional)
  • Business Key columns from primary source tables (optional)
Fact tables tend to be deep in terms of the number of rows, but narrow in terms of the number of columns.
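A minimal sketch of this structure in T-SQL, assuming hypothetical DimDate, DimProduct and DimCustomer dimension tables and a FactSales fact table:

    CREATE TABLE FactSales (
        DateKey      INT NOT NULL,            -- foreign key to DimDate
        ProductKey   INT NOT NULL,            -- foreign key to DimProduct
        CustomerKey  INT NOT NULL,            -- foreign key to DimCustomer
        SalesOrderNo VARCHAR(20) NOT NULL,    -- business key from the primary source table (optional)
        Quantity     INT NOT NULL,            -- additive measure
        SalesAmount  DECIMAL(18, 2) NOT NULL, -- additive measure
        LoadBatchID  INT NOT NULL,            -- lineage column (optional)
        CONSTRAINT PK_FactSales
            PRIMARY KEY (DateKey, ProductKey, CustomerKey, SalesOrderNo),
        CONSTRAINT FK_FactSales_DimDate     FOREIGN KEY (DateKey)     REFERENCES DimDate (DateKey),
        CONSTRAINT FK_FactSales_DimProduct  FOREIGN KEY (ProductKey)  REFERENCES DimProduct (ProductKey),
        CONSTRAINT FK_FactSales_DimCustomer FOREIGN KEY (CustomerKey) REFERENCES DimCustomer (CustomerKey)
    );

Note that the primary key here includes a business key as well as the dimension keys, since the foreign keys alone may not uniquely identify a row (see the note below).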

Implementing Fact Tables

Fact tables are always implemented after the dimension tables, since a fact table is on the "many" side of its relationships with the dimensions. If fact tables are large, they are partitioned for easier maintenance and better performance.
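A minimal T-SQL sketch of such partitioning, assuming a hypothetical FactSales table keyed by an integer DateKey:

    -- Partition function and scheme splitting rows by month (boundary values are illustrative).
    CREATE PARTITION FUNCTION pfSalesByMonth (INT)
        AS RANGE RIGHT FOR VALUES (20160101, 20160201, 20160301);

    CREATE PARTITION SCHEME psSalesByMonth
        AS PARTITION pfSalesByMonth ALL TO ([PRIMARY]);

    -- The fact table (or its clustered index) is then created on the partition
    -- scheme, partitioned by DateKey:
    -- CREATE TABLE FactSales ( ... ) ON psSalesByMonth (DateKey);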


**It is not necessary that all the foreign keys together uniquely identify each row of a fact table.**

Thursday, August 31, 2017

Recovery Models for a Data Warehouse

Since a data warehouse is not real time, its backups can be supported using the following recovery models:



  • Full Recovery Model:
All transactions are fully logged with all the associated data, and backups are taken regularly. Full is chosen when point-in-time recovery is needed.
  • Bulk-Logged Recovery Model:
T-SQL commands are logged, but without all of the associated data. It permits high-performance bulk operations such as index creation and bulk loading of text/XML.
  • Simple Recovery Model:
SQL Server automatically reclaims log space for committed transactions, so log backups are not taken and point-in-time recovery is not possible.
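For example, a warehouse database is often switched to the bulk-logged model just for a large load and then switched back (T-SQL; the database name is hypothetical):

    -- Switch to bulk-logged for a high-volume load, then return to full recovery.
    ALTER DATABASE SalesDW SET RECOVERY BULK_LOGGED;
    -- ... run the bulk load / index rebuild here ...
    ALTER DATABASE SalesDW SET RECOVERY FULL;
    -- Back up the log afterwards so the log chain stays intact.
    BACKUP LOG SalesDW TO DISK = N'D:\Backups\SalesDW_log.bak';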


Monday, July 18, 2016

Why is Unstructured Data Complicated!


Structured systems such as ATM transactions, airline reservations, manufacturing inventory control, point-of-sale systems and many more are part of daily operational activities and grow quickly. They operate on well-defined rules and are supported by analytical systems.

Unstructured data such as emails, reports, telephone conversations and media, however, has no defined rules and no predetermined structure. Analytical systems support only structured data, and this is where the complexity arises: unstructured data differs greatly from structured data technologically, organizationally and structurally.


To include unstructured data in these analytical systems for analysis, the data must be read and integrated with the structured data. Thus reading the unstructured data is not an easy job either, because of the following complications:

  • Text on paper must be converted to electronic format through OCR (optical character recognition), but the text is often damaged or the paper too old, so the data has to be read and entered manually.
  • Voice recordings must be converted using VCR (voice character recognition), which may affect quality.
  • Reading text from formats such as .pdf, .txt, .doc, .ppt, .xls, .txt-compatible, Lotus and Outlook files may require third-party vendors, as they are more efficient and reliable.
  • Taped telephone conversations must also be converted through VCR.

Once all the data is read from the different sources, it must be integrated to prepare it for analysis.

Why do we need Data Integration?
  • It offers Simple Search
  • Alternate Names can be searched indirectly
  • Related Terms can be searched indirectly
  • Permutation of words can be searched indirectly

Wednesday, July 13, 2016

Unstructured Data Complexity


What is Unstructured Data?


Data that does not have a predetermined structure is categorized as unstructured such as- emails, telephone conversation transcripts, videos, pictures etc.


Complexity with these huge amounts of data arises because the existing analytical systems serve only structured information. To bridge the gap between the two forms of data, the following points must be given deep thought:

-      How to read the unstructured files?


Unstructured data files can be in any format, and the ability to read data from such files must be present. Formats can be pdf, txt, doc, ppt, xls, txt-compatible, Lotus or Outlook. Data may also be present in voice recordings.


Two major types of Unstructured data - BLOB and CLOB


Oracle BLOB: Binary Large Objects

A BLOB accepts objects that are large in size, such as images and audio. These are binary strings: unstructured large binary objects with a maximum length of 2,147,483,647 bytes (roughly 2 gigabytes).
A BLOB is represented as a bit stream.

In a database, tables can contain BLOB data in multiple columns, but a BLOB column cannot be the primary key of the table, nor can it be used with DISTINCT, GROUP BY, ORDER BY or joins.


Oracle CLOB: Character large Objects

A CLOB is represented as a character stream and accepts large text data elements of up to 2,147,483,647 characters, which is also the default length.

When a file containing CLOB data is read, it is read as strings.
CLOBs are used to store large documents, and systems store CLOB data in different ways: it can be stored as a reference (a pointer to the original location) or stored inside a table. CLOB columns cannot be defined as primary keys or foreign keys, or as NOT NULL.
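A small Oracle-dialect sketch, with hypothetical table and column names, showing how BLOB and CLOB columns are typically declared:

    -- Hypothetical Oracle table holding a scanned image (BLOB) and its extracted text (CLOB).
    CREATE TABLE document_store (
        doc_id    NUMBER        PRIMARY KEY,  -- LOB columns themselves cannot be the primary key
        doc_name  VARCHAR2(200),
        doc_scan  BLOB,                       -- binary stream: scans, audio, video
        doc_text  CLOB                        -- character stream: OCRed or extracted text
    );

    -- LOB columns are initialized with empty locators before the large values are written.
    INSERT INTO document_store (doc_id, doc_name, doc_scan, doc_text)
    VALUES (1, 'invoice-2016-001.pdf', EMPTY_BLOB(), EMPTY_CLOB());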

Monday, July 11, 2016

ETL Step by Step- Continuation


Data Transformation:

Involves transforming the data to agreed standards, since the data may come from many dissimilar source systems.


Basic tasks involved are-

  1. Selection: Select whole records, or parts of several records, from the source systems.
  2. Splitting/Joining: Split parts of selected records or join data from several records.
  3. Conversion: Convert the data into standardized forms and make the fields usable and easy for users to understand.
  4. Summarization: Aggregate the data, as the lowest level of detail is not required.
  5. Enrichment: Rearrange and simplify fields to make them more useful.

Most common transformation types include:
  • Format revisions
  • Decoding of fields
  • Calculated and derived fields
  • Splitting of single fields
  • Merging of information
  • Character set conversion
  • Conversion of units of measurement
  • Date/time conversion
  • Summarization
  • Key restructuring
  • Deduplication
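A minimal T-SQL sketch of a few of these transformation types (format revision, decoding of fields, a calculated field, and date/time conversion), using a hypothetical StagingSales table:

    -- Hypothetical staging query illustrating a few common transformation types.
    SELECT
        UPPER(LTRIM(RTRIM(s.product_code)))                 AS ProductCode,   -- format revision
        CASE s.gender WHEN 'M' THEN 'Male'
                      WHEN 'F' THEN 'Female'
                      ELSE 'Unknown' END                    AS GenderDesc,    -- decoding of fields
        s.unit_price * s.quantity                            AS SalesAmount,   -- calculated/derived field
        CONVERT(INT, CONVERT(CHAR(8), s.order_date, 112))    AS OrderDateKey   -- date/time conversion to a yyyymmdd key
    FROM StagingSales AS s;
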
Data Loading:


The loading process involves the creation of load images, and data can be loaded in the following modes: LOAD, APPEND, DESTRUCTIVE MERGE and CONSTRUCTIVE MERGE.

LOAD:

If the target table already contains data, the existing data is wiped out and the incoming data is loaded in its place; if the table is empty, the data is simply loaded.

APPEND:

An extension of LOAD: the existing data is preserved and the incoming data is appended to it. Depending on the rules, an incoming duplicate record may be added as a duplicate or rejected.

DESTRUCTIVE MERGE:

The incoming data is merged with the existing data. If the key of an incoming record matches an existing record, the existing record is updated; otherwise the incoming record is added.

CONSTRUCTIVE MERGE:

If the key of an incoming record matches an existing record, the existing record is left unchanged and the incoming record is added and marked as superseding the old record.
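A hedged T-SQL sketch of how these load modes are commonly realized, using hypothetical staging and target tables:

    -- LOAD: wipe out the target table and reload it from the load image.
    TRUNCATE TABLE DimCustomer;
    INSERT INTO DimCustomer (CustomerID, CustomerName)
    SELECT CustomerID, CustomerName FROM StagingCustomer;

    -- APPEND: preserve the existing rows and add the incoming ones.
    INSERT INTO FactSales (DateKey, ProductKey, SalesAmount)
    SELECT DateKey, ProductKey, SalesAmount FROM StagingSales;

    -- DESTRUCTIVE MERGE: update rows whose keys match, insert the rest.
    MERGE DimCustomer AS tgt
    USING StagingCustomer AS src
       ON tgt.CustomerID = src.CustomerID
    WHEN MATCHED THEN
        UPDATE SET tgt.CustomerName = src.CustomerName
    WHEN NOT MATCHED THEN
        INSERT (CustomerID, CustomerName) VALUES (src.CustomerID, src.CustomerName);

    -- CONSTRUCTIVE MERGE is usually implemented like a slowly changing dimension:
    -- the matching existing row is kept (marked as superseded) and the incoming
    -- row is inserted as the new current version.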



Wednesday, July 6, 2016

ETL Step by Step

ETL- Extraction, Transformation and Loading

(Data Staging)

The data staging stage, also called the ETL stage, reshapes the source data into useful information that is stored in the data warehouse. This stage extracts, cleans and integrates the data into the proper format.

The technological architecture of a Data Warehouse involves:

  • Data Acquisition
  • Data Storage
  • Information Delivery

ETL encompasses the areas of Data Acquisition and Data Storage. Thus, the basic steps involved in data staging are data extraction, data transformation and data loading.



Data Extraction:

Extraction of data from the source systems can be based on immediate data extraction or deferred data extraction. Immediate extraction captures data in real time, i.e., as the transaction occurs in the source database; it captures data through transaction logs, database triggers, or capture within the source application. Deferred data extraction, on the other hand, captures data based on date/time stamps or by comparing files.

Note:
  • Extraction methods that are easy and inexpensive to implement: transaction logs or database triggers
  • Most expensive extraction method: data capture in the source application
  • Extraction method with the least impact on the operational systems: deferred data extraction

**The file comparison method is the most time-consuming method of data extraction.**
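A minimal sketch of deferred extraction based on a date/time stamp, assuming a hypothetical EtlControl table that records the previous extract time and a LastUpdated column on the source table (T-SQL):

    -- Deferred data extraction: pull only rows changed since the previous run.
    DECLARE @LastExtract DATETIME =
        (SELECT LastExtractTime FROM EtlControl WHERE SourceTable = 'Orders');

    SELECT o.OrderID, o.CustomerID, o.OrderAmount, o.LastUpdated
    FROM   Orders AS o
    WHERE  o.LastUpdated > @LastExtract;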


Wednesday, June 29, 2016

METADATA in simple terms


Metadata can be defined as the data about the data present in the data warehouse; it can be thought of as the data dictionary of the warehouse.

The following are the different types:

1) Operational Metadata

It consists of all the information about the operational data sources.

2) Extraction & Transformation Metadata

It consists of information about the extraction of data from its sources, such as extraction frequencies, methods and business rules, and it also contains details about the transformations applied.

3) End-User Metadata

It is the navigational map of the data warehouse, as it enables end users to search the warehouse using their own business terms.

Thus, Metadata connects the parts of the warehouse together and makes it easy for end users to search.
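End-user metadata, for instance, is often kept in a simple lookup table; a hedged sketch with hypothetical names:

    -- Hypothetical end-user metadata table mapping business terms to warehouse columns.
    CREATE TABLE MetaBusinessTerm (
        BusinessTerm  VARCHAR(100) NOT NULL,  -- the term an end user searches for
        TableName     VARCHAR(128) NOT NULL,  -- where the data lives in the warehouse
        ColumnName    VARCHAR(128) NOT NULL,
        Description   VARCHAR(500) NULL,      -- plain-language definition
        SourceSystem  VARCHAR(100) NULL       -- operational metadata: where the data came from
    );

    INSERT INTO MetaBusinessTerm (BusinessTerm, TableName, ColumnName, Description, SourceSystem)
    VALUES ('Net Revenue', 'FactSales', 'SalesAmount',
            'Sales amount after discounts', 'Order entry system');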