Monday, July 18, 2016

Why is Unstructured Data Complicated?



Structured systems such as ATM transactions, airline reservations, manufacturing inventory control, point-of-sale systems, and many more are part of daily operational activity and grow quickly. They follow defined rules and are supported by analytical systems.

Unstructured data such as emails, reports, telephone conversations, and media, however, has no defined rules and no predetermined structure. Analytical systems support only structured data, and this is where the complexity arises: unstructured data differs from structured data technologically, organizationally, and structurally.


To include unstructured data in these analytical systems for analysis, the data must first be read and then integrated with the structured data. Reading unstructured data is not an easy job either, because of the following complications (a small sketch of a format-dispatching reader follows the list):

  • Text on paper must be converted to electronic format through OCR (Optical Character Recognition), but the text is often damaged or the paper too old, so the data must be read manually and re-entered.
  • Voice recordings must be converted through voice recognition, which may affect quality.
  • Reading text from formats such as .pdf, .txt, .doc, .ppt, .xls, Lotus, and Outlook files may require third-party tools, which are generally more efficient and reliable.
  • Taped telephone conversations must likewise be transcribed through voice recognition.
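
To make the reading step concrete, here is a minimal Python sketch of a reader that dispatches on file format. It is only an illustration: it assumes the third-party packages PyPDF2, Pillow, and pytesseract are installed, and the file name at the end is hypothetical.

```python
# A minimal sketch of routing unstructured sources to the right reader.
# PyPDF2, Pillow, and pytesseract are assumed installed; paths are hypothetical.
from pathlib import Path

def read_text(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        # Plain text needs no conversion.
        return Path(path).read_text(encoding="utf-8", errors="replace")
    if suffix == ".pdf":
        from PyPDF2 import PdfReader  # third-party PDF reader
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if suffix in {".png", ".jpg", ".tiff"}:
        # Scanned paper: OCR the image into text.
        from PIL import Image
        import pytesseract
        return pytesseract.image_to_string(Image.open(path))
    raise ValueError(f"No reader registered for {suffix} files")

print(read_text("scanned_contract.png"))  # hypothetical scanned document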

Once all the data is read from the different sources, it must be integrated to prepare it for analysis.

Why do we need Data Integration? (A small sketch of indirect search follows the list.)
  • It offers simple search.
  • Alternate names can be found indirectly.
  • Related terms can be found indirectly.
  • Permutations of words can be matched indirectly.
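
Here is a minimal sketch of how such indirect search might work once the text is integrated: a synonym map expands a query into alternate names and related terms, and permutations let word order vary. The vocabulary and function names are hypothetical, not from the original post.

```python
# A toy query expander: alternate names, related terms, and permutations.
from itertools import permutations

SYNONYMS = {
    "car": {"automobile", "vehicle"},     # alternate names
    "sale": {"transaction", "purchase"},  # related terms
}

def expand_query(words):
    terms = set()
    for word in words:
        terms.add(word)
        terms |= SYNONYMS.get(word, set())
    # Permutations let "car sale" also match "sale car".
    return {" ".join(p) for p in permutations(terms, 2)} | terms

print(expand_query(["car", "sale"]))
```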

Wednesday, July 13, 2016

Unstructured Data Complexity


What is Unstructured Data?


Data that does not have a predetermined structure is categorized as unstructured: emails, telephone conversation transcripts, videos, pictures, etc.


Complexity with huge amounts of this type of data arises because the existing analytical systems serve only structured information. To bridge the gap between these two forms of data, the following points must be given deep thought-

- How do we read the unstructured files?


Unstructured data files can arrive in any format, and the ability to read data from each of them must be present. Formats include .pdf, .txt, .doc, .ppt, .xls, Lotus, and Outlook files. Data may also be present in voice recordings.


Two major types of Unstructured data - BLOB and CLOB


Oracle BLOB: Binary Large Object

A BLOB holds objects that are large in size, such as images and audio. BLOBs are binary strings: unstructured large binary objects that can hold up to 2,147,483,647 bytes (about 2 GB) by default.
A BLOB is represented as a bit stream.

In a database, a table can contain BLOB data in multiple columns, but a BLOB column cannot be the primary key of the table. It cannot be used with DISTINCT, GROUP BY, ORDER BY, or in joins either.


Oracle CLOB: Character Large Object

A CLOB is represented as a character stream and holds large text data elements of up to 2,147,483,647 characters by default.

When a file containing CLOB data is read, it is read as strings.
CLOBs are used to store large documents, and systems store CLOB data in different ways: as a reference (meaning the column contains a pointer to the original location), stored inside the table, and so on. CLOB columns cannot be defined as primary keys or foreign keys, or as NOT NULL.
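
As a concrete illustration, here is a minimal sketch of reading a CLOB and a BLOB back from Oracle with the cx_Oracle driver. The driver is assumed installed, and the connection details and the docs table (doc_id, body_clob, image_blob) are hypothetical.

```python
# A minimal sketch of reading CLOB and BLOB columns via cx_Oracle.
# Credentials, DSN, and the docs table are hypothetical.
import cx_Oracle

conn = cx_Oracle.connect(user="scott", password="tiger", dsn="localhost/XEPDB1")
cur = conn.cursor()
cur.execute("SELECT body_clob, image_blob FROM docs WHERE doc_id = :1", [42])
body_lob, image_lob = cur.fetchone()

text = body_lob.read()   # a CLOB is read back as a string
raw = image_lob.read()   # a BLOB is read back as bytes (a bit stream)
print(len(text), len(raw))
```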

Monday, July 11, 2016

ETL Step by Step- Continuation


Data Transformation:

Involves transforming the data to a common standard, since the data may come from many dissimilar source systems.


Basic tasks involved are-

  1. Selection: select whole records, or parts of several records, from the source systems.
  2. Splitting/Joining: split records apart, or join data from several records together.
  3. Conversion: convert the data into standardized forms so that fields are usable and easy for users to understand.
  4. Summarization: aggregate the data, since the lowest level of detail is often not required.
  5. Enrichment: simplify and rearrange fields to make them more useful.

Most common transformation types include:
  • Format revisions
  • Decoding of fields
  • Calculated and derived fields
  • Splitting of single fields
  • Merging of information
  • Character set conversion
  • Conversion of units of measurement
  • Date/time conversion
  • Summarization
  • Key restructuring
  • Deduplication

Data Loading:


The loading process involves the creation of load images, and data can be loaded in the following modes: LOAD, APPEND, DESTRUCTIVE MERGE, and CONSTRUCTIVE MERGE.

LOAD:

If the target table already contains data, it is wiped out and the incoming data is loaded fresh.

APPEND:

The existing data in the target table is preserved, and the incoming data is added to it.

DESTRUCTIVE MERGE:

Where an incoming record matches the key of an existing record, the incoming record overwrites the old one; otherwise it is simply added.

CONSTRUCTIVE MERGE:

Where an incoming record matches the key of an existing record, the old record is kept but marked as superseded, and the incoming record is added alongside it.
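
A toy Python sketch of the four modes, with a plain dictionary standing in for the warehouse table (keys play the role of primary keys; the records are hypothetical):

```python
# The warehouse is a key -> list-of-record-versions mapping.
def load(target, incoming):
    # LOAD: wipe the target, then load the incoming data fresh.
    target.clear()
    target.update({k: [v] for k, v in incoming.items()})

def append(target, incoming):
    # APPEND: existing records are preserved; incoming rows are added on.
    for key, rec in incoming.items():
        target.setdefault(key, []).append(rec)

def destructive_merge(target, incoming):
    # DESTRUCTIVE MERGE: an incoming record overwrites a matching one.
    for key, rec in incoming.items():
        target[key] = [rec]

def constructive_merge(target, incoming):
    # CONSTRUCTIVE MERGE: the old version is kept, marked superseded,
    # and the incoming record is added alongside it.
    for key, rec in incoming.items():
        old = [o + " (superseded)" for o in target.get(key, [])]
        target[key] = old + [rec]

warehouse = {101: ["rev-1"]}
constructive_merge(warehouse, {101: "rev-2", 102: "rev-1"})
print(warehouse)  # {101: ['rev-1 (superseded)', 'rev-2'], 102: ['rev-1']}
```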



Wednesday, July 6, 2016

ETL Step by Step

ETL- Extraction, Transformation and Loading

(Data Staging)

The data staging stage, also called the ETL stage, reshapes the source data into useful information that is stored in the data warehouse. This stage extracts, cleans, and integrates the data into the proper format.

The technological architecture of a Data Warehouse involves-

  • Data Acquisition
  • Data Storage
  • Information Delivery

ETL encompasses the areas of data acquisition and data storage. Thus, the basic steps involved in data staging are the following.



Data Extraction:

Extracting data from the source systems can be based on immediate or deferred data extraction. Immediate extraction captures data in real time, i.e., as the transaction occurs in the source database, through transaction logs, database triggers, or capture within the source application. Deferred data extraction, on the other hand, captures data based on date/time stamps or by comparing files.

Note:
  • Easiest and least expensive extraction methods to implement: transaction logs and database triggers
  • Most expensive extraction method: data capture in the source application
  • Extraction method with the least impact on the operational system: deferred data extraction

**File comparison is the most time-consuming extraction method.** A sketch of timestamp-based deferred extraction follows.
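
Here is a minimal sketch of deferred extraction based on a date/time stamp, with sqlite3 standing in for the source system; the table, columns, and rows are hypothetical.

```python
# Deferred extraction: pull only rows stamped after the last ETL run.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the source database
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 10.0, '2016-06-30 09:00:00')")
conn.execute("INSERT INTO orders VALUES (2, 25.5, '2016-07-05 14:30:00')")

def extract_since(conn, last_run_ts):
    cur = conn.execute(
        "SELECT order_id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_run_ts,),
    )
    return cur.fetchall()

print(extract_since(conn, "2016-07-01 00:00:00"))  # only order 2 is extracted
```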


Data Transformation:

Involves transforming the data to a common standard, since the data may come from many dissimilar source systems.


Basic tasks involved are-

  1. Selection: select whole records, or parts of several records, from the source systems.
  2. Splitting/Joining: split records apart, or join data from several records together.
  3. Conversion: convert the data into standardized forms so that fields are usable and easy for users to understand.
  4. Summarization: aggregate the data, since the lowest level of detail is often not required.
  5. Enrichment: simplify and rearrange fields to make them more useful.

Most common transformation types include (a small sketch follows the list):
  • Format revisions
  • Decoding of fields
  • Calculated and derived fields
  • Splitting of single fields
  • Merging of information
  • Character set conversion
  • Conversion of units of measurement
  • Date/time conversion
  • Summarization
  • Key restructuring
  • Deduplication

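A small sketch showing a few of these types applied to one hypothetical source record: splitting a single field, decoding a field, a unit conversion, and a format revision. The field names and code table are illustrative only.

```python
# A toy transformation over one source record.
GENDER_CODES = {"M": "Male", "F": "Female"}  # hypothetical code table

def transform(record):
    first, last = record["name"].split(" ", 1)  # splitting a single field
    return {
        "first_name": first,
        "last_name": last,
        "gender": GENDER_CODES.get(record["gender"], "Unknown"),  # decoding
        "price": round(record["price_cents"] / 100, 2),           # unit conversion
        "order_date": record["order_date"].replace("/", "-"),     # format revision
    }

print(transform({"name": "Ada Lovelace", "gender": "F",
                 "price_cents": 1999, "order_date": "2016/07/06"}))
```
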
Data Loading:


The loading process involves the creation of load images, and data can be loaded in the following modes: LOAD, APPEND, DESTRUCTIVE MERGE, and CONSTRUCTIVE MERGE.

LOAD:

If the target table already contains data, it is wiped out and the incoming data is loaded fresh.

APPEND:

The existing data in the target table is preserved, and the incoming data is added to it.

DESTRUCTIVE MERGE:

Where an incoming record matches the key of an existing record, the incoming record overwrites the old one; otherwise it is simply added.

CONSTRUCTIVE MERGE:

Where an incoming record matches the key of an existing record, the old record is kept but marked as superseded, and the incoming record is added alongside it.




Wednesday, June 29, 2016

METADATA in simple terms


Metadata can be defined as the data about the data present in the data warehouse. It can be thought of as the data dictionary of the warehouse.

The following are the different types:

1) Operational Metadata

It consists of all the information about the operational data sources.

2) Extraction & Transformation Metadata

It consists of information about the extraction of data from its sources, such as extraction frequencies, methods, and business rules, along with details about the transformation.

3) End-User Metadata

It is the navigational map of the data warehouse, as it enables end users to search the warehouse using their own business terms.

Thus, metadata ties the parts of the warehouse together and makes it easy for end users to search.

Monday, June 27, 2016

Inmon and Kimball Approach for Data Models




Bill Inmon supports the top-down approach for data warehouse design, whereas Ralph Kimball supports bottom-up design. Let us discuss each in detail, but before that it is important to answer the following questions before starting the design of a warehouse-

  • Do we need a Top-Down/Bottom-Up approach for our warehouse?
  • Should the warehouse be enterprise wide or be departmental?
  • What do we create first- Data Mart or the Data Warehouse?
  • Should we capture dependent data marts/independent marts?

Whenever a large data warehouse is channeled into small data marts, it is a top-down approach: the central data repository feeds the local, departmental marts. When the data marts are created first and then consolidated into a repository, it is called the bottom-up approach.

A few characteristics of each-

Bill Inmon’s Top-Down Approach
  1. Has an enterprise wide view of data.
  2. Is inherently architected.
  3. Has a single, central data storage
  4. Has centralized rules and control
  5. Gives quick results if implemented with iterations
  6. Takes quite a long time to build
  7. Has a high risk of failure

Ralph Kimball’s Bottom-Up

  1. Has a fast and easy implementation of manageable pieces
  2. Has a much lower risk of failure
  3. Is inherently incremental
  4. Each data mart has its own narrow view of data
  5. Allows redundant data
  6. Can produce inconsistent data
  7. Can lead to unmanageable interfaces between marts
  8. Its most severe drawback, however, is data fragmentation


Wednesday, June 22, 2016

Building Blocks of a Warehouse



A Data Warehouse has the following major components-

Source Data: the sources from which data actually enters the warehouse; these fall into the following categories: production data, internal data, archived data, and external data.

Data Staging: once all the data is acquired, it goes through the staging area, where the data is prepared to enter the warehouse. Three major functions are involved: extraction, transformation, and loading of data.

Data Storage: storage for a warehouse is kept separate from the operational systems' storage for the following reasons-
  • Warehouse data contains volumes of historical data for analysis.
  • Data is present in structures that are suitable for analysis.

Information Delivery: as users' requirements differ, the data should be delivered in ways that support all of them. Thus, multiple types of information delivery are available, such as-

  • Ad-hoc reports
  • Complex queries
  • Multidimensional (MD) analysis
  • Statistical analysis
  • EIS feeds
  • Data mining

These areas can be grouped into three categories: data acquisition, data storage, and information delivery.

While defining the architecture of the warehouse, the following factors must be considered-
  • Consider the number and extent of the data sources.
  • How many legacy systems are you going to extract the data from?
  • What are the external sources?
  • Are you planning to include departmental files, spreadsheets, and private databases?
  • What about including the archived data?






Monday, June 20, 2016

All about STAR and SNOWFLAKE Schema


STAR Schema

Consists of one fact table surrounded by dimension tables. It is an easy structure to understand.

How does it work to give results?
A query made on this structure can easily combine results from the various dimensions linked to the fact table and return the data required to answer any user question. Each record in the fact table is related to rows in the dimension tables.

Example:
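
Since the original diagram is not reproduced here, the following minimal sketch builds a star schema in SQLite instead: hypothetical product and store dimensions around a sales fact table, with one query that combines them.

```python
# A minimal star schema: one fact table joined to two dimension tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, store_key INTEGER, units INTEGER,
                          FOREIGN KEY(product_key) REFERENCES dim_product(product_key),
                          FOREIGN KEY(store_key)   REFERENCES dim_store(store_key));
INSERT INTO dim_product VALUES (1, 'Classic Watch', 'Watches');
INSERT INTO dim_store   VALUES (1, 'Boston');
INSERT INTO fact_sales  VALUES (1, 1, 25);
""")

# One query answers "units sold per category per city" by joining the
# fact table to its dimensions.
for row in conn.execute("""
    SELECT p.category, s.city, SUM(f.units)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_store   s ON f.store_key   = s.store_key
    GROUP BY p.category, s.city"""):
    print(row)
```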



SNOWFLAKE Schema

This is the normalized version of the STAR schema: all the dimension tables here are normalized, while the fact table remains the same as in the STAR schema.

It holds the advantage of less storage space, and being a normalized structure, it is easy to update and maintain. Thus, this structure is preferred for a database that changes frequently.

Example:
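
Again in place of the original diagram, here is a sketch of the snowflaked version of the same hypothetical product dimension, with the category attribute normalized into its own table.

```python
# Snowflaking the product dimension: category moves to its own table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT,
                           category_key INTEGER REFERENCES dim_category(category_key));
-- The fact table is unchanged from the star schema.
CREATE TABLE fact_sales   (product_key INTEGER, store_key INTEGER, units INTEGER);
""")
```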





Wednesday, June 15, 2016

Initial Steps-Data Warehouse Development

Dimensional Modeling

Now that we are aware of the basic concepts of a data warehouse, it is important to understand that one of the key steps in warehouse development is understanding the requirements and presenting them in a view/model that is easy to understand for users, developers, and all other involved parties.

To provide this view, warehouses are modeled multidimensionally with the help of use cases. Data models should always be easy to understand, read, and develop.

Data warehouse requirements are expressed at different abstraction levels so that the specifications are detailed, testable, and complete for the entire team. These levels are-

  • Business requirements
  • User requirements
  • Detailed system requirements
  • Requirement attributes (functional, information, and other)

These requirements are then captured in use cases, which follow an iterative approach.

BUS Matrix

All the operational activities defined for an organization are captured in a BUS Matrix where each row defines a business process.



For instance, in a retail BUS matrix each business process is captured as a separate row, such as-

  • Retail Sales
  • Retail Inventory
  • Retail Deliveries
  • Warehouse Inventory
  • Warehouse Deliveries
  • Purchase Orders

Each process captured above corresponds to one row in the matrix; a small sketch follows.
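
One way to write such a matrix down is as a mapping from each business process to the dimensions it shares. The dimension columns below are hypothetical, since the original diagram's columns are not shown in the post.

```python
# A toy BUS matrix: process -> dimensions (columns are hypothetical).
BUS_MATRIX = {
    "Retail Sales":         {"Date", "Product", "Store"},
    "Retail Inventory":     {"Date", "Product", "Store"},
    "Retail Deliveries":    {"Date", "Product", "Store"},
    "Warehouse Inventory":  {"Date", "Product", "Warehouse"},
    "Warehouse Deliveries": {"Date", "Product", "Warehouse"},
    "Purchase Orders":      {"Date", "Product", "Vendor"},
}

# Dimensions shared by every process are candidates for conformed dimensions.
conformed = set.intersection(*BUS_MATRIX.values())
print(conformed)  # {'Date', 'Product'}
```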

Monday, June 13, 2016

OLAP in detail



OLAP is a graphical tool that provides multidimensional views of data. Raw data analyzed through OLAP becomes much easier to understand, since the tool presents users with all the information related to the data. Users can easily identify patterns or trends in the data with this tool, and no longer have to search through piles of raw data.

Its biggest advantage is that it supports predictions of an organization's future sales, costs, or profit based on current trends and patterns. Also, deep technical knowledge is not required to work with an OLAP tool; knowledge of relational databases is sufficient.

All OLAP tools basically present data to the users in the form of multidimensional cubes. Any relational table can be represented in the form of a cube.

For example, consider a table recording watch sales at different store locations. This table can be represented in the form of a cube.
Here it’s a 2 Dimensional layout.Each axis of cube is the dimension. Horizontal Dimension holds the product and vertical dimension holds the sales of the product. The above data is arranged as Products by Store but it can also be arranged as Store by products thus-

Different business questions can be answered by arranging the data differently; see the sketch below.
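
For instance, here is a small sketch of the two layouts using pandas (assumed installed), with hypothetical watch-sales numbers, since the original table is not reproduced in the post.

```python
# Pivoting the same sales records two ways: products by store, and
# store by products. The numbers are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "product": ["Classic", "Classic", "Sport", "Sport"],
    "store":   ["Boston", "Chicago", "Boston", "Chicago"],
    "units":   [25, 30, 18, 22],
})

# Products by store...
print(sales.pivot_table(values="units", index="product", columns="store"))
# ...or store by products: swap the axes to answer different questions.
print(sales.pivot_table(values="units", index="store", columns="product"))
```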

Wednesday, June 8, 2016

Basics of Data Warehousing


What is a Data Warehouse?

In simple terms, a repository that contains an organization's data pertaining to all its areas defines a warehouse. Its basic use is analysis rather than transaction processing; thus, it separates the analysis workload from the transaction workload. All the information gathered in a warehouse can be analyzed effectively using different technologies.



Be it any kind of business, a data warehouse can help in the analysis of customers and their buying behavior or patterns. Predictive analysis on the data captured in the warehouse can lead to increased profit, a well-balanced demand and supply ratio, and many more benefits.


Overview-
OnLine Analytical Processing (OLAP) helps in viewing and analyzing the historical data captured in the data warehouse.





Why do we need a Data Warehouse?

Before preparing the data model for a Warehouse, answer the following questions-

1. Why do we need the warehouse; what is its business objective?
2. What functionality should the warehouse cover?
3. What types of data are needed to support the functionality determined in the step above?
4. How large must the warehouse be?

Once all this information is gathered, the next step becomes clearer and easier, but before that it is important to understand the term OLAP.

OLAP?



OLAP is a graphical tool that provides multidimensional views of data. Raw data analyzed through OLAP becomes much easier to understand, since the tool presents users with all the information related to the data. Users can easily identify patterns or trends in the data with this tool, and no longer have to search through piles of raw data.