Monday, July 18, 2016

Why is Unstructured Data Complicated?

Structured systems such as ATM transactions, airline reservations, manufacturing inventory control, point-of-sale systems and many more are part of daily operational activities and grow quickly. Their data follows well-defined rules and is readily supported by analytical systems.

However, unstructured data such as emails, reports, telephone conversations and media has no defined rules and no predetermined structure. Analytical systems support only structured data, and this is where the complexity arises: unstructured data differs greatly from structured data technologically, organizationally and structurally.


To include unstructured data in these analytical systems for analysis, the data must first be read and then integrated with the structured data. Reading unstructured data is not an easy job either, because of the following complications-

  • Text on paper must be converted to electronic format through OCR (Optical Character Recognition); but many times the text is damaged or the paper too old, and the data must then be read and entered manually.
  • Voice recordings must be converted to text through voice recognition, which may affect quality.
  • Reading text from formats such as .pdf, .txt, .doc, .ppt, .xls, Lotus and Outlook files may require third-party tools, as they are more efficient and reliable.
  • Taped telephone conversations must likewise pass through voice recognition.

Once all the data is read from the different sources, it must be integrated to prepare it for analysis.

Why do we need Data Integration?
  • It offers simple search
  • Alternate names can be searched indirectly
  • Related terms can be searched indirectly
  • Permutations of words can be searched indirectly
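The integration benefits above can be sketched with a small index that maps alternate names and related terms onto a canonical term, so a search for any variant, in any word order, finds the same documents (a minimal illustration; the synonym table and documents are invented for the example):

```python
from itertools import permutations

# Hypothetical synonym table built during integration: alternate
# names and related terms resolve to one canonical term.
SYNONYMS = {"car": "automobile", "auto": "automobile"}

DOCUMENTS = {
    1: "the automobile was parked outside",
    2: "quarterly sales report",
}

def canonical(word: str) -> str:
    """Map an alternate name or related term to its canonical form."""
    word = word.lower()
    return SYNONYMS.get(word, word)

def search(query: str) -> set:
    """Find documents matching any word-order permutation of the
    query, after mapping alternate names to canonical terms."""
    words = [canonical(w) for w in query.split()]
    hits = set()
    for perm in permutations(words):
        phrase = " ".join(perm)
        for doc_id, text in DOCUMENTS.items():
            if phrase in text:
                hits.add(doc_id)
    return hits
```

A search for "the car" then finds document 1 indirectly, even though that document never contains the word "car".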

Wednesday, July 13, 2016

Unstructured Data Complexity


What is Unstructured Data?


Data that does not have a predetermined structure is categorized as unstructured: emails, telephone conversation transcripts, videos, pictures, etc.


Complexity with such huge amounts of data arises because the existing analytical systems serve only structured information. To bridge the gap between these two forms of data, the following points must be given deep thought-

- How to read the unstructured files?


Unstructured data files can be in any format, and the ability to read data from such files must be present. Formats include .pdf, .txt, .doc, .ppt, .xls, Lotus and Outlook files. Data may also be present in voice recordings.
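Such reading is naturally organized as a dispatch on the file's format. Below is a minimal stdlib-only sketch: plain-text formats are read directly, while binary formats are stubbed out for the third-party extractors the post mentions.

```python
import pathlib

def read_text(path: str) -> str:
    """Read text from a file, dispatching on its format.

    Only plain-text formats are handled here; binary formats
    (.pdf, .doc, .xls, ...) would be delegated to third-party
    extractor tools, which this sketch only stubs out.
    """
    suffix = pathlib.Path(path).suffix.lower()
    if suffix in {".txt", ".csv", ".log"}:
        return pathlib.Path(path).read_text(encoding="utf-8", errors="replace")
    raise NotImplementedError(f"no extractor registered for {suffix!r}")
```

A real pipeline would register one extractor per format behind the same interface, so downstream integration code never cares where the text came from.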


Two major database types for storing unstructured data - BLOB and CLOB


Oracle BLOB: Binary Large Object

A BLOB holds objects that are large in size, such as images and audio. It is an unstructured large binary object, stored as a binary string and represented as a bit stream, with a maximum length of 2,147,483,647 bytes (2 GB) in many systems.

In a database, a table can contain BLOB data in multiple columns, but a BLOB column cannot be the primary key of the table. It also cannot be used with DISTINCT, GROUP BY, ORDER BY, or in joins.


Oracle CLOB: Character Large Object

A CLOB is represented as a character stream and accepts large text data elements, up to 2,147,483,647 characters.

When a file containing CLOB data is read, it is read as strings. CLOBs are used to store large documents, and systems store CLOB data in different ways: as a reference (a pointer to the data's original location), inline within a table, and so on. CLOB columns cannot be defined as primary keys or foreign keys, and cannot be constrained NOT NULL.
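As a concrete illustration, SQLite exposes the same distinction through its BLOB and TEXT storage classes (a minimal sketch; SQLite is used here only because it ships with Python, and the table and column names are invented):

```python
import sqlite3

# SQLite's BLOB and TEXT storage classes mirror the BLOB/CLOB
# distinction: raw binary bytes vs. character data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, image BLOB, body TEXT)")

image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])  # stand-in for binary image data
large_text = "annual report ... " * 1000       # stand-in for a large document

conn.execute("INSERT INTO docs (id, image, body) VALUES (?, ?, ?)",
             (1, image_bytes, large_text))

row = conn.execute("SELECT image, body FROM docs WHERE id = 1").fetchone()
assert isinstance(row[0], bytes)  # the BLOB comes back as a byte string
assert isinstance(row[1], str)    # the CLOB-style text comes back as a string
```

Note that the primary key here is the integer `id`, not the BLOB or TEXT column, consistent with the restriction above.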

Monday, July 11, 2016

ETL Step by Step- Continuation


Data Transformation:

Involves transforming the data to common standards, since it may come from many dissimilar source systems.


Basic tasks involved are-

  1. Selection: Select either whole records or parts of several records from the source systems.
  2. Splitting/Joining: Split parts out of single records, or join data from several source records.
  3. Conversion: Convert the data into standardized forms, making the fields usable and easy for users to understand.
  4. Summarization: Aggregate the data, as the lowest level of detail is not always required.
  5. Enrichment: Rearrange and simplify fields to make them more useful.

Most common Transformation types include:
  • Format revisions
  • Decoding of Fields
  • Calculated and Derived Fields
  • Splitting of Single Fields
  • Merging of Information
  • Character Set Conversion
  • Conversion of Units of Measurement
  • Date/Time Conversion
  • Summarization
  • Key Restructuring
  • Deduplication
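Several of these transformation types can be sketched end-to-end on a single record (a toy example; the field names, codes and unit conversion are invented for illustration):

```python
# Toy source record from a hypothetical order system.
source = {"cust": "SMITH, JOHN", "amount_cents": 1999, "country_code": "US"}

# Decoding of fields: cryptic codes become readable values.
COUNTRY_NAMES = {"US": "United States", "IN": "India"}

def transform(rec: dict) -> dict:
    # Splitting of a single field: one name field becomes two.
    last, first = [p.strip() for p in rec["cust"].split(",")]
    return {
        "first_name": first.title(),                    # format revision
        "last_name": last.title(),
        "amount_usd": rec["amount_cents"] / 100,        # conversion of units
        "country": COUNTRY_NAMES[rec["country_code"]],  # decoded field
    }
```

Each output field is now in a standardized, user-friendly form, which is exactly what the conversion and enrichment tasks aim for.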
Data Loading:


The loading process involves the creation of load images, and data can be loaded in the following modes- LOAD, APPEND, DESTRUCTIVE MERGE and CONSTRUCTIVE MERGE.

LOAD:

If the target table is empty, the incoming data is simply loaded; if it already contains data, the existing data is wiped out first and the incoming data loaded fresh.


APPEND:

The existing data in the target table is preserved, and the incoming data is added to it.


DESTRUCTIVE MERGE:

The incoming data is merged with the existing data; where an incoming record's key matches an existing record, the existing record is overwritten. Records with new keys are simply added.


CONSTRUCTIVE MERGE:

Where an incoming record's key matches an existing record, the incoming record is added alongside it and the older record is marked as superseded, preserving history.


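The four load modes can be sketched by treating the target table as a mapping keyed by primary key (a toy model; for the constructive merge, each key keeps a list of record versions so superseded ones survive):

```python
def load(target: dict, incoming: dict) -> dict:
    """LOAD: wipe whatever is in the target and start fresh."""
    return dict(incoming)

def append(target: dict, incoming: dict) -> dict:
    """APPEND: keep existing rows; add only rows with new keys."""
    out = dict(target)
    for key, row in incoming.items():
        out.setdefault(key, row)
    return out

def destructive_merge(target: dict, incoming: dict) -> dict:
    """DESTRUCTIVE MERGE: incoming rows overwrite matching keys;
    new keys are added."""
    out = dict(target)
    out.update(incoming)
    return out

def constructive_merge(target: dict, incoming: dict) -> dict:
    """CONSTRUCTIVE MERGE: keep the old row and add the new one
    alongside it, so history is preserved."""
    out = {k: list(v) for k, v in target.items()}
    for key, row in incoming.items():
        out.setdefault(key, []).append(row)
    return out
```

The choice between destructive and constructive merge is really a choice about history: overwrite in place, or keep every version and mark the old ones superseded.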

Wednesday, July 6, 2016

ETL Step by Step

ETL- Extraction, Transformation and Loading

(Data Staging)

The Data Staging stage, also called the ETL Stage, reshapes source data into useful information that is stored in the Data Warehouse. This stage extracts, cleans and integrates the data into the proper format.

The technological architecture of a Data Warehouse involves-

  • Data Acquisition
  • Data Storage
  • Information Delivery

ETL encompasses the areas of Data Acquisition and Data Storage. Thus, the basic steps involved in data staging include-



Data Extraction:

Extracting the data from the source systems can be based on Immediate Data Extraction or Deferred Data Extraction. Immediate extraction captures data in real time, i.e., as the transaction occurs in the source database, through transaction logs, database triggers, or capture within the source application. Deferred data extraction, on the other hand, captures data later, based on date/time stamps or by comparing files.

Note:
  • Extraction method that is Easy and Inexpensive to implement: Transaction Logs or Database Triggers
  • Most Expensive Extraction Method: Data Capture in Source systems
  • Extraction method with least impact on the operational system: Deferred data extraction

**The file comparison method is the most time-consuming for data extraction**
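Deferred extraction by date/time stamp can be sketched as selecting only the rows changed since the previous run, then advancing a high-water mark (hypothetical rows and column names):

```python
from datetime import datetime

# Hypothetical source rows carrying a last-updated timestamp.
rows = [
    {"id": 1, "updated": datetime(2016, 7, 1)},
    {"id": 2, "updated": datetime(2016, 7, 5)},
    {"id": 3, "updated": datetime(2016, 7, 6)},
]

def extract_since(rows, last_run):
    """Deferred extraction: pick up only rows changed after the
    previous extraction run, and remember the new high-water mark."""
    changed = [r for r in rows if r["updated"] > last_run]
    new_mark = max((r["updated"] for r in changed), default=last_run)
    return changed, new_mark

changed, mark = extract_since(rows, datetime(2016, 7, 4))
print([r["id"] for r in changed])  # prints [2, 3]
```

Because it touches only a timestamp column rather than replaying logs or firing triggers, this style of deferred extraction keeps the impact on the operational system low, as the note above observes.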


Data Transformation:

Involves transforming the data to common standards, since it may come from many dissimilar source systems.


Basic tasks involved are-

  1. Selection: Select either whole records or parts of several records from the source systems.
  2. Splitting/Joining: Split parts out of single records, or join data from several source records.
  3. Conversion: Convert the data into standardized forms, making the fields usable and easy for users to understand.
  4. Summarization: Aggregate the data, as the lowest level of detail is not always required.
  5. Enrichment: Rearrange and simplify fields to make them more useful.

Most common Transformation types include:
  • Format revisions
  • Decoding of Fields
  • Calculated and Derived Fields
  • Splitting of Single Fields
  • Merging of Information
  • Character Set Conversion
  • Conversion of Units of Measurement
  • Date/Time Conversion
  • Summarization
  • Key Restructuring
  • Deduplication
Data Loading:


The loading process involves the creation of load images, and data can be loaded in the following modes- LOAD, APPEND, DESTRUCTIVE MERGE and CONSTRUCTIVE MERGE.

LOAD:

If the target table is empty, the incoming data is simply loaded; if it already contains data, the existing data is wiped out first and the incoming data loaded fresh.


APPEND:

The existing data in the target table is preserved, and the incoming data is added to it.


DESTRUCTIVE MERGE:

The incoming data is merged with the existing data; where an incoming record's key matches an existing record, the existing record is overwritten. Records with new keys are simply added.


CONSTRUCTIVE MERGE:

Where an incoming record's key matches an existing record, the incoming record is added alongside it and the older record is marked as superseded, preserving history.