Data Warehousing and Unstructured Data
As discussed so far, most enterprises build their data warehouse from the data available in internal source systems. Besides being available inside the organization, this data is structured and follows a regular format.
However, we sometimes encounter data that is useful to the organization but not available within it. This is termed external data, and it typically arrives unstructured and in an unpredictable format. Storing such external data in the data warehouse (DWH) is generally advised, because it can help business analysis and operations. If it isn't stored in the DWH, several problems arise.
External data in data warehouse
There are primarily two types of external data:
- Records gathered by an external source such as a supermarket, medicine store, or clothing store.
- Data from articles and reports available on the internet.
Problems with external data
Several issues arise with the use and storage of external data:
- Frequency of availability: External data does not appear on a fixed schedule, so its sources must be monitored constantly to make sure the appropriate data is captured in the warehouse.
- Totally undisciplined data: Since external data is unformatted and unstructured, some structuring must be applied to make it meaningful and usable in the DWH. The external data is passed through simple checks, such as a domain check, and made compatible with the internally available data.
- Unpredictability of data: External data may come from any source at any time, which makes the available data irregular and uncertain.
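The domain check mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the field names (`store_type`, `units_sold`, `sale_date`), the allowed values, and the helper functions are all assumptions made for the example.

```python
from datetime import date

# Hypothetical domain rules: each field maps to a predicate its value must satisfy.
DOMAIN_RULES = {
    "store_type": lambda v: v in {"supermarket", "medicine store", "clothing store"},
    "units_sold": lambda v: isinstance(v, int) and v >= 0,
    "sale_date": lambda v: isinstance(v, date) and v <= date.today(),
}


def domain_check(record):
    """Return the fields that are missing or fall outside their domain."""
    return [
        field
        for field, is_valid in DOMAIN_RULES.items()
        if field not in record or not is_valid(record[field])
    ]


def split_records(records):
    """Separate records that pass every domain check from those that do not."""
    accepted, rejected = [], []
    for record in records:
        failures = domain_check(record)
        if failures:
            rejected.append((record, failures))
        else:
            accepted.append(record)
    return accepted, rejected
```

Rejected records are kept together with the names of the failing fields, so they can be reviewed or repaired instead of being silently dropped.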
Metadata and External Data
Metadata in a warehouse plays a major role whenever external data is identified.
Metadata lets users determine information about the external data. Another type of data is associated with the metadata: notification data, a file that alerts users to the data they are interested in. When external data enters the data warehouse and is registered in the metadata, a check is made to see who is interested in that data. The system notifies those people, and the external data is then captured in the warehouse.
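The flow described above (register the external data in the metadata, then check the notification data for interested users) might look like the sketch below. The class names, the fields, and the idea of matching users by topic are assumptions made for illustration. The source does not say how interest is recorded.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataEntry:
    """Describes one piece of external data (hypothetical fields)."""
    data_id: str
    source: str
    topics: set
    location: str  # where the data is stored


@dataclass
class MetadataCatalog:
    entries: dict = field(default_factory=dict)
    # Notification data: user name -> topics that user wants to hear about.
    interests: dict = field(default_factory=dict)

    def subscribe(self, user, topics):
        """Record that a user is interested in the given topics."""
        self.interests.setdefault(user, set()).update(topics)

    def register(self, entry):
        """Store the metadata entry and return the users to notify."""
        self.entries[entry.data_id] = entry
        return sorted(
            user
            for user, topics in self.interests.items()
            if topics & entry.topics
        )
```

A caller would pass the returned list to whatever alerting mechanism is in use, such as email or a message queue. That step is left out here.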
Storing the External Data
Archiving External Data
Every piece of information becomes uninteresting after a certain time and is no longer worth keeping in the warehouse. External data is no exception, so its useful lifetime must be decided. We still face the dilemma of whether expired external data should be removed or archived. The usual rule is to remove it from the warehouse and store it in less expensive storage. The metadata reference to the external data is then updated to point to the new storage area, and the metadata entry itself stays in place.
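A minimal sketch of this archiving rule follows, using plain dictionaries as stand-ins for the warehouse, the cheaper archive store, and the metadata. The function name, the `loaded_on` and `location` fields, and the use of a single lifetime in days are assumptions made for the example.

```python
from datetime import date, timedelta


def archive_expired(warehouse, archive, metadata, today, lifetime_days):
    """Move external data older than its useful lifetime to the archive.

    warehouse / archive: dict mapping data_id -> payload
    metadata: dict mapping data_id -> {"loaded_on": date, "location": str}
    Returns the list of data_ids that were moved.
    """
    cutoff = today - timedelta(days=lifetime_days)
    moved = []
    for data_id in list(warehouse):  # copy keys, since we mutate the dict
        if metadata[data_id]["loaded_on"] < cutoff:
            archive[data_id] = warehouse.pop(data_id)
            # The metadata entry stays; only its storage pointer changes.
            metadata[data_id]["location"] = "archive"
            moved.append(data_id)
    return moved
```

Because the metadata entry survives the move, a user who later looks up the data can still find out that it exists and where it now lives.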
Comparison between Internal and External data
One of the most meaningful things to do with external data is to compare it with internal data over a period of time. This comparison gives business people insights that would not be possible otherwise. For instance, you can compare your own activities and trends with global trends. In most cases, the comparison is done on a common key.
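As a rough sketch of a comparison on a common key, the function below pairs internal and external figures that share the same key. Using a period label such as a quarter as the key is an assumption for the example. Any attribute present in both data sets could play that role.

```python
def compare_on_key(internal, external):
    """Pair internal and external values that share the same key.

    Both arguments map a common key (here, a period label) to a number.
    Returns {key: (internal_value, external_value, difference)},
    with keys in sorted order. Keys present on only one side are skipped.
    """
    common_keys = sorted(internal.keys() & external.keys())
    return {
        key: (internal[key], external[key], internal[key] - external[key])
        for key in common_keys
    }
```

Only keys found in both data sets are compared. In practice, the keys that fail to match are often worth inspecting too, since they show where the external data does not line up with internal records.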
Storing external data in the DWH makes relevant information from outside the company available and helps businesses run and update their enterprise processes accordingly.