Most organizations classify data as either structured or unstructured. As the name implies, structured data is organized in a way that supports rapid queries through relatively simple search methods. Unstructured data has no inherent structure (although it may be “loosely structured”) and is often difficult to search.
Structured data lends itself to easy analysis because it is organized and homogeneous. Examples include many spreadsheets and all relational databases: both are searchable by field and type, so they can present information to the user easily and quickly. The records are explicitly related to one another, and relational database management systems (RDBMSs) are optimized to answer user queries on that information.
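As a minimal sketch of why structured data is easy to query, the following uses Python's built-in sqlite3 module. The orders table, its columns, and its rows are hypothetical and were invented for this illustration only.

```python
import sqlite3

# Hypothetical table and rows, made up purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Acme", 120.0), ("Globex", 75.5), ("Acme", 30.0)],
)

# Every row shares the same typed columns, so a single declarative query
# can filter and aggregate without inspecting any free-form content.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

for customer, total in totals:
    print(customer, total)
```

The query never has to interpret what a value means; the schema already states that amount is a number and customer is a label.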
Unstructured data contains little or no identifiable structure, usually because the data itself is so varied. A commonly cited industry estimate holds that 80% of all useful business data rests in an unstructured state. Email is one example. While email messages are sometimes organized within a database, the actual content of each message is not. It is possible to organize a host of emails by sender, date, and so on, but a conventional database query cannot readily answer questions about their content.
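The split between an email's organized fields and its free-form content can be sketched with Python's standard email package. The message below is invented for illustration, and the keyword check at the end is only a crude stand-in for real content analysis.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message; the addresses and text are illustrative only.
raw = """From: alice@example.com
To: bob@example.com
Date: Mon, 01 Jan 2024 09:00:00 +0000
Subject: Q4 numbers

Hi Bob, revenue was up but the Denver office is still behind plan.
"""

msg = message_from_string(raw)

# Headers behave like structured fields: named, typed, sortable.
sender = msg["From"]
sent = parsedate_to_datetime(msg["Date"])

# The body is free text. Nothing marks "Denver" as a location or
# "behind plan" as a status, so the best a simple program can do
# is match keywords.
body = msg.get_payload()
mentions_denver = "denver" in body.lower()

print(sender, sent.year, mentions_denver)
```

Sorting or filtering on sender and sent is trivial, whereas asking which offices are behind plan would require interpreting the prose itself.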
All unstructured data can be classified as either bitmap objects or textual objects. Bitmap objects include all data not based in language, such as video, audio, and photos, while textual objects are based on written language and are typically found in word processor files and emails, among others. To be fair, the term “unstructured data” may be something of a misnomer, as much of it may actually be closer to “semi-structured data” that nonetheless does not fit easily into an RDBMS.
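To illustrate what “semi-structured” can mean in practice, here is a small sketch using JSON records. The two records and their field names are hypothetical; the point is only that each record describes itself but the records do not share one fixed set of columns.

```python
import json

# Two made-up records: one textual object, one bitmap object.
records = [
    json.loads(
        '{"type": "email", "from": "alice@example.com", "body": "See attached."}'
    ),
    json.loads(
        '{"type": "photo", "width": 800, "height": 600, "tags": ["office"]}'
    ),
]

# Fields that appear in any record versus fields common to every record.
all_fields = set().union(*(r.keys() for r in records))
shared_fields = set.intersection(*(set(r) for r in records))

print(sorted(all_fields))
print(sorted(shared_fields))
```

A single relational table covering both records would need a column for every field in all_fields, most of them empty for any given row, which is one reason such data sits awkwardly in a fixed schema.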
The challenge of mining unstructured data lies both in its potential size and in its lack of identifiable structure. RDBMSs cannot present such data in any meaningful form, so the desire to make unstructured data usable led to platforms like Hadoop and Cloudera's Hadoop-based distribution. “Big Data” and unstructured data are not synonymous terms, but Big Data is almost always unstructured. If a company such as Google or Facebook needs to analyze user browsing habits or advertising information, it uses a distributed database management system (DDBMS) to do so. A DDBMS can spread voluminous data across a network spanning thousands of computers; it can also distribute the workload of a query about that information across those same machines. Other methods can also be used to analyze unstructured data; these include Google Refine (since renamed OpenRefine), Firefox Firebug (for Flash sites), and PDF parsing in conjunction with Ruby scripting.
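The idea of spreading both the data and the query workload across machines can be sketched on a single computer. In the example below, threads stand in for cluster nodes and a word count stands in for the query; the function names, sample documents, and worker count are all assumptions made for illustration, and this is not how Hadoop or any particular DDBMS is implemented.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Map step: count words in one partition of the data."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge(partials):
    """Reduce step: combine the partial counts from every worker."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Made-up documents standing in for a much larger corpus.
documents = [
    "big data is not always unstructured",
    "unstructured data is hard to query",
    "big clusters split big workloads",
    "query the data where it lives",
]

# Spread the data across workers (round-robin partitioning).
n_workers = 2
partitions = [documents[i::n_workers] for i in range(n_workers)]

# Each worker processes only its own partition; the results are merged.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(count_words, partitions))

word_counts = merge(partials)
print(word_counts.most_common(3))
```

Because each partition is counted independently, adding workers changes only how the data is split, not the final answer. Real systems add the parts this sketch leaves out, such as moving data over a network and recovering from failed machines.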
As the world moves deeper into the Information Age, the amount of sought-after Big Data will most likely grow. Because Big Data is unstructured or at best semi-structured, businesses will continue to seek efficient methods for collecting, storing, and presenting meaningful analysis of data too big and too unfocused for traditional database management systems.