Suffering from too much data?

Last year over 9.2 trillion emails were sent. Are they still all sitting on your PC? Llewellyn Thomas of BearingPoint reveals the nine things you never knew about unstructured data storage

Fact One: Unstructured data makes up approximately 80% of your data, according to research by analysts such as Gartner and Forrester. Also called content, it is all the information not held within structured (normally relational) databases. It covers not only the most obvious examples, such as email, word-processed documents, PDFs, spreadsheets and presentations, but also images, instant messaging conversations, web pages, project files, electronic forms, audio files, video files, graphics, reports, computer output, and so on. In fact, if it is not stored in a relational database, or if it is in a relational database but stored as a blob, it is considered content.

Fact Two: The most important attribute of unstructured content is its context. Context is the surrounding information that gives content its meaning. For instance, an email that reads “I agree” is useless after the fact without the context in which it was written. The context of content is captured by metadata, or “data about the data”. Metadata types have been the subject of quite a few standards, one of the most widely known being the Dublin Core. Metadata not only puts the information in a file into context and enables more intelligent search and retrieval, it is also vital to the lifecycle of content – its retention and disposal. It is context that enables people unfamiliar with the information to know how to manage it – and when to dispose of it.
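To make the idea concrete, here is a minimal Python sketch of a metadata record for the “I agree” email. The descriptive fields are loosely modelled on Dublin Core elements (title, creator, date, type, format, description, relation); the class name, the field names, the sample values and the retention field are illustrative assumptions, and retention in particular is not part of the Dublin Core.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ContentRecord:
    """Hypothetical metadata record; not taken from any product or standard."""
    title: str
    creator: str
    created: date            # corresponds to the Dublin Core "date" element
    resource_type: str       # corresponds to Dublin Core "type"
    media_format: str        # corresponds to Dublin Core "format"
    description: str = ""
    relation: Optional[str] = None        # what the item replies or refers to
    retention_years: Optional[int] = None  # assumed lifecycle field, not Dublin Core

    def dispose_after(self) -> Optional[date]:
        """Earliest disposal date, or None if no retention period is set."""
        if self.retention_years is None:
            return None
        year = self.created.year + self.retention_years
        try:
            return self.created.replace(year=year)
        except ValueError:
            # 29 February in a non-leap target year: fall back to 28 February.
            return self.created.replace(year=year, day=28)


# Sample values are invented for illustration.
email = ContentRecord(
    title="RE: Supplier contract renewal",
    creator="j.smith@example.com",
    created=date(2005, 3, 14),
    resource_type="Text",
    media_format="message/rfc822",
    description="I agree",
    relation="Proposal to renew the supplier contract for three years",
    retention_years=7,
)
print(email.dispose_after())  # 2012-03-14
```

Without the relation and retention fields, the body text “I agree” says nothing about what was agreed or how long the message must be kept.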

Fact Three: The content in your organisation is growing at an astonishing pace. IDC estimated that over 9.2 trillion emails were sent in 2005 – email being by far the most pervasive and obvious content type. And this figure is growing year on year. If you also factor in word-processed documents, spreadsheets, images, web pages and all the other types of content, it is clear that an enormous volume of content is being created every day, with unstructured data estimated to make up about 85% of all future data growth. Some reports predict that the amount of corporate information available to knowledge workers will increase six-fold by 2010.

Fact Four: You don’t know where all your content is. Although most organisations have systems that manage some of their content, such as email management and archiving, web content management, document management, digital asset management, imaging and COLD (computer output to laser disk), coverage is not uniform: every organisation has a multitude of applications, each managing its data differently. Nor is all content managed: every organisation still maintains a shared file server. Individual functions and departments are also likely to have their own silos of content that are not actively disclosed. Then there are legacy systems from acquisitions that were never fully integrated, partial system migrations, and the local hard disks of every employee.

Fact Five: You don’t know what all your content is. Given the incredible amount of content being generated, the lack of reliable (if any) metadata, haphazardly applied disposal and retention schedules, the variety of archiving systems, the uneven enforcement of information policy (if there even is one), the multitude of storage locations, partial system migrations and, of course, ever-changing technology, it is no wonder that many organisations have no idea what content they have. In fact, this lack of awareness about what content exists and where it resides is driving the current desktop and enterprise search phenomenon. The problem can also be more complex than it first seems: does your organisation know the content of every dynamically served customer-facing web page? Would it be complex and time-consuming, or even impossible, to recreate a given page? Although this area is not yet heavily litigated, the continuing growth in e-commerce is going to make it more important.
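A first step towards knowing what content exists is a simple inventory of one storage location. The sketch below, in Python, walks a directory tree (for example a shared file server mount) and counts files and bytes per file extension. The function name and the “(none)” label for files without an extension are assumptions made for illustration; a real inventory would also need to cover email stores, local disks and application repositories.

```python
import os
from collections import Counter
from pathlib import Path


def inventory(root):
    """Count files and total bytes per file extension under root."""
    counts, sizes = Counter(), Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Normalise case so ".DOC" and ".doc" are counted together.
            ext = path.suffix.lower() or "(none)"
            try:
                size = path.stat().st_size
            except OSError:
                continue  # unreadable or vanished file: skip it
            counts[ext] += 1
            sizes[ext] += size
    return counts, sizes
```

Calling inventory on a share's root path returns two counters that can be sorted by size to show which content types dominate the share.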

Fact Six: Unmanaged unstructured data increases risk in your organisation. All businesses are aware of the increasing regulatory burden, be it FSA regulations, Basel II, Sarbanes-Oxley, MiFID or SEC regulations. However, it is not just financial markets regulation that must be considered, but also health and safety, tax and environmental regulations, to name a few. Considering all of these, or even one of them, alongside the content explosion, it is apparent that exposure to risk is ever increasing. Examples of content mismanagement abound: hundreds of millions of dollars in fines have been levied on global banks, as recently reported in the press. These fines are testament to the damage that can occur when an organisation does not know what information it holds or where. In one instance, it was the inability to disclose emails, due to the difficulty of finding and searching them, that led to record fines. Such penalties are only likely to become more common as the quantity of content continues to grow.

Fact Seven: Storage and archiving vendors only address part of this need. The main storage vendors, such as EMC, HP, Sun, HDS, NetApp and so on, all have solutions to help organisations archive and dispose of content. Archiving vendors, such as EMC, Enterprise Vault and Zantaz, build capabilities on top of the storage vendors’ products. The majority not only support compression and single-instance storage, but also helpfully include search engines. They also have useful management tools, such as hierarchical storage and time-based retention; however, these tend to ignore one of content’s most important attributes – its context. For example, if the “I agree” email mentioned above is three years old and has not been viewed once since it was archived, it would seem superfluous and suitable for deletion – but with context it may be the record of a vital business decision.
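The difference can be sketched in a few lines of Python. The first rule below decides on idle time alone, as a purely time-based retention tool might; the second consults one piece of context first. The function names, the three-year threshold and the is_business_record flag are assumptions made for illustration and do not describe any vendor’s product.

```python
from datetime import date
from typing import Optional


def idle_years(archived_on: date, last_viewed_on: Optional[date], today: date) -> float:
    """Years since the item was last touched (archival counts as a touch)."""
    last_touch = last_viewed_on or archived_on
    return (today - last_touch).days / 365.25


def time_based_disposal(archived_on: date, last_viewed_on: Optional[date],
                        today: date, max_idle_years: float = 3.0) -> bool:
    """Dispose purely on age: the rule ignores what the item is."""
    return idle_years(archived_on, last_viewed_on, today) >= max_idle_years


def context_aware_disposal(archived_on: date, last_viewed_on: Optional[date],
                           today: date, is_business_record: bool,
                           max_idle_years: float = 3.0) -> bool:
    """Check context first: a declared business record is never aged out here."""
    if is_business_record:
        return False  # left to a records retention schedule instead
    return time_based_disposal(archived_on, last_viewed_on, today, max_idle_years)


# An email archived over three years ago and never viewed since.
archived, today = date(2003, 6, 1), date(2006, 9, 1)
by_age = time_based_disposal(archived, None, today)
with_context = context_aware_disposal(archived, None, today, is_business_record=True)
```

In this example by_age is True and with_context is False: the same email is deleted under the first rule and kept under the second.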

Fact Eight: ECM vendors only address part of this need. The main Enterprise Content Management (ECM) vendors, such as EMC-Documentum, FileNet, Hummingbird, IBM, OpenText, Stellent and Vignette, all offer solutions to help address the content explosion. In contrast to the storage vendors, the ECM vendors focus on the context of the content and on process enablement. As a consequence, they all have comprehensive Records Management (RM) capabilities for the retention and disposal of content, ensuring that both are carried out in context. With the exceptions of EMC, which has combined ECM with data storage and archiving, and OpenText, which has combined it with email archiving, the ECM vendors have only basic storage and archiving technology. They attempt to offset this through alliances with the storage and archiving vendors.

Fact Nine: Fear not, all is not lost. It is comforting to know that software heavyweights such as Microsoft and Oracle are now turning their attention in this direction. Even more interesting is that they are looking at the issue not only from the content perspective, but also from the desktop and enterprise search perspective. On the technology side, metadata is going to become embedded as standard in most future content file formats; for example, Microsoft Office 2007 files will have embedded XML-based metadata, and PDF/A, the ISO standard for archival PDF files, also has embedded XML-based metadata. For existing content, Information Lifecycle Management (ILM) holds promise. Popularised by EMC, true enterprise ILM will mean full integration of storage and archival capabilities with all of the capabilities of the enterprise content management layer (including the important context-relevant retention and disposal). And finally, the intellectual, practical and technological expertise of the Enterprise Data Management discipline is being applied to the management of unstructured content. With such activity and focus in this area, there is no doubt that more advanced content storage, archival and management technologies, practices and solutions will emerge to better address the risks.

www.bearingpoint.com

 

Llewellyn Thomas of BearingPoint