Discovering information and knowledge in large volumes of data is a problem that confronts many intelligence agencies within Australia, and likely throughout the world. Specifically, the Australian Crime Commission (ACC) faced the problem of how to collate and analyze the large amounts of intelligence data it was receiving. The existing approach of ‘data cleansing’ was proving ineffectual and inefficient, and could no longer cope with the increasing volume of data the agency received. A different approach had to be found to replace the previous solution, which transformed the data into a form that could be loaded into a highly structured relational system. During that transformation, much of the data was discarded simply because it did not comply with the schema definition. The thesis will investigate existing methodologies, the ‘Schema-First’ approach, in relation to intelligence collation and analysis. It will demonstrate that the ‘Schema-Last’ approach, in combination with a ‘Big Data’ platform, can be applied when the data is retrieved and analyzed rather than when it is first ingested. It will be shown that ‘schema-last’ enables a new and novel approach to entity resolution and data fusion. Fusing data whose format, structure or quality cannot be guaranteed is now a real issue for many government and private organizations. If the structure and quality could be guaranteed there would be no issue; in the case of the ACC, however, they cannot be, as organizations are compelled by law to hand over data extracts, and data obtained this way is often not easily processed or fused with existing data sets. ‘Schema-last’ is a new approach to the way data is utilized within the intelligence life-cycle: a schema is applied to the data only when the data is required, not when it is first acquired.
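The core idea can be illustrated with a minimal sketch. This is not the thesis's implementation; the record contents, field names and `ingest`/`query` helpers are hypothetical, chosen only to show how a schema applied at read time keeps records that a schema-first loader would have discarded at ingestion.

```python
# Hypothetical illustration of the 'schema-last' idea: raw records are
# ingested verbatim, and a schema is applied only at query time.

raw_store = []  # ingestion side: an untyped store, no validation

def ingest(record: dict) -> None:
    """Accept any record without validation; nothing is discarded."""
    raw_store.append(record)

def query(schema: dict):
    """Yield views of records coerced to `schema` (field -> cast).
    Records that do not fit are skipped here, not lost: they remain
    in raw_store and may satisfy a different schema later."""
    for rec in raw_store:
        try:
            yield {field: cast(rec[field]) for field, cast in schema.items()}
        except (KeyError, ValueError):
            continue

# Messy extracts that a schema-first loader might reject outright.
ingest({"name": "J. Smith", "age": "42"})
ingest({"name": "A. Jones"})              # missing 'age'
ingest({"name": "B. Lee", "age": "n/a"})  # unparseable 'age'

person_schema = {"name": str, "age": int}
print(list(query(person_schema)))  # only the conformant view is returned
```

Note that the two non-conformant records are still present in `raw_store`; under a schema-first pipeline they would have been discarded at load time.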
In addition, it will be shown how this approach provides the foundation to explore, exploit, analyze and fuse data without losing data integrity or provenance, and how it improves on existing similar approaches. It will also be shown how the approach can be extended to an ontological representation of the data: like a schema, ontological structures can be applied when the data is analyzed rather than on ingestion. A new approach requires a new platform for storing and retrieving the data, replacing the normalized relational representation currently in use. It will be shown that the Big Data platform not only stores the large data volumes but also provides the mechanisms to support the ‘schema-last’ approach to data analytics. Other potential solutions will be discussed, along with why they prove deficient compared to ‘schema-last’. Finally, the ‘Minerva’ application, which utilizes the ‘schema-last’ approach, proved to be a successful implementation: it resolved many of the data quality issues and enabled analytics to be performed against data that could not be processed by the previous relational system. The proposed novel ‘Schema-Last’ model is validated against several possible implementations; however, further work is required on questions encountered in this research.
Date of Award: 2014
Supervisors: Sharma D (Supervisor) & Xu Huang (Supervisor)