The Inventors present a data compression algorithm that employs a small coreset of a large sensor data set to yield a highly efficient compression system. This technology can be used in any application where signal compression and simplification are desired, including services that handle video streams from wearable cameras, mobile sensor data, GPS-based location and mapping, financial data, and biological signals.
Field-deployed sensor networks are collecting massive amounts of data in real time across a wide range of applications, resulting in great demand for systems that can stream and perform computation on this high-frequency data. Methods that first store all the data from a given fielded sensor network struggle to analyze it in real time. Additionally, most analysis tools today are based on data mining algorithms that can only handle blocks of static data on the order of a few gigabytes that fit in internal memory. The Inventors' approach employs coresets, small sets that approximately represent the original data, to achieve an efficient method of data compression. Running queries or fitting models on the coreset yields results similar to those obtained on the original data set.
The Inventors use techniques from computational geometry to solve problems in data and signal compression using coresets. The input to the coreset algorithm is a constant ε > 0 and a set P of n points in R^d (representing n signals from d sensors) that can be approximated by k segments. The algorithm returns a coreset of O(k) points, whose size is independent of the original input size n, such that the Hausdorff Euclidean distance from P to any given set of points is preserved up to an additive error of ε. This compression guarantees an approximation error for any query.
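To make the idea concrete, the following is a minimal sketch of the general coreset principle described above, not the Inventors' algorithm: a signal that is well-approximated by k segments can be summarized by O(k) points (here, the fitted endpoints of k uniform segments). The uniform segmentation, the least-squares line fit, and the toy signal are all illustrative assumptions.

```python
# Minimal sketch (NOT the Inventors' algorithm): compress a signal that is
# well-approximated by k segments down to O(k) coreset points.
# Assumptions: 1-D time series, uniform segmentation, least-squares line fits.
import random

def fit_line(xs, ys):
    """Least-squares slope/intercept for one segment."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom if denom else 0.0
    return slope, my - slope * mx

def k_segment_coreset(points, k):
    """Return O(k) points: the fitted endpoints of k uniform segments."""
    n = len(points)
    coreset = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        seg = points[lo:hi]
        xs, ys = [p[0] for p in seg], [p[1] for p in seg]
        a, b = fit_line(xs, ys)
        coreset.append((xs[0], a * xs[0] + b))    # fitted left endpoint
        coreset.append((xs[-1], a * xs[-1] + b))  # fitted right endpoint
    return coreset

# Toy usage: 10,000 noisy samples of a 3-segment signal -> 6 coreset points.
random.seed(0)
signal = lambda t: t if t < 1 else (2 - t if t < 2 else t - 2)
points = [(3 * i / 10000, signal(3 * i / 10000) + random.gauss(0, 0.01))
          for i in range(10000)]
core = k_segment_coreset(points, 3)
print(len(core))  # 6 points summarize 10,000
```

Queries (for example, the distance from a candidate model to the data) can then be evaluated against the 6 coreset points instead of all 10,000 samples, at the cost of a small, bounded approximation error.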
These coresets can be constructed in parallel, as well as in a streaming setting where data points arrive one by one and it would otherwise be impossible to retain the entire data set due to memory constraints. Additionally, they can be computed in one pass over streaming data and, when combined with map-reduce techniques, yield a system capable of compressing a stream of O(n) points using only O(log n) space and update time.
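The streaming construction described above follows the classic merge-and-reduce pattern. The sketch below illustrates that pattern under stated assumptions: `reduce_fn` is a placeholder for any coreset construction (here, uniform subsampling, purely for illustration), and the bucket bookkeeping works like a binary counter, so at most O(log n) coresets are held in memory at any time.

```python
# Hedged sketch of merge-and-reduce streaming compression: a stream of n
# points is compressed on the fly while keeping only O(log n) buckets.
# `reduce_fn` stands in for a real coreset construction (illustrative only).

def reduce_fn(points, size=4):
    """Placeholder coreset step: keep `size` evenly spaced points."""
    if len(points) <= size:
        return list(points)
    step = len(points) / size
    return [points[int(i * step)] for i in range(size)]

class StreamingCoreset:
    def __init__(self, leaf_size=4):
        self.leaf_size = leaf_size
        self.buffer = []   # incoming raw points, not yet compressed
        self.levels = {}   # level -> coreset awaiting a same-level sibling

    def insert(self, p):
        self.buffer.append(p)
        if len(self.buffer) == self.leaf_size:
            self._carry(0, reduce_fn(self.buffer))
            self.buffer = []

    def _carry(self, level, coreset):
        # Like binary addition: merge equal-level coresets and recurse,
        # so only one coreset per level is ever stored.
        while level in self.levels:
            sibling = self.levels.pop(level)
            coreset = reduce_fn(sibling + coreset)
            level += 1
        self.levels[level] = coreset

    def result(self):
        merged = list(self.buffer)
        for c in self.levels.values():
            merged += c
        return reduce_fn(merged)

sc = StreamingCoreset()
for i in range(1 << 10):   # stream of 1024 points, processed one by one
    sc.insert(i)
print(len(sc.levels))      # number of buckets kept: at most log2(n)
```

Each arriving point touches at most O(log n) buckets, which is where the O(log n) update time and space bounds in the text come from; a production system would replace `reduce_fn` with the error-preserving coreset construction.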
Insertion time per new observation into the coreset and the required memory are only linear in both the dimension d of the data and the number k of segments. The approach achieves dimensionality reduction of very large-scale sparse data sets while overcoming the time and space limitations of methods that operate on the full data.