What is the problem with storing sparse two-dimensional training data (feature_vector x n_sample)? What is a space-optimal way to store such a matrix?

In the context of Machine Learning & Data Science we use the metric ‘Sparsity’ to describe the degree of ‘emptiness’ of the data within a data structure (a two-dimensional array). The term is not a reflection of how much ‘missing data’ is within the structure; rather, it applies to legitimate observations of a real response that happen to be zero.

With respect to ‘Sparsity’, a data structure falls into one of two categories: –

– ‘Dense’: all, or significantly more, elements hold a non-zero value.

– ‘Sparse’: a significant number of elements are zero, usually more than are non-zero.

It is usual to express ‘Sparsity’ as a percentage, or a decimal proportion: the ratio of zero elements to the total number of elements.
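As a minimal sketch of that calculation (the array below is a hypothetical example), the proportion can be computed with NumPy as:

```python
import numpy as np

# Hypothetical example: a small 4 x 5 array with mostly zero entries.
A = np.array([
    [0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 2],
    [0, 0, 0, 0, 0],
])

zero_count = np.count_nonzero(A == 0)   # number of zero elements
sparsity = zero_count / A.size          # proportion of zeros in the array
print(f"Sparsity: {sparsity:.2f} ({sparsity:.0%})")  # → Sparsity: 0.85 (85%)
```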

Real ‘Training Data’ often has nominal ‘Categorical Features’ with high cardinality. These require ‘One-Hot-Encoding’ before they can be used in modelling activities, and that single operation can easily change your dataset from ‘Dense’ to ‘Sparse’. If this data is stored in a conventional array it will simply contain a lot of zeros and will therefore take up more space.

For a given conventional array ‘A’, as the size of A increases, so do the memory & storage required to accommodate it. If we wish to iterate through the array, or perform other operations, processing time also increases. As data structures grow, especially at enterprise or research scale, they can become too large to manipulate on a normal PC.

If only the parts of the dataset that actually contain non-zero values are stored, i.e. a ‘Sparse Matrix’, then the memory and storage footprints are both reduced significantly. The basic concept is to remove what amount to ‘placeholder’ zero values and take advantage of the smaller data structure. There are, however, many different implementations: –
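The underlying idea can be sketched in plain Python before looking at any library: map each non-zero element to its (row, column) coordinate and simply omit the zeros. The helper names below are hypothetical, chosen only for this illustration:

```python
# Minimal sketch (hypothetical helper names): store only the non-zero
# entries of a dense 2-D list as a {(row, col): value} dictionary.

def to_sparse_dict(dense):
    """Map each non-zero element to its (row, col) coordinate."""
    return {(i, j): v
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0}

def get(sparse, i, j):
    """Missing keys are the omitted zeros."""
    return sparse.get((i, j), 0)

dense = [[0, 0, 3],
         [0, 1, 0]]
sparse = to_sparse_dict(dense)
print(sparse)             # → {(0, 2): 3, (1, 1): 1}
print(get(sparse, 0, 0))  # → 0
```

Only two of the six elements are stored; lookups of the missing keys reconstruct the zeros on demand.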

For example, the option to create a ‘Sparse Matrix’ exists in Scikit-Learn when we use One-Hot-Encoding for Categorical Features: by default the encoder returns a SciPy sparse matrix (Compressed Sparse Row, CSR) rather than a dense array. A closely related format is the ‘Dictionary Of Keys’ (DOK), where the keys are the coordinates of the non-zero elements within the matrix – and the zeros are omitted.
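A short sketch of this behaviour, using a hypothetical single categorical column (the colour values are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three distinct levels.
X = np.array([["red"], ["green"], ["blue"], ["red"]])

# OneHotEncoder returns a SciPy sparse matrix by default.
enc = OneHotEncoder()
X_enc = enc.fit_transform(X)

print(type(X_enc))      # a scipy.sparse matrix, not a dense ndarray
print(X_enc.nnz)        # only the non-zero entries are stored (one per row)
print(X_enc.toarray())  # dense view, purely for inspection
```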

Also, if you are using Pandas, a ‘Sparse Structure’ is available to you. This is not so much a true sparse matrix as a compressed array. Within this structure the zeros are omitted from the data (although other values can also be used as an alternative to zero) and the structure becomes ‘compressed’ as a result of this thrifting of values that add no useful information to the data.
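A brief sketch of the pandas side, on a small made-up Series, showing the sparse extension dtype and how little of the data it actually stores:

```python
import pandas as pd

# A Series that is mostly zeros (made-up values for illustration).
s = pd.Series([0, 0, 1, 0, 0, 2, 0, 0])

# Convert to pandas' sparse dtype; the fill value (here 0) is not stored.
s_sparse = s.astype(pd.SparseDtype("int64", fill_value=0))

print(s_sparse.dtype)            # Sparse[int64, 0]
print(s_sparse.sparse.density)   # → 0.25  (only 2 of 8 values are stored)
```

The `fill_value` parameter is what the text means by "other values can also be used": any repeated placeholder value, not just zero, can be thrifted away.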

However, it is within SciPy that the most functionality exists: this library allows you to create & manipulate seven different classes of Sparse Matrix. An explanation of each of these is beyond the remit of this document, but the fundamental concept remains the same.
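To close the loop on the memory argument, a sketch comparing a dense array against one of SciPy's seven formats (CSR here; the matrix size and density are assumptions chosen for illustration):

```python
import numpy as np
from scipy import sparse

# Hypothetical 1000 x 1000 array with roughly 1% non-zero entries.
rng = np.random.default_rng(0)
dense = np.zeros((1000, 1000))
rows = rng.integers(0, 1000, size=10_000)
cols = rng.integers(0, 1000, size=10_000)
dense[rows, cols] = 1.0

csr = sparse.csr_matrix(dense)   # one of SciPy's seven sparse classes

dense_bytes = dense.nbytes
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(dense_bytes, csr_bytes)    # the CSR copy is dramatically smaller
```

At roughly 1% density the dense array needs about 8 MB while the CSR structure needs on the order of a hundred kilobytes, which is the space saving the whole article is about.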