Compressing token-embedding matrices for language models

Pretrained language models (PLMs) like BERT, RoBERTa, and DeBERTa, when fine-tuned on task-specific data, have demonstrated exceptional performance across a diverse array of natural-language tasks, including natural-language inference, sentiment classification, and question answering.

PLMs typically comprise a token-embedding matrix, a deep neural network featuring an attention mechanism, and an output layer. The token-embedding matrix often accounts for a substantial portion of the model because of its large vocabulary table: more than 21% of BERT's model size and 31.2% of RoBERTa's, for instance. Moreover, because token frequencies vary widely, the token-embedding matrix contains a great deal of redundancy. Any technique that compresses the token-embedding matrix can therefore complement other model compression methods, yielding a higher overall compression ratio.


At this year’s Conference on Knowledge Discovery and Data Mining (KDD), my colleagues and I presented a new method for compressing token-embedding matrices for PLMs, which combines low-rank approximation, a novel residual binary autoencoder, and a fresh compression loss function.

We evaluated our method, which we call LightToken, on natural-language-understanding and question-answering tasks. Its token-embedding compression ratio — the ratio of the size of the uncompressed matrix to that of the compressed matrix — was 25, while the best-performing previous method achieved a ratio of only 5. At the same time, LightToken actually improved accuracy relative to existing matrix compression approaches, outperforming them by 11% on the GLUE benchmark and 5% on SQuAD v1.1.

Desiderata

The ideal approach to compressing token-embedding matrices for PLMs should have the following characteristics: (1) task agnosticity, to ensure effective application across diverse downstream tasks; (2) model agnosticity, allowing seamless integration as a modular component for various backbone models; (3) synergistic compatibility with other model compression techniques; and (4) a substantial compression ratio with little diminishment of model performance.
LightToken meets these criteria, generating compressed token embeddings in a manner that is independent of both specific tasks and specific models.

Rank-k SVD approximation

Singular values for three transformer-based language models. Particularly for BERT, the first few singular values dominate the rest.

Numerous prior studies have demonstrated the effectiveness of singular-value decomposition (SVD) in compressing model weight matrices. SVD decomposes a matrix into the product of three matrices, one of which is diagonal. The entries of the diagonal matrix — the singular values — indicate how much of the variance in the data each corresponding component explains. By keeping only the components with the largest singular values, it's possible to project high-dimensional data down to a lower-dimensional subspace while preserving most of its structure.

Token-embedding matrices typically have only a small number of dominant singular values; that is, they are approximately low-rank. Consequently, the first step in our approach is to use SVD to compute a rank-k approximation of the token-embedding matrix, for a small k.
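As a minimal sketch of this first step — the function name, rank, and matrix shapes are illustrative, not taken from the LightToken code — the low-rank factorization and the residual it leaves behind can be computed as follows in PyTorch:

```python
import torch

def rank_k_approximation(embedding: torch.Tensor, k: int):
    """Rank-k SVD approximation of a token-embedding matrix.

    embedding: (vocab_size, hidden_dim) token-embedding matrix.
    Returns the two low-rank factors and the residual that the
    hashing stage described below must still account for.
    """
    # Thin SVD: embedding = U @ diag(S) @ Vh
    U, S, Vh = torch.linalg.svd(embedding, full_matrices=False)

    # Keep only the k largest singular values and their directions.
    A = U[:, :k] * S[:k]     # (vocab_size, k) per-token factor
    B = Vh[:k, :]            # (k, hidden_dim) shared projection

    approximation = A @ B    # rank-k reconstruction of the embedding matrix
    residual = embedding - approximation
    return A, B, residual

# For a BERT-base-sized matrix (~30,000 x 768), rank 8 stores roughly
# 30,000*8 + 8*768 numbers instead of 30,000*768.
```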

Residual hashing

In experiments, we found that relying solely on the rank-k approximation, while it provided substantial compression, compromised performance on downstream tasks too severely. So LightToken also uses a residual binary autoencoder to encode the residual — the difference between the full token-embedding matrix and its rank-k reconstruction.

The architecture of the model that learns to produce hash codes for residual matrix values.

Autoencoders are trained to output the same values they take as inputs, but in between, they produce compressed vector representations of the inputs. We constrain those representations to be binary: they are the hash codes.

The binarization step is non-differentiable, however, so during model training, in order to use the standard gradient descent learning algorithm, we approximate the binary values with tempered sigmoid activation functions, which have a steep slope between low and high values.
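A rough sketch of such an autoencoder in PyTorch — the layer sizes, temperature value, and class name are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ResidualBinaryAutoencoder(nn.Module):
    """Autoencoder whose bottleneck is (approximately) binary.

    A tempered sigmoid with a large temperature has a steep slope around
    zero, so its outputs are pushed toward 0 or 1 while remaining
    differentiable for gradient descent.
    """
    def __init__(self, dim: int, code_bits: int, temperature: float = 10.0):
        super().__init__()
        self.encoder = nn.Linear(dim, code_bits)
        self.decoder = nn.Linear(code_bits, dim)
        self.temperature = temperature

    def encode(self, residual: torch.Tensor) -> torch.Tensor:
        # Tempered sigmoid: steep transition between low and high values.
        return torch.sigmoid(self.temperature * self.encoder(residual))

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        soft_code = self.encode(residual)
        return self.decoder(soft_code)

    @torch.no_grad()
    def hash_code(self, residual: torch.Tensor) -> torch.Tensor:
        # At compression time, the soft code is thresholded into hard bits.
        return (self.encode(residual) > 0.5).to(torch.uint8)
```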


Over multiple training epochs, however, tempered sigmoids have a tendency to saturate: every node in the network fires on every input, preventing the network from learning nonlinear functions. So we devised a novel training strategy: the autoencoder is trained to a point short of saturation, and a new autoencoder is then trained on the next residual — the difference between the previous residual and its reconstruction from the previous hash code. The final hash code is in fact a concatenation of hash codes, each of which encodes the residual left by the previous code.
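A hedged sketch of that training strategy, reusing the illustrative ResidualBinaryAutoencoder class above; the number of rounds, epoch budget, and stand-in stopping criterion are placeholders rather than the paper's actual settings:

```python
import torch
import torch.nn.functional as F

def train_residual_hashing(residual, num_rounds=4, code_bits=64, epochs=5):
    """Train a chain of binary autoencoders; each round hashes the part of
    the residual that the previous rounds failed to reconstruct, and the
    final hash code is the concatenation of the per-round codes."""
    dim = residual.shape[-1]
    codes, current = [], residual
    for _ in range(num_rounds):
        ae = ResidualBinaryAutoencoder(dim, code_bits)
        optimizer = torch.optim.Adam(ae.parameters(), lr=1e-3)
        # Train only briefly, stopping short of sigmoid saturation
        # (a fixed epoch budget stands in for the real criterion).
        for _ in range(epochs):
            optimizer.zero_grad()
            F.mse_loss(ae(current), current).backward()
            optimizer.step()
        with torch.no_grad():
            codes.append(ae.hash_code(current))   # bits for this round
            current = current - ae(current)       # what is still unexplained
    return torch.cat(codes, dim=-1)               # concatenated hash code
```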

Loss function

Typically, a model like ours would be trained to minimize the Euclidean distance between the reconstructed token-embedding matrix and the uncompressed matrix. But we find that Euclidean distance yields poor performance on some natural-language-processing (NLP) tasks and on tasks with small training sets. We hypothesize that this is because Euclidean distance pays inadequate attention to the angle between vectors in the embedding space, which on NLP tasks can carry semantic information.

So we propose a new reconstruction loss that upper-bounds the Euclidean distance but is built around a recalibrated cosine similarity, encouraging the model to prioritize angular alignment between the original and reconstructed embeddings.
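The exact formulation appears in our KDD paper; purely as an illustration of how a cosine-centered loss can upper-bound the squared Euclidean distance, one can combine an angular term with a norm-matching term, as in this sketch:

```python
import torch
import torch.nn.functional as F

def angular_reconstruction_loss(original, reconstructed, eps=1e-8):
    """Illustrative cosine-centered loss (not the paper's exact formula)
    that upper-bounds the squared Euclidean distance.

    Writing a = ||a|| * a_hat, b = ||b|| * b_hat and using the triangle
    inequality plus (x + y)^2 <= 2x^2 + 2y^2 gives
        ||a - b||^2 <= 4 * ||a||^2 * (1 - cos(a, b)) + 2 * (||a|| - ||b||)^2,
    so minimizing the right-hand side emphasizes angular alignment while
    still controlling the Euclidean reconstruction error.
    """
    a_norm = original.norm(dim=-1)
    b_norm = reconstructed.norm(dim=-1)
    cos = F.cosine_similarity(original, reconstructed, dim=-1, eps=eps)
    bound = 4 * a_norm.pow(2) * (1 - cos) + 2 * (a_norm - b_norm).pow(2)
    return bound.mean()
```

Driving this bound toward zero forces both the angles and the norms of the original and reconstructed embeddings to match, which is the behavior the loss is meant to encourage.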

The full, four-stage LightToken framework.

We carried out comprehensive experiments on two benchmark datasets, GLUE and SQuAD v1.1. The results show that LightToken clearly outperforms the established baselines: it achieves a 25-fold compression ratio with no loss of accuracy, and even when the compression ratio rises to 103, the accuracy drop stays within 6% of the uncompressed baseline.


