Integrated Gradients and Attention Visualization for a Cost-Efficient Adaptation Method #

By Dmitrii Kuzmin and Yaroslava Bryukhanova

Repo link: https://github.com/1kkiRen/xAI

Getting Started #

Explainable AI methods help us understand what drives a model’s decisions. Here, we compare three approaches, Integrated Gradients (IG), attention visualization, and embedding influence analysis, on two transformer-based language models:

meta-llama/Llama-3.2-1B-Instruct
ikkiren/research_TS

Our aim is to see which input tokens influence the models most and how attention patterns reflect their focus.

Models at a Glance #

meta-llama/Llama-3.2-1B-Instruct: A 1-billion-parameter Llama model tuned for conversational tasks.
ikkiren/research_TS: A custom model fine-tuned for Russian text, with special handling for morphology and syntax.

Methods Overview #

1. Integrated Gradients #

Integrated Gradients calculates attribution scores by integrating gradients from a baseline (zero) embedding up to the actual input embedding. We used 50 steps and tracked the convergence delta for each model.

IG provides a principled way to assign credit to each input token for the model’s prediction. Specifically:

We approximate the integral over the embedding path by sampling 50 points between the baseline and the input.
For each step, we compute the gradient of the target logit with respect to the interpolated embedding.
Attributions are obtained by averaging the gradients over the 50 steps and multiplying them element-wise by the difference between the input embedding and the baseline.
We sum attributions across embedding dimensions to get a single score per token.
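The steps above can be sketched on a toy problem. This is a minimal illustration, not the notebook code: `toy_logit` is a hypothetical stand-in for the transformer's target logit, its gradient is written out by hand, and the midpoint sampling of the path is an assumption (the text only states that 50 points were used).

```python
import numpy as np

def toy_logit(emb, w):
    """Hypothetical nonlinear 'model': one scalar logit from a (tokens, dim) embedding matrix."""
    return np.tanh(emb @ w).sum()

def toy_logit_grad(emb, w):
    """Analytic gradient of toy_logit with respect to emb, shape (tokens, dim)."""
    return (1.0 - np.tanh(emb @ w) ** 2)[:, None] * w[None, :]

def integrated_gradients(emb, w, steps=50):
    baseline = np.zeros_like(emb)                  # zero-embedding baseline
    alphas = (np.arange(steps) + 0.5) / steps      # midpoints of `steps` equal sub-intervals (assumed rule)
    grad_sum = np.zeros_like(emb)
    for a in alphas:
        point = baseline + a * (emb - baseline)    # interpolated embedding
        grad_sum += toy_logit_grad(point, w)       # gradient of the logit at that point
    avg_grad = grad_sum / steps                    # Riemann approximation of the path integral
    attr = (emb - baseline) * avg_grad             # scale by the difference from the baseline
    token_scores = attr.sum(axis=1)                # one score per token
    # Convergence delta: summed attributions minus the actual change in the logit.
    delta = token_scores.sum() - (toy_logit(emb, w) - toy_logit(baseline, w))
    return token_scores, delta

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))   # 4 hypothetical tokens with 8-dimensional embeddings
w = rng.normal(size=8)
scores, delta = integrated_gradients(emb, w)
print(scores, abs(delta))
```

On a smooth toy function like this one the delta is close to zero; a large delta on a real model indicates that the chosen number of steps does not approximate the integral well.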

2. Attention Visualization #

We pulled attention weights from layer 0, head 0 of each model and created heatmaps to show token-level focus during prediction.

Attention weights indicate how much each token attends to every other token in a given head and layer.

Extracted raw attention tensors directly from the model’s forward pass (no gradient tracking needed).
Selected layer 0, head 0 as a representative case to illustrate global versus local focus.
Visualized weights as a heatmap where rows correspond to queries and columns to keys.
Normalized attention scores to sum to one per query token for clearer interpretation.
Implementation: Used Matplotlib for heatmap plotting, with tokens labeled along the axes.
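To make the query-by-key layout concrete, here is a small sketch of how causal attention weights arise from raw scores. The query and key matrices are random placeholders rather than values taken from either model, and `causal_attention` is a hypothetical helper.

```python
import numpy as np

def causal_attention(q, k):
    """Softmax attention weights with a causal mask; rows are queries, columns are keys."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal (future keys)
    scores = np.where(future, -np.inf, scores)                # a query cannot see later tokens
    scores = scores - scores.max(axis=-1, keepdims=True)      # subtract row max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)      # each query row sums to one

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 16))   # 5 placeholder tokens, head dimension 16
k = rng.normal(size=(5, 16))
attn = causal_attention(q, k)
print(attn.round(2))
```

A matrix like `attn` can be passed to Matplotlib's `imshow` with the tokens as tick labels on both axes to obtain a heatmap of the kind described above.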

3. Embedding Influence Analysis #

Ablation: We zeroed each token’s embedding and noted changes in the output logit. A big drop means the token is important.
Sensitivity: We added small Gaussian noise to each embedding, repeated the test, and measured the average output change. Higher variance indicates greater sensitivity.

To quantify embedding importance beyond gradients:

For ablation, we replaced a single token’s embedding with a zero vector, ran a forward pass, and recorded the delta in the target logit.
For sensitivity, we sampled Gaussian noise with σ=0.1 added to one embedding at a time, repeated for 10 random draws, and computed the standard deviation of the resulting logits.
Ablation captures direct contribution, while sensitivity captures how robust the prediction is to small changes in each embedding.
Both measures complement IG by revealing inhibitory versus supportive roles and points of instability.
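Both procedures can be sketched in a few lines. The example below uses a hypothetical `toy_logit` in place of a real forward pass, and reports ablation as "ablated logit minus original logit", which is an assumed sign convention: under it, a negative value means the token supported the prediction.

```python
import numpy as np

def toy_logit(emb, w):
    """Hypothetical stand-in for the model's target logit."""
    return float(np.tanh(emb @ w).sum())

def ablation_deltas(emb, w):
    """Change in the logit when one token's embedding is replaced by zeros."""
    base = toy_logit(emb, w)
    deltas = []
    for i in range(emb.shape[0]):
        ablated = emb.copy()
        ablated[i] = 0.0                           # zero out a single token
        deltas.append(toy_logit(ablated, w) - base)
    return np.array(deltas)

def sensitivity_std(emb, w, sigma=0.1, draws=10, seed=0):
    """Std of the logit under Gaussian noise added to one token at a time."""
    rng = np.random.default_rng(seed)
    stds = []
    for i in range(emb.shape[0]):
        logits = []
        for _ in range(draws):
            noisy = emb.copy()
            noisy[i] += rng.normal(scale=sigma, size=emb.shape[1])
            logits.append(toy_logit(noisy, w))
        stds.append(np.std(logits))
    return np.array(stds)

rng = np.random.default_rng(1)
emb = rng.normal(size=(4, 8))   # 4 hypothetical tokens with 8-dimensional embeddings
w = rng.normal(size=8)
abl = ablation_deltas(emb, w)
sens = sensitivity_std(emb, w)
print(abl, sens)
```

With only 10 draws per token the standard deviation is itself a noisy estimate, so small differences between tokens should not be over-interpreted.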

Experimental Setup #

Data & Preprocessing:
1. Trim whitespace and normalize Unicode.
2. Tokenize with each model’s tokenizer.
3. Pad or truncate to 128 tokens.
Implementation: Python notebooks using Hugging Face Transformers and custom XAI utilities.
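The three preprocessing steps can be illustrated with the standard library alone. Several details here are assumptions for the sake of a runnable sketch: the normalization form (NFC), the whitespace tokenizer standing in for each model's own tokenizer, and the `<pad>` symbol.

```python
import unicodedata

MAX_LEN = 128
PAD = "<pad>"   # hypothetical padding symbol

def preprocess(text, tokenize=str.split, max_len=MAX_LEN, pad=PAD):
    """Trim, normalize Unicode, tokenize, then pad or truncate to max_len tokens."""
    text = unicodedata.normalize("NFC", text.strip())   # NFC is an assumed choice of form
    tokens = tokenize(text)[:max_len]                   # truncate if too long
    return tokens + [pad] * (max_len - len(tokens))     # pad if too short

tokens = preprocess("  Сколько планет в Солнечной системе?  ")
print(tokens[:6], len(tokens))
```

In the real pipeline the `tokenize` argument would be replaced by the tokenizer that ships with each model, which also handles padding and truncation itself.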

Results #

Integrated Gradients #

Figure: IG attributions for meta-llama/Llama-3.2-1B-Instruct

Figure: IG attributions for ikkiren/research_TS

Raw Scores & Deltas #

meta-llama/Llama-3.2-1B-Instruct

{
  "<|begin_of_text|>": -468275.8125,
  "С": 110818.0,
  "к": -88699.765625,
  "олько": -219594.265625,
  " план": -232745.015625,
  "ет": -29436.662109375,
  " в": -34299.03125,
  " С": 20781.52734375,
  "олн": -84084.0390625,
  "еч": -124966.0234375,
  "ной": -48963.0,
  " систем": 40694.140625,
  "е": -44074.41015625,
  "?": 126299.6328125
}

Absolute convergence delta: 1 048 421.3125

ikkiren/research_TS

{
  "<|begin_of_text|>": -337049.46875,
  "С": 352206.6875,
  "к": 61753.21484375,
  "олько": 74791.2890625,
  " план": 125668.390625,
  "ет": -31054.927734375,
  " в": 24036.47265625,
  " С": 38717.6484375,
  "олн": 36866.60546875,
  "еч": 66281.859375,
  "ной": 6265.0625,
  " систем": 66483.28125,
  "е": -36235.12890625,
  "?": 32296.85546875
}

Absolute convergence delta: 620 490.0625

What the Numbers Tell Us #

Both models give a large negative IG to “<|begin_of_text|>”.
meta-llama: “?” and the first “С” are top positive contributors, while most other subwords have strong negative scores.
research_TS: The first “С” and " план" score highest, and negative attributions are smaller in magnitude.
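The convergence delta measures how far the summed attributions are from the difference between the model’s output at the input and at the baseline. A high |Δ| therefore means the 50-step approximation has not converged for these nonlinear models, so the individual scores should be read as rough indications rather than exact contributions.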
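To put the deltas in proportion, the short script below compares each reported delta with the attribution scores listed above. The numbers are copied from the results; `delta_ratios` is a hypothetical helper, and the two ratios are just one possible way to gauge scale.

```python
# Scores and deltas copied from the reported results above.
reported = {
    "meta-llama/Llama-3.2-1B-Instruct": {
        "delta": 1048421.3125,
        "scores": [-468275.8125, 110818.0, -88699.765625, -219594.265625,
                   -232745.015625, -29436.662109375, -34299.03125, 20781.52734375,
                   -84084.0390625, -124966.0234375, -48963.0, 40694.140625,
                   -44074.41015625, 126299.6328125],
    },
    "ikkiren/research_TS": {
        "delta": 620490.0625,
        "scores": [-337049.46875, 352206.6875, 61753.21484375, 74791.2890625,
                   125668.390625, -31054.927734375, 24036.47265625, 38717.6484375,
                   36866.60546875, 66281.859375, 6265.0625, 66483.28125,
                   -36235.12890625, 32296.85546875],
    },
}

def delta_ratios(entry):
    """Delta relative to the largest single score and to the sum of absolute scores."""
    largest = max(abs(s) for s in entry["scores"])
    total_abs = sum(abs(s) for s in entry["scores"])
    return entry["delta"] / largest, entry["delta"] / total_abs

ratios = {name: delta_ratios(entry) for name, entry in reported.items()}
for name, (vs_largest, vs_total) in ratios.items():
    print(f"{name}: delta / largest |score| = {vs_largest:.2f}, "
          f"delta / sum |score| = {vs_total:.2f}")
```

For both models the delta exceeds the largest single-token attribution, which is why the per-token scores are best treated as qualitative.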

Attention Visualization #

Figure: attention heatmap, layer 0, head 0 for meta-llama

Figure: attention heatmap, layer 0, head 0 for research_TS

Key Observations #

The BOS token (<|begin_of_text|>) strongly attends to itself, as expected under causal masking, where the first position has no earlier tokens to attend to.
Most other tokens send the bulk of their attention to the BOS marker instead of neighbors.
Diagonal (self-attention) values are low (0.05–0.15), suggesting this head acts as a global context gatherer.
This pattern hints that layer 0, head 0 focuses on overall context rather than local dependencies.
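Observations like these can be turned into numbers by summarizing an attention matrix directly. The 4-token matrix below is invented purely for illustration and is not taken from either model; `summarize_head` is a hypothetical helper.

```python
import numpy as np

def summarize_head(attn):
    """attn: (seq, seq) attention weights, rows = queries, columns = keys, BOS at index 0."""
    return {
        "bos_self": float(attn[0, 0]),                      # BOS attending to itself
        "mean_to_bos": float(attn[1:, 0].mean()),           # other tokens attending to BOS
        "mean_diagonal": float(np.diag(attn)[1:].mean()),   # self-attention of non-BOS tokens
    }

# Invented example values; each row sums to one and respects causal masking.
attn = np.array([
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.80, 0.08, 0.12, 0.00],
    [0.75, 0.07, 0.08, 0.10],
])
stats = summarize_head(attn)
print(stats)
```

Computing the same three statistics for every layer and head would give a compact way to check whether the pattern is specific to layer 0, head 0.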

Embedding Influence #

meta-llama/Llama-3.2-1B-Instruct #

Figure: ablation and sensitivity plots for meta-llama/Llama-3.2-1B-Instruct

Ablation: Zeroing “?” boosts the logit (+408 385), so “?” actually inhibits. “олько” is vital (+114 497).
Sensitivity: The first “С” and “ет” show the highest variance under noise, indicating they’re key for output stability.

ikkiren/research_TS #

Figure: ablation and sensitivity plots for ikkiren/research_TS

Ablation: “олько” is crucial (+145 365), while “?” remains inhibitory (−395 674).
Sensitivity: The BOS token and “?” cause the biggest output swings.

Takeaways #

Both models: “?” inhibits, “олько” supports.
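IG and ablation highlight many of the same tokens, but their signs do not always match: IG scores “?” positively in both models, whereas ablation marks it as inhibitory. Sensitivity highlights instability points.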
Grouping subwords into full words (e.g., “Сколько”) may simplify interpretation.
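One way to do this grouping is sketched below, using the meta-llama IG scores reported earlier. The rules are assumptions: a token that starts with a space opens a new word, special tokens and pure punctuation stay on their own, and a word's score is the sum of its subword scores (other aggregations, such as the mean, are equally possible).

```python
def group_subwords(tokens, scores):
    """Merge subword tokens into words and sum their scores (assumed aggregation)."""
    groups = []        # list of [word, total_score]
    start_new = True   # whether the next token must open a new group
    for tok, score in zip(tokens, scores):
        # Special tokens and punctuation-only tokens are kept as separate groups.
        standalone = tok.startswith("<|") or not any(ch.isalnum() for ch in tok)
        if start_new or standalone or tok.startswith(" "):
            groups.append([tok.strip(), score])
        else:
            groups[-1][0] += tok
            groups[-1][1] += score
        start_new = standalone
    return [(word, total) for word, total in groups]

# Tokens and IG scores copied from the meta-llama results above.
tokens = ["<|begin_of_text|>", "С", "к", "олько", " план", "ет", " в",
          " С", "олн", "еч", "ной", " систем", "е", "?"]
scores = [-468275.8125, 110818.0, -88699.765625, -219594.265625,
          -232745.015625, -29436.662109375, -34299.03125, 20781.52734375,
          -84084.0390625, -124966.0234375, -48963.0, 40694.140625,
          -44074.41015625, 126299.6328125]
words = group_subwords(tokens, scores)
for word, total in words:
    print(f"{word}: {total:.2f}")
```

Note that summing can hide disagreement inside a word: in “Сколько”, the positive score of the first “С” is outweighed by the negative scores of the following pieces.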

Discussion #

IG gives us clear numbers for token attributions, while attention maps show where the model looks. meta-llama has sharper peaks in attribution and attention; research_TS spreads importance more evenly.

Limitations #

We only checked layer 0, head 0 for attention.
The IG baseline is zero embeddings; other baselines might change insights.
Analysis is limited to one example.

Next Steps #

Explore attention across all layers and heads.
Try other attribution methods (SHAP, LIME).
Test IG with different baselines and on diverse text samples.

Conclusion #

Combining IG and attention visualization gives a richer view of transformer behavior. This helps researchers and developers understand, debug, and refine their models.