[Paper Notes] Similarity of Neural Network Representations Revisited (ICML 2019)
Published: 2019-04-29


Title: Similarity of Neural Network Representations Revisited (ICML 2019)

Author: Simon Kornblith ... (Long Beach, California)



Aim:

  • one can first measure the similarity between every pair of examples in each representation separately, and then compare the similarity structures.

 

The invariance properties of similarity indexes are divided into three aspects (a small numerical check covering all three follows the definitions below):

1. Invariance to Invertible Linear Transformation

Definition: A similarity index is invariant to invertible linear transformation if s(X, Y) = s(XA, YB) for any full-rank A and B.

Key sentence:

  • We demonstrate that early layers, but not later layers, learn similar representations on different datasets.
  • Invariance to invertible linear transformation implies that the scale of directions in activation space is irrelevant.
  • Neural networks trained from different random initializations develop representations with similar large principal components; consequently, similarity measures between networks that are based on those dominant components (e.g. Euclidean distances between examples) agree across differently initialized networks. A similarity index that is invariant to invertible linear transformation ignores this aspect of the representation, and assigns the same score to networks that match only in large principal components as to networks that match only in small principal components.

 2. Invariance to Orthogonal Transformation

Definition: s(X, Y) = s(XU, YV) for full-rank orthonormal matrices U and V such that U^T U = I and V^T V = I.

Key sentence:
  • orthogonal transformations preserve scalar products and Euclidean distances between examples.
  • Invariance to orthogonal transformation implies invariance to permutation, which is needed to accommodate symmetries of neural networks

3. Invariance to Isotropic Scaling

Definition: s(X, Y) = s(αX, βY) for any α, β ∈ R+.

Key sentence:
  • This follows from the existence of the singular value decomposition of the transformation matrix
  • we are interested in similarity indexes that are invariant to isotropic but not necessarily non-isotropic scaling
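
A minimal numpy sketch (my own illustration, not from the paper) that probes a candidate similarity index s(X, Y) against the three properties above using random invertible, orthogonal, and isotropic-scaling transformations; the function name, sample sizes, and tolerance are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def check_invariances(s, n=50, p=10, tol=1e-6):
    """Empirically probe the three invariance properties of a similarity index s."""
    X = rng.standard_normal((n, p))
    Y = rng.standard_normal((n, p))
    A = rng.standard_normal((p, p))                    # full rank with probability 1
    B = rng.standard_normal((p, p))
    U, _ = np.linalg.qr(rng.standard_normal((p, p)))   # random orthonormal matrices
    V, _ = np.linalg.qr(rng.standard_normal((p, p)))
    base = s(X, Y)
    return {
        "invertible linear": abs(s(X @ A, Y @ B) - base) < tol,
        "orthogonal":        abs(s(X @ U, Y @ V) - base) < tol,
        "isotropic scaling": abs(s(2.0 * X, 0.5 * Y) - base) < tol,
    }
```

Applied to linear CKA (sketched further below), such a check should report invariance to orthogonal transformation and isotropic scaling but not to general invertible linear transformation, matching the paper's characterization.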

 

Comparing Similarity Structures

If we use an inner product to measure similarity, the similarity between representational similarity matrices reduces to another intuitive notion of pairwise feature similarity.

1. Dot Product-Based Similarity.
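
The underlying identity, which holds for column-centered X and Y, is ⟨vec(XX^T), vec(YY^T)⟩ = ||Y^T X||^2_F; a small numpy check (my own sketch, variable names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 50, 10, 12                 # n examples, two layers of width p1 and p2
X = rng.standard_normal((n, p1))
Y = rng.standard_normal((n, p2))
X -= X.mean(axis=0)                    # column-center, as the paper assumes
Y -= Y.mean(axis=0)

lhs = np.dot((X @ X.T).ravel(), (Y @ Y.T).ravel())   # <vec(XX^T), vec(YY^T)>
rhs = np.linalg.norm(Y.T @ X, ord='fro') ** 2        # ||Y^T X||_F^2
assert np.isclose(lhs, rhs)
```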

2. Hilbert-Schmidt Independence Criterion.

  • HSIC can be used as a test statistic for determining whether two sets of variables are independent, but it is not an estimator of mutual information.
  • HSIC is not invariant to isotropic scaling, but it can be made invariant through normalization.
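
A minimal sketch of the empirical HSIC estimator the paper builds on, HSIC(K, L) = tr(KHLH) / (n − 1)^2 with centering matrix H = I − (1/n)11^T (the function name is mine):

```python
import numpy as np

def hsic(K, L):
    """Empirical HSIC between two n x n Gram (kernel) matrices K and L."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```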

3. Centered Kernel Alignment.

  • Kernel Selection. RBF kernel: k(xi, xj) = exp(−||xi − xj||^2_2 / (2σ^2))
  • In practice, we find that RBF and linear kernels give similar results across most experiments.
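
A sketch of CKA as HSIC normalized so that isotropic scaling cancels, reusing the hsic helper above; with a linear kernel this reduces to ||Y^T X||^2_F / (||X^T X||_F ||Y^T Y||_F) for column-centered X and Y (function names are mine):

```python
import numpy as np

def cka(K, L):
    """CKA between two Gram matrices: HSIC normalized to [0, 1]."""
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

def linear_cka(X, Y):
    """Linear-kernel CKA computed directly from column-centered activations."""
    num = np.linalg.norm(Y.T @ X, ord='fro') ** 2
    den = np.linalg.norm(X.T @ X, ord='fro') * np.linalg.norm(Y.T @ Y, ord='fro')
    return num / den
```

For the RBF kernel, the paper sets the bandwidth σ as a fraction of the median distance between examples.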

 

Related Similarity Indexes

1. Linear Regression.

We are unaware of any application of linear regression to measuring similarity of neural network representations.
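
A rough sketch of this index (my own, under the paper's convention of column-centered X and Y): the R^2 of the best linear fit of one representation from the other.

```python
import numpy as np

def linear_regression_similarity(X, Y):
    """R^2 of fitting column-centered X from column-centered Y."""
    B, *_ = np.linalg.lstsq(Y, X, rcond=None)   # minimize ||X - Y B||_F
    residual = X - Y @ B
    return 1.0 - np.linalg.norm(residual, 'fro') ** 2 / np.linalg.norm(X, 'fro') ** 2
```

Note that this index is asymmetric: fitting X from Y generally gives a different score than fitting Y from X.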

2. Canonical Correlation Analysis (CCA)

The mean CCA correlation ρ̄_CCA was previously used to measure similarity between neural network representations.
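
A sketch of the mean CCA correlation via orthonormal bases (QR) and an SVD, assuming column-centered inputs; helper names are mine:

```python
import numpy as np

def mean_cca(X, Y):
    """Mean canonical correlation between column-centered X (n x p1) and Y (n x p2)."""
    Qx, _ = np.linalg.qr(X)                            # orthonormal basis for span(X)
    Qy, _ = np.linalg.qr(Y)                            # orthonormal basis for span(Y)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)   # canonical correlations
    return rho.sum() / min(X.shape[1], Y.shape[1])
```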

3. SVCCA.

it is invariant to invertible linear transformation only if the retained subspace does not change.
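
A rough sketch of SVCCA reusing mean_cca above: truncate each representation with an SVD (the 99% retained-variance threshold follows the usual SVCCA recipe, but is an assumption here), then run CCA on the truncated representations.

```python
import numpy as np

def svcca(X, Y, var_kept=0.99):
    """SVD-truncate each (column-centered) representation, then compute mean CCA."""
    def truncate(Z):
        U, S, _ = np.linalg.svd(Z, full_matrices=False)
        k = np.searchsorted(np.cumsum(S**2) / np.sum(S**2), var_kept) + 1
        return U[:, :k] * S[:k]                        # projection onto top-k components
    return mean_cca(truncate(X), truncate(Y))
```

Because the retained subspace can change under an invertible linear transformation, the score can change too, which is the caveat noted above.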
 

4. Projection-Weighted CCA. 

closely related to linear regression

5. Neuron Alignment Procedures.

They found that the maximum matching subsets are very small for intermediate layers.

Summary:

SVCCA and projection-weighted CCA were also motivated by the idea that eigenvectors that correspond to small eigenvalues are less important, but linear CKA incorporates this weighting symmetrically and can be computed without a matrix decomposition.

 

Results

1. A Sanity Check

Aim: Given a pair of architecturally identical networks trained from different random initializations, for each layer in the first network, the most similar layer in the second network should be the architecturally corresponding layer (a code sketch of this check follows the results below).

  • Results on a simple VGG-like convolutional network based on All-CNN-C: only CKA passes.
  • Results on Transformer networks (all layers are of equal width): All indexes pass.
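
A sketch of this sanity check (my own naming, reusing the linear_cka helper sketched earlier): given per-layer activation matrices from two independently trained runs, compute the full layer-by-layer CKA matrix and verify that each layer's best match is its architectural counterpart.

```python
import numpy as np

def sanity_check(acts_a, acts_b):
    """acts_a, acts_b: lists of (n_examples x width) activation matrices, one per layer."""
    sim = np.array([[linear_cka(A - A.mean(axis=0), B - B.mean(axis=0))
                     for B in acts_b] for A in acts_a])
    # The index passes if every layer's most similar layer is the corresponding one.
    return bool(np.all(sim.argmax(axis=1) == np.arange(len(acts_a))))
```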

2. Using CKA to Understand Network Architectures

  • Left figure: when CKA stops changing as more layers are added, accuracy stops improving as well. 1x depth denotes the base network; 2x and 4x denote how many times the layers are repeated.
  • Right figure: Layers in the same block group (i.e. at the same feature map scale) are more similar than layers in different block groups.
  • Right figure: activations after residual connections differ from activations inside residual blocks, but show no such difference among themselves.
  • CKA is equally effective at revealing relationships between layers of different architectures. As networks are made deeper, the new layers are effectively inserted in between the old layers.
  • increasing layer width leads to more similar representations between networks.
3. Across Datasets
  • CIFAR-10 and CIFAR-100 develop similar representations in their early layers.

Conclusion and Future Work

  • CKA consistently identifies correspondences between layers, not only in the same network trained from different initializations, but across entirely different architectures, whereas other methods do not.
  • CKA captures intuitive notions of similarity, i.e. that neural networks trained from different initializations should be similar to each other.

