
Graph Self-supervised Learning with Accurate Discrepancy Learning

Type: Technical
Published year: 2022
Journal / Conference: NeurIPS
Keyword: Molecular Property, Drug Repositioning, Molecular Representation
Status: Done
Language: πŸ‡ΊπŸ‡Έ
Blog upload: Yes
Date: 2022/09/21

Summary

β€’
Authors proposed a framework called D-SLA that aims to learn the exact discrepancy between the original and the perturbed graphs.
β€’
Three major components
1.
Learn to distinguish whether each graph is the original graph or the perturbed one.
2.
Capture the amount of discrepancy for each perturbed graph (using edit distance)
3.
Learn relative discrepancy with other graphs

Preliminaries

Graph Neural Networks (GNN)

1.
For each node, aggregate the features from its neighbors
2.
Combine the node's own representation with the aggregated messages
(a minimal layer sketch follows the list of variants below)
Variants of Update & Aggregate functions
β€’
Graph Convolution Network (GCN)
General convolution operation + Mean aggregation
β€’
GraphSAGE
Concatenates the aggregated neighbor representation with the node's own representation when updating
β€’
Graph Attention Network (GAT)
Considers the relative importance among neighboring nodes during aggregation
β€’
Graph Isomorphism Network (GIN)
Sum aggregation
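To make the aggregate/update steps concrete, here is a minimal message-passing layer sketch with GIN-style sum aggregation. It uses plain PyTorch with a dense adjacency matrix; the class name and tensor shapes are my own illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """Toy GIN-style layer: sum-aggregate neighbors, then update with an MLP."""
    def __init__(self, dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))  # learnable weight on the node's own features
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, adj):
        # x: [N, d] node features, adj: [N, N] dense adjacency matrix
        agg = adj @ x                              # 1. aggregate: sum of neighbor features
        return self.mlp((1 + self.eps) * x + agg)  # 2. update: combine with own representation
```

GCN, GraphSAGE, and GAT follow the same two-step template but swap in mean aggregation, concatenation with the node's own representation, or attention-weighted aggregation.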

Self-supervised learning for graphs (GSL)

Aims to learn a good representation of the graphs in an unsupervised manner.
β†’ Transfer this knowledge to downstream tasks.
Most prevalent frameworks for GSL:
β€’
Predictive learning (PL)
Aims to learn contextual relationships by predicting sub-graphical features (nodes, edges, subgraphs)
β—¦
predict the attributes of masked nodes
β—¦
predict the presence of an edge or a path
β—¦
predict the generative sequence, contextual property, and motifs
But predictive learning may not capture the global structures and/or semantics of graphs.
β€’
Contrastive learning (CL)
Aims to capture global level information.
β—¦
Early CL methods learn the similarity between the entire graph and its substructures.
β—¦
Other methods build augmented views via attribute masking, edge perturbation, and subgraph sampling.
β—¦
Recent CL adversarial methods generate positive examples either by adaptively removing the edges or by adjusting the attributes.
But CL may not distinguish two topologically similar graphs that have completely different properties.
Minimize
\mathcal{L}_{CL} = - \log \frac{f_{\text{sim}} (h_{\mathcal{G}_i}, h_{\mathcal{G}_j})}{\sum_{\mathcal{G}' \neq \mathcal{G}_0} f_{\text{sim}}(h_{\mathcal{G}_i}, h_{\mathcal{G}'})}
β—¦
\mathcal{G}_0: original graph
β—¦
\mathcal{G}_i, \mathcal{G}_j: perturbed graphs
β—¦
\mathcal{G}': other graphs in the same batch as \mathcal{G}_0, a.k.a. negative graphs
β—¦
positive pair: (\mathcal{G}_i, \mathcal{G}_j); negative pair: (\mathcal{G}_i, \mathcal{G}')
β—¦
f_\text{sim}: similarity function between two graphs β†’ L_2 distance or cosine similarity
β†’ similarity of positive pair ↑, similarity of negative pair ↓ (a small sketch of this kind of loss follows below)
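A minimal sketch of this kind of graph contrastive (InfoNCE-style) objective, assuming h_i and h_j are batched embeddings of two perturbed views of each graph; the temperature tau and the softmax normalization (which also keeps the positive pair in the denominator) are my assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h_i, h_j, tau=0.5):
    # h_i, h_j: [B, d] embeddings of two perturbed views of each graph in the batch
    h_i, h_j = F.normalize(h_i, dim=1), F.normalize(h_j, dim=1)
    sim = h_i @ h_j.t() / tau                       # f_sim: cosine similarity of all pairs
    labels = torch.arange(h_i.size(0), device=h_i.device)
    # diagonal entries are positive pairs (same graph); off-diagonal ones are negatives
    return F.cross_entropy(sim, labels)             # -log softmax of positives vs. negatives
```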

Discrepancy Learning

1.
Discriminate original vs perturbed
Perturbed graph could be semantically incorrect!
β†’ Embed perturbed graph apart from original.
\mathcal{L}_{GD} = - \log \Big( \frac{e^{S_0}}{e^{S_0} + \sum_{i \geq 1} e^{S_i}} \Big) \text{ with } S = f_S(h_\mathcal{G})
Intuitively,
β€’
large value of e^{S_0} for the original graph
β€’
small value of e^{S_i} for the perturbed graphs
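A small sketch of this discrimination loss, assuming scores = [S_0, S_1, ..., S_k] holds the score S = f_S(h_G) of the original graph (index 0) followed by its perturbed graphs:

```python
import torch
import torch.nn.functional as F

def discrimination_loss(scores):
    # scores: 1-D tensor [S_0, S_1, ..., S_k]; index 0 is the original graph
    # -log( e^{S_0} / (e^{S_0} + sum_{i>=1} e^{S_i}) ) equals cross-entropy with target class 0
    target = torch.zeros(1, dtype=torch.long, device=scores.device)
    return F.cross_entropy(scores.view(1, -1), target)
```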
How to perturb?
Aim for the perturbed graph to be semantically incorrect
1.
Remove or add a small number of edges
Manipulate the edge set by removing existing edges and adding new ones, yielding (\mathcal{E}^i, \mathcal{X}^i_\mathcal{E})
2.
Mask node attributes
Randomly mask the node attributes \mathcal{X}_\mathcal{V} for both original and perturbed graphs
(to make it more difficult to distinguish between them)
\mathcal{G}_0 = (\mathcal{V}, \mathcal{E}, \tilde{\mathcal{X}}^0_{\mathcal{V}}, \mathcal{X}_\mathcal{E}), \quad \tilde{\mathcal{X}}^0_{\mathcal{V}} \sim \texttt{M}(\mathcal{G})
\mathcal{G}_i = (\mathcal{V}, \mathcal{E}^i, \tilde{\mathcal{X}}^i_{\mathcal{V}}, \mathcal{X}^i_\mathcal{E}), \quad \tilde{\mathcal{X}}^i_{\mathcal{V}} \sim \texttt{M}(\mathcal{G}), \; (\mathcal{E}^i, \mathcal{X}^i_\mathcal{E}) \sim \texttt{P}(\mathcal{G})
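A rough sketch of what P(G) (edge perturbation) and M(G) (attribute masking) could look like, assuming a graph is stored as an edge set plus a node-feature tensor. The function names and rates are illustrative, and a real implementation would avoid edits that cancel each other so that n_edit matches the true edit distance.

```python
import random
import torch

def perturb_edges(edges, num_nodes, n_edit=3):
    """P(G): remove or add a small number of edges; returns the new edge set and edit count."""
    edges = set(edges)
    for _ in range(n_edit):
        if edges and random.random() < 0.5:
            edges.remove(random.choice(list(edges)))   # remove an existing edge
        else:
            u, v = random.sample(range(num_nodes), 2)
            edges.add((u, v))                          # add a new edge
    return list(edges), n_edit                         # n_edit is reused as e_i in L_edit

def mask_nodes(x, mask_rate=0.15):
    """M(G): randomly zero out node attributes (applied to original and perturbed graphs)."""
    x = x.clone()
    mask = torch.rand(x.size(0)) < mask_rate
    x[mask] = 0.0
    return x
```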
Personal opinion
β€’
The real role of the discriminator loss is to push the original and perturbed graphs apart while the edit-distance loss (below) is being applied.
2.
Discrepancy with Edit distance
How dissimilar?
β€’
Usually, we need to measure the graph distance, such as edit distance.
Edit distance: the number of insertion, deletion, and substitution operations on nodes & edges needed to transform one graph into another. β†’ NP-hard to compute!
β€’
But we know the exact number of perturbations applied to each perturbed graph
β†’ use it as distance.
\mathcal{L}_{edit} = \sum_{i, j} \Big( \frac{d_i}{e_i} - \frac{d_j}{e_j} \Big)^2 \text{ with } d_i = f_\text{diff}(h_{\mathcal{G}_0}, h_{\mathcal{G}_i})
f_{\text{diff}} measures the embedding-level difference between graphs with the L_2 norm.
e_i: edit distance (number of perturbations)
The trivial solution for the edit distance loss is d_i = d_j = 0, but the discriminator loss prevents this.
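A sketch of L_edit, assuming h_0 is the original-graph embedding, h_pert a list of perturbed-graph embeddings, and e the known perturbation counts used as edit distances (all names are illustrative):

```python
import torch

def edit_distance_loss(h_0, h_pert, e):
    # d_i = f_diff(h_G0, h_Gi): L2 distance in embedding space
    d = torch.stack([torch.norm(h_0 - h_i, p=2) for h_i in h_pert])
    ratio = d / torch.tensor(e, dtype=d.dtype, device=d.device)   # d_i / e_i
    # penalize mismatched per-edit distances over all ordered pairs (i, j); i = j terms are zero
    return ((ratio.unsqueeze(0) - ratio.unsqueeze(1)) ** 2).sum()
```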
3.
Relative discrepancy learning with other graphs
Assumption:
The distance between the original and the negative graphs in the same batch is larger than the distance between the original and its perturbed graphs, by some margin.
Formally,
\mathcal{L}_{margin} = \sum_{i, j} \max (0, \alpha + d_i - d'_j)
d_i: distance between the original and its perturbed graphs
d'_j: distance between the original and negative graphs
Intuitively, we want \alpha + d_i < d'_j!
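A sketch of L_margin, assuming d holds the distances from the original graph to its perturbed graphs and d_neg the distances to the negative graphs in the batch (the default alpha is a placeholder):

```python
import torch

def margin_loss(d, d_neg, alpha=1.0):
    # hinge over all (i, j) pairs: want d'_j to exceed d_i by at least the margin alpha
    diff = alpha + d.unsqueeze(1) - d_neg.unsqueeze(0)   # [num_perturbed, num_negative]
    return torch.clamp(diff, min=0).sum()
```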

Overall loss

\mathcal{L} = \mathcal{L}_{GD} + \lambda_1 \mathcal{L}_{edit} + \lambda_2 \mathcal{L}_{margin}
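Putting the three terms together, building on the sketches above (the lambda values here are placeholders, not the paper's settings):

```python
lambda_1, lambda_2 = 1.0, 1.0  # weighting hyperparameters (placeholders)
loss = (discrimination_loss(scores)
        + lambda_1 * edit_distance_loss(h_0, h_pert, e)
        + lambda_2 * margin_loss(d, d_neg, alpha=1.0))
```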

Results