TextNet: A Unified Embedding for Context Detection and Clustering



In 2015, Google released FaceNet, a model for face recognition and clustering. FaceNet uses a deep convolutional neural network that is trained to optimize the embedding itself rather than the prediction of classes. In a similar fashion, this blog post explores how the same principle could be applied to text: extracting embeddings for context detection and clustering of text data.


Suitable passage data for the model was hard to acquire. Ideally, we would need passages grouped by context, with each context paraphrased in multiple different ways. Since we could not find such a dataset (let me know if you know of one), we used the UCI News Aggregator Dataset.


For the architecture, we used BERT to generate embeddings and augmented it with a 1D convolutional layer. The resulting embedding is then optimized with a triplet loss. Here’s how we calculate the triplet loss:

  1. Select an anchor text embedding and call it \(t_{a}\)
  2. Let \(t_{p}\) be a positive text embedding (same context as the anchor) and \(t_{n}\) a negative text embedding (different context)
  3. Select hard positives and hard negatives as follows
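    1. Hard positive, the positive farthest from the anchor: \(\text{argmax}_{t_{p}^{i}} \vert t_{a}^{i} - t_{p}^{i}\vert\)
    2. Hard negative, the negative closest to the anchor: \(\text{argmin}_{t_{n}^{i}} \vert t_{a}^{i} - t_{n}^{i}\vert\)
  4. Then the loss that is being minimized, as in FaceNet, is \(L = \sum_{i}^{N} \max\left(\vert t_{a}^{i} - t_{p}^{i} \vert - \vert t_{a}^{i} - t_{n}^{i} \vert + \alpha,\ 0\right)\), where \(\alpha\) is the margin enforced between positive and negative pairs. The negative distance is subtracted, so minimizing \(L\) pulls positives toward the anchor and pushes negatives away from it.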
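As a rough illustration of the steps above, here is a minimal pure-Python sketch of a batch-hard triplet loss. It follows the FaceNet formulation, in which the negative distance is subtracted from the positive distance and the result is clamped at zero. The function names, the use of plain Euclidean distance, and the default margin of 0.2 are assumptions made for this example, not details taken from the TextNet implementation.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Sum of hinge triplet losses, using the hardest positive and
    hardest negative found in the batch for every anchor.

    `margin` plays the role of alpha; 0.2 is an illustrative default.
    """
    total = 0.0
    for i, (anchor, anchor_label) in enumerate(zip(embeddings, labels)):
        pos_dists, neg_dists = [], []
        for j, (other, other_label) in enumerate(zip(embeddings, labels)):
            if i == j:
                continue
            dist = l2_distance(anchor, other)
            if other_label == anchor_label:
                pos_dists.append(dist)
            else:
                neg_dists.append(dist)
        # An anchor needs at least one positive and one negative.
        if not pos_dists or not neg_dists:
            continue
        hardest_pos = max(pos_dists)  # farthest same-context text
        hardest_neg = min(neg_dists)  # closest different-context text
        total += max(hardest_pos - hardest_neg + margin, 0.0)
    return total


# Toy batch: two contexts, two 2-D "embeddings" each.
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [4.0, 3.0]]
toy_labels = [0, 0, 1, 1]
toy_loss = batch_hard_triplet_loss(toy_embeddings, toy_labels)
```

In a real training loop the same selection would be done on tensors so that gradients flow back into the encoder. The list-based version here is only meant to make the hard-positive and hard-negative selection explicit.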

Preliminary Results

The authors of the original FaceNet paper specify a validation metric as follows:

  1. \(TA(d) = \{(i,j) \in P_{same}, D(t_{i},t_{j}) \leq d\}\)
  2. \(VAL(d) = \frac{\vert TA(d) \vert}{\vert P_{same} \vert}\)
  3. where \(P_{same}\) is the set of pairs of texts belonging to the same context, \(d\) is the distance threshold, and \(D(x,y)\) is the distance function

We obtain \(VAL(d) = 0.803\) at \(d = 0.0670\).
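To make the metric concrete, here is a small sketch of how the validation rate could be computed from a set of embeddings and their context labels. The helper names and the choice of Euclidean distance for \(D\) are assumptions for illustration. The distance function used to produce the number reported here may differ.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance, standing in for D(x, y)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def validation_rate(embeddings, labels, d, distance=euclidean):
    """VAL(d): fraction of same-context pairs whose distance is <= d."""
    # P_same: all unordered pairs of texts that share a context label.
    same_pairs = [
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    ]
    if not same_pairs:
        raise ValueError("no same-context pairs to evaluate")
    # TA(d): same-context pairs accepted at threshold d.
    true_accepts = [
        (i, j)
        for i, j in same_pairs
        if distance(embeddings[i], embeddings[j]) <= d
    ]
    return len(true_accepts) / len(same_pairs)


# Toy data: two contexts, two 2-D "embeddings" each.
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [4.0, 3.0]]
toy_labels = [0, 0, 1, 1]
toy_val = validation_rate(toy_embeddings, toy_labels, d=2.0)
```

On the toy data, one of the two same-context pairs lies within the threshold, so the rate is one half. Sweeping `d` over a range of values would show how the rate trades off against the threshold.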


I’ll write another blog post comparing the TextNet architecture with other embedding frameworks. Any criticism is appreciated.
