TextNet: A Unified Embedding for Context Detection and Clustering

Feb 4

In 2015, Google released FaceNet, a model for face recognition and clustering. FaceNet uses a deep convolutional neural network that is trained to optimize the embedding itself rather than the prediction of classes. In a similar fashion, this blog post explores how the same principle could be applied to text: extracting embeddings for context detection and clustering of text data.

Data

Suitable passage data for the model was hard to acquire. Ideally, we would need passages that express the same context paraphrased in multiple ways, grouped by context. Since we could not find such a dataset (let me know if you know of one), we used the UCI News Aggregator Dataset.
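Training on triplets needs (anchor, positive, negative) examples drawn from labeled text. The sketch below shows one possible way to sample them, assuming each row has been reduced to a (text, label) pair. The post does not say which column of the dataset served as the context label, so the function name `sample_triplets`, the label codes, and the example rows are all hypothetical.

```python
import random
from collections import defaultdict


def sample_triplets(rows, n_triplets, seed=0):
    """Sample (anchor, positive, negative) texts from (text, label) rows."""
    rng = random.Random(seed)

    # Group texts by their context label.
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append(text)

    # Only labels with at least two texts can supply an anchor/positive pair.
    anchor_labels = [l for l, texts in by_label.items() if len(texts) >= 2]
    if not anchor_labels or len(by_label) < 2:
        raise ValueError("need at least two labels, one with two or more texts")

    triplets = []
    for _ in range(n_triplets):
        pos_label = rng.choice(anchor_labels)
        neg_label = rng.choice([l for l in by_label if l != pos_label])
        anchor, positive = rng.sample(by_label[pos_label], 2)
        negative = rng.choice(by_label[neg_label])
        triplets.append((anchor, positive, negative))
    return triplets


# Illustrative rows only; these are not taken from the actual dataset.
rows = [
    ("Central bank holds interest rates steady", "b"),
    ("Stocks rally after earnings reports", "b"),
    ("New smartphone unveiled at launch event", "t"),
    ("Chipmaker announces faster processor", "t"),
]
triplets = sample_triplets(rows, n_triplets=5)
```

Random sampling like this only produces candidate triplets; choosing hard positives and negatives happens later, once embeddings are available.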

Architecture

For the architecture, we used BERT to generate embeddings and augmented it with a 1D convolutional layer. The resulting embedding is then optimized with a triplet loss. Here’s how we calculate the triplet loss:

Select an anchor text embedding, call it \(t_{a}\)
Let the positive text embedding (same context as the anchor) be \(t_{p}\) and the negative text embedding (different context) be \(t_{n}\)
Select hard positives and hard negatives as follows
1. \(\text{argmax}_{t_{p}^{i}} \vert t_{a}^{i} - t_{p}^{i} \vert\)
2. \(\text{argmin}_{t_{n}^{i}} \vert t_{a}^{i} - t_{n}^{i} \vert\)
Then the loss that is being minimized is \(L = \sum_{i}^{N} \left[ \vert t_{a}^{i} - t_{p}^{i} \vert - \vert t_{a}^{i} - t_{n}^{i} \vert + \alpha \right]_{+}\), where \(\alpha\) is the margin and \([x]_{+} = \max(x, 0)\)
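As a rough illustration of the selection and loss steps, here is a minimal plain-Python sketch of a batch-hard triplet loss. It uses the standard hinge form from FaceNet, in which the anchor-negative distance is subtracted from the anchor-positive distance. The Euclidean distance, the margin value, and the function names are assumptions made for the example, not details taken from the actual TextNet code, which would operate on tensors rather than lists.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hard_triplet_loss(embeddings, labels, alpha=0.2):
    """Sum of hinge losses using the hardest positive and negative per anchor.

    alpha is the margin; 0.2 is only an illustrative default.
    """
    total = 0.0
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        # Distances to every other embedding with the same label.
        positives = [
            distance(anchor, e)
            for j, (e, l) in enumerate(zip(embeddings, labels))
            if l == label and j != i
        ]
        # Distances to every embedding with a different label.
        negatives = [
            distance(anchor, e)
            for e, l in zip(embeddings, labels)
            if l != label
        ]
        if not positives or not negatives:
            continue  # this anchor cannot form a triplet in this batch

        hardest_positive = max(positives)  # farthest same-context text
        hardest_negative = min(negatives)  # closest different-context text
        total += max(0.0, hardest_positive - hardest_negative + alpha)
    return total


# Two well-separated contexts: every triplet already satisfies the margin.
separated = hard_triplet_loss(
    [[0, 0], [1, 0], [0, 3], [0, 4]], ["a", "a", "b", "b"], alpha=0.5
)

# A negative sits between two positives, so the margin is violated.
violated = hard_triplet_loss(
    [[0, 0], [3, 0], [1, 0]], ["a", "a", "b"], alpha=0.5
)
```

The hinge matters: without it, the loss could be driven arbitrarily low by pushing negatives away, even after every triplet already satisfies the margin.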

Preliminary Results

The authors of the original FaceNet paper specify a validation metric as follows:

\(TA(d) = \{(i,j) \in P_{same}, D(t_{i},t_{j}) \leq d\}\)
\(VAL(d) = \frac{\vert TA(d) \vert}{\vert P_{same} \vert}\)
where \(P_{same}\) is the set of text pairs belonging to the same context, \(d\) is the distance threshold, and \(D(x,y)\) is the distance function

We obtain \(VAL(d) = 0.803\) at a threshold of \(d = 0.0670\)
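To make the metric concrete, the following sketch computes \(VAL(d)\) for a small set of labeled embeddings. The post does not state which distance function \(D\) was used, so Euclidean distance is an assumption here, and the function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed D)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def validation_rate(embeddings, labels, d, dist=euclidean):
    """VAL(d): fraction of same-context pairs whose distance is at most d."""
    # P_same: all index pairs (i, j) that share a context label.
    same_pairs = [
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    ]
    if not same_pairs:
        raise ValueError("no same-context pairs to evaluate")

    # TA(d): same-context pairs accepted at threshold d.
    true_accepts = [
        (i, j) for i, j in same_pairs
        if dist(embeddings[i], embeddings[j]) <= d
    ]
    return len(true_accepts) / len(same_pairs)


# Toy data: four same-context pairs, two of which fall within the threshold.
toy_embeddings = [[0.0, 0.0], [0.05, 0.0], [1.0, 0.0], [5.0, 5.0], [5.0, 5.01]]
toy_labels = ["a", "a", "a", "b", "b"]
val = validation_rate(toy_embeddings, toy_labels, d=0.067)
```

Note that \(VAL(d)\) on its own rises with \(d\); the FaceNet paper pairs it with a false accept rate over different-identity pairs, which is not reported in this post.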

Conclusion

I’ll write another blog post comparing the TextNet architecture with other embedding frameworks. Any criticism is appreciated.

Niranjan Krishna