Intermediate Tutorial 7

Embeddings: Representing Data as Vectors

NeuronDB Team
2/24/2025
26 min read

Embeddings Overview

Embeddings represent data as dense vectors. They capture semantic meaning. Similar items have similar vectors. They enable similarity search and arithmetic operations. Word embeddings represent words as vectors. Sentence embeddings represent sentences. Document embeddings represent documents.

Embeddings transform discrete tokens into continuous vectors. They preserve semantic relationships. Words with similar meanings have similar vectors. They enable mathematical operations on meaning.

Figure: Embeddings Concept

The diagram shows embedding space. Similar words cluster together. Relationships appear as vector differences. King - Man + Woman approximates Queen.
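
A minimal numeric sketch of this idea. The 2-D vectors below are hand-picked for illustration, not real embeddings; they only show how a shared offset encodes a relationship.

# Toy vectors: relationships as vector differences
import numpy as np
vectors = {
    'king':  np.array([0.9, 0.8]),
    'queen': np.array([0.9, 0.2]),
    'man':   np.array([0.1, 0.8]),
    'woman': np.array([0.1, 0.2])
}
result = vectors['king'] - vectors['man'] + vectors['woman']
print(result)  # [0.9 0.2]
print(np.allclose(result, vectors['queen']))  # True: the same offset separates both pairs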

Word Embeddings

Word embeddings map words to vectors. Word2Vec learns from context. GloVe learns from co-occurrence statistics. Both capture semantic relationships. Pre-trained embeddings work well for many tasks.

Word2Vec has two architectures. Skip-gram predicts context words from the target word. CBOW predicts the target word from its context. Both learn useful representations. Training uses shallow neural networks on large text corpora.

# Word Embeddings with Word2Vec
from gensim.models import Word2Vec
sentences = [
    ['king', 'queen', 'royal'],
    ['man', 'woman', 'person'],
    ['paris', 'france', 'city'],
    ['london', 'england', 'city']
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
word_vectors = model.wv
# Find similar words
similar = word_vectors.most_similar('king', topn=3)
print("Similar to 'king': " + str(similar))
# Vector arithmetic
result = word_vectors['king'] - word_vectors['man'] + word_vectors['woman']
similar_words = word_vectors.similar_by_vector(result, topn=3)
print("King - Man + Woman: " + str(similar_words))
-- NeuronDB: Word Embeddings Storage
CREATE TABLE word_embeddings (
    word VARCHAR(100) PRIMARY KEY,
    embedding vector(300)
);
-- Insert pre-trained embeddings
INSERT INTO word_embeddings (word, embedding) VALUES
    ('king', ARRAY[0.1, 0.2, ...]::vector(300)),
    ('queen', ARRAY[0.15, 0.18, ...]::vector(300));
-- Find similar words using cosine similarity
SELECT word, 1 - (embedding <=> (SELECT embedding FROM word_embeddings WHERE word = 'king')) AS similarity
FROM word_embeddings
WHERE word != 'king'
ORDER BY similarity DESC
LIMIT 5;

Word embeddings capture semantic relationships. They enable similarity search. They support arithmetic operations. They are foundational for NLP.

Detailed Word Embedding Training Methods

Word2Vec uses two architectures. Skip-gram predicts context words from the target word. Continuous Bag of Words (CBOW) predicts the target word from its context. Both learn embeddings by predicting word co-occurrences.

Skip-gram maximizes the probability of the context window given the target word, P(w_{i-k}, ..., w_{i+k} | w_i), usually factored as a product of terms P(w_{i+j} | w_i). It works well for rare words. It trains more slowly than CBOW because each target generates multiple training pairs. It captures multiple contexts per word.

CBOW averages the context word embeddings and predicts the target word from that average. It trains faster than skip-gram. It works well for frequent words.
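
A minimal CBOW sketch under the same setup as the skip-gram code below. It averages the context embeddings and scores the target word; the names and sizes here are illustrative.

# CBOW step sketch: average context embeddings, then score the target word
import numpy as np
vocab_size, embedding_dim = 10, 8
context_embeddings = np.random.randn(vocab_size, embedding_dim) * 0.01
target_embeddings = np.random.randn(vocab_size, embedding_dim) * 0.01
context_ids = [1, 3, 4, 6]  # words around the target
target_id = 5
# Average the context vectors
context_avg = context_embeddings[context_ids].mean(axis=0)
# Score every vocabulary word against the averaged context, softmax into probabilities
scores = target_embeddings @ context_avg
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print("P(target | context) =", probs[target_id])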

Training uses negative sampling. Instead of computing probabilities over the entire vocabulary, it samples a few negative examples, reducing the per-pair cost from O(V) to O(k), where k is the number of negatives (typically 5-20). This speeds up training significantly.

# Detailed Word2Vec Training (skip-gram with negative sampling)
import numpy as np
import random

class Word2VecDetailed:
    def __init__(self, vocab_size, embedding_dim=100, window_size=2, negative_samples=5):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.window_size = window_size
        self.negative_samples = negative_samples
        # Initialize target and context embedding matrices
        self.target_embeddings = np.random.randn(vocab_size, embedding_dim) * 0.01
        self.context_embeddings = np.random.randn(vocab_size, embedding_dim) * 0.01

    def skip_gram_step(self, target_idx, context_idx, learning_rate=0.01):
        # Positive example: pull the target and true context embeddings together
        target_emb = self.target_embeddings[target_idx]
        context_emb = self.context_embeddings[context_idx]
        score = np.dot(target_emb, context_emb)
        sigmoid_score = 1 / (1 + np.exp(-score))
        grad_target = (1 - sigmoid_score) * context_emb
        grad_context_pos = (1 - sigmoid_score) * target_emb
        # Negative sampling: push the target away from sampled negative contexts
        for _ in range(self.negative_samples):
            neg_idx = self.sample_negative(context_idx)
            neg_emb = self.context_embeddings[neg_idx]
            sigmoid_neg = 1 / (1 + np.exp(-np.dot(target_emb, neg_emb)))
            grad_target += -sigmoid_neg * neg_emb
            # Update the negative context embedding directly
            self.context_embeddings[neg_idx] += learning_rate * (-sigmoid_neg * target_emb)
        # Update target and positive context embeddings
        self.target_embeddings[target_idx] += learning_rate * grad_target
        self.context_embeddings[context_idx] += learning_rate * grad_context_pos

    def sample_negative(self, positive_idx):
        # Uniform sampling for simplicity; the original Word2Vec samples from the
        # unigram distribution raised to the 3/4 power
        while True:
            neg_idx = random.randint(0, self.vocab_size - 1)
            if neg_idx != positive_idx:
                return neg_idx

    def train(self, corpus, epochs=10, learning_rate=0.01):
        for epoch in range(epochs):
            pairs_trained = 0
            for sentence in corpus:
                for i, target_word in enumerate(sentence):
                    # Collect context words within the window
                    start = max(0, i - self.window_size)
                    end = min(len(sentence), i + self.window_size + 1)
                    context_words = sentence[start:i] + sentence[i+1:end]
                    for context_word in context_words:
                        self.skip_gram_step(target_word, context_word, learning_rate)
                        pairs_trained += 1
            print(f"Epoch {epoch+1}, training pairs: {pairs_trained}")

# Example usage
corpus = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]]  # Sentences as word indices
model = Word2VecDetailed(vocab_size=10, embedding_dim=50)
model.train(corpus, epochs=5)

Embedding Quality Evaluation

Evaluate embeddings using intrinsic and extrinsic tasks. Intrinsic tasks test embedding properties directly. Extrinsic tasks test downstream performance.

Intrinsic tasks include word similarity and word analogy. Word similarity compares embedding similarity to human judgments. Word analogy tests relationships like king - man + woman ≈ queen. These tasks measure embedding quality directly.

Extrinsic tasks test embeddings in applications. Text classification uses embeddings as features. Named entity recognition uses embeddings for sequence labeling. Machine translation uses embeddings for alignment. Performance on these tasks measures practical value.

# Embedding Quality Evaluation
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def evaluate_word_similarity(embeddings, word_pairs, human_scores):
    """Evaluate embedding similarity against human judgments"""
    embedding_scores = []
    for word1, word2 in word_pairs:
        if word1 in embeddings and word2 in embeddings:
            sim = cosine_similarity(
                embeddings[word1].reshape(1, -1),
                embeddings[word2].reshape(1, -1)
            )[0][0]
            embedding_scores.append(sim)
        else:
            embedding_scores.append(0)
    # Correlation between embedding similarities and human scores
    correlation = np.corrcoef(human_scores, embedding_scores)[0, 1]
    return correlation

def evaluate_word_analogy(embeddings, analogy_tests):
    """Evaluate word analogy tasks"""
    correct = 0
    total = 0
    for a, b, c, expected_d in analogy_tests:
        if all(w in embeddings for w in [a, b, c, expected_d]):
            # Compute: a - b + c should be close to expected_d
            vec = embeddings[a] - embeddings[b] + embeddings[c]
            # Find the closest word, excluding the query words
            similarities = {}
            for word, emb in embeddings.items():
                if word not in [a, b, c]:
                    sim = cosine_similarity(vec.reshape(1, -1), emb.reshape(1, -1))[0][0]
                    similarities[word] = sim
            predicted_d = max(similarities, key=similarities.get)
            if predicted_d == expected_d:
                correct += 1
            total += 1
    accuracy = correct / total if total > 0 else 0
    return accuracy

# Example
embeddings = {
    'king': np.array([0.5, 0.3, 0.2]),
    'queen': np.array([0.4, 0.4, 0.2]),
    'man': np.array([0.6, 0.2, 0.2]),
    'woman': np.array([0.3, 0.5, 0.2])
}
word_pairs = [('king', 'queen'), ('man', 'woman')]
human_scores = [0.8, 0.7]
correlation = evaluate_word_similarity(embeddings, word_pairs, human_scores)
print("Similarity correlation: " + str(correlation))
analogy_tests = [('king', 'man', 'woman', 'queen')]
accuracy = evaluate_word_analogy(embeddings, analogy_tests)
print("Analogy accuracy: " + str(accuracy))

Figure: Word Embeddings

The diagram shows word embedding space. Related words cluster together. Vector differences encode relationships.

Sentence Embeddings

Sentence embeddings represent entire sentences. They capture sentence meaning. They enable sentence similarity search. They work well for semantic search and clustering.

Detailed Sentence Embedding Methods

Averaging word embeddings is simple but limited. It computes mean of word vectors. It loses word order information. It works for short sentences. It fails for complex semantics.

Sentence encoders use neural networks. They process entire sentences. They preserve word order. They capture sentence structure. They work better than averaging.

Transformer-based encoders use BERT or similar models. They process sentences through transformer layers. They use [CLS] token or mean pooling. They capture rich semantic information. They work well for many tasks.

# Detailed Sentence Embedding Methods
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Method 1: Averaging word embeddings
def average_word_embeddings(sentence, word_embeddings):
    words = sentence.lower().split()
    word_vecs = [word_embeddings.get(word, np.zeros(300)) for word in words]
    if len(word_vecs) == 0:
        return np.zeros(300)
    return np.mean(word_vecs, axis=0)

# Method 2: Sentence transformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks",
    "AI enables computers to learn from data"
]

# Generate embeddings
embeddings_avg = [average_word_embeddings(s, {}) for s in sentences]  # Placeholder: pass real word vectors here
embeddings_transformer = model.encode(sentences)

# Compare similarity
similarity_matrix = cosine_similarity(embeddings_transformer)
print("Sentence similarity matrix:")
print(similarity_matrix)

# Method 3: BERT-based with mean pooling
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert_model = AutoModel.from_pretrained('bert-base-uncased')

def get_bert_embeddings(sentences):
    embeddings = []
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True, max_length=128)
        with torch.no_grad():
            outputs = bert_model(**inputs)
        # Mean pooling over the token dimension
        embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
        embeddings.append(embedding)
    return np.array(embeddings)

bert_embeddings = get_bert_embeddings(sentences)
print("BERT embeddings shape: " + str(bert_embeddings.shape))

Embedding Quality Metrics and Evaluation

Evaluate embeddings using multiple metrics. Intrinsic metrics test embedding properties. Extrinsic metrics test application performance. Both are important for assessment.

Intrinsic metrics include similarity correlation and analogy accuracy. Similarity correlation compares embedding similarity to human judgments. Higher correlation indicates better embeddings. Analogy accuracy tests word relationships. Higher accuracy indicates better structure.

Extrinsic metrics test downstream tasks. Classification accuracy uses embeddings as features. Clustering quality measures grouping performance. Retrieval performance measures search quality. Better embeddings improve task performance.

# Comprehensive Embedding Evaluation
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score
from scipy.stats import spearmanr

def evaluate_embeddings_comprehensive(embeddings, labels, similarity_pairs=None, human_scores=None, cv=2):
    results = {}
    # 1. Similarity correlation (if human scores available)
    if similarity_pairs and human_scores:
        embedding_scores = []
        for word1, word2 in similarity_pairs:
            if word1 in embeddings and word2 in embeddings:
                sim = cosine_similarity(
                    embeddings[word1].reshape(1, -1),
                    embeddings[word2].reshape(1, -1)
                )[0][0]
                embedding_scores.append(sim)
        if len(embedding_scores) == len(human_scores):
            correlation, p_value = spearmanr(embedding_scores, human_scores)
            results['similarity_correlation'] = correlation
            results['similarity_p_value'] = p_value
    # 2. Classification performance
    X = np.array([embeddings.get(word, np.zeros(300)) for word in labels.keys()])
    y = list(labels.values())
    clf = LogisticRegression()
    cv_scores = cross_val_score(clf, X, y, cv=cv)  # cv must not exceed the smallest class size
    results['classification_accuracy'] = cv_scores.mean()
    results['classification_std'] = cv_scores.std()
    # 3. Clustering quality
    n_clusters = len(set(y))
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(X)
    results['adjusted_rand_index'] = adjusted_rand_score(y, cluster_labels)
    results['silhouette_score'] = silhouette_score(X, cluster_labels)
    return results

# Example evaluation (toy data: four random vectors, two classes)
embeddings_dict = {
    'cat': np.random.randn(300),
    'dog': np.random.randn(300),
    'car': np.random.randn(300),
    'vehicle': np.random.randn(300)
}
labels_dict = {'cat': 0, 'dog': 0, 'car': 1, 'vehicle': 1}
results = evaluate_embeddings_comprehensive(embeddings_dict, labels_dict)
print("Evaluation results:")
for metric, value in results.items():
    print(f"{metric}: {value:.4f}")

Methods include averaging word embeddings, training sentence encoders, and using transformer models. Averaging is simple but loses word order. Sentence encoders preserve structure. Transformers capture complex relationships.

# Sentence Embeddings
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "The cat sits on the mat",
    "A feline is on the rug",
    "The weather is sunny today"
]
embeddings = model.encode(sentences)

# Compute cosine similarity between the first two sentences
similarity = np.dot(embeddings[0], embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
print("Similarity between sentence 1 and 2: " + str(similarity))
# High similarity indicates similar meaning

Sentence embeddings enable semantic search. They find sentences with similar meaning. They work regardless of exact word matches.
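
A minimal semantic search sketch using the same all-MiniLM-L6-v2 model as above. The query and corpus here are illustrative; the code encodes the query and ranks stored sentences by cosine similarity.

# Semantic search: rank sentences by cosine similarity to a query
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
corpus = [
    "The cat sits on the mat",
    "A feline is on the rug",
    "The weather is sunny today"
]
corpus_embeddings = model.encode(corpus)
query_embedding = model.encode(["Where is the cat?"])
scores = cosine_similarity(query_embedding, corpus_embeddings)[0]
# Print sentences from most to least similar
for sentence, score in sorted(zip(corpus, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {sentence}")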

Figure: Sentence Embeddings

The diagram shows sentence embedding space. Semantically similar sentences cluster together.

Document Embeddings

Document embeddings represent entire documents. They capture document topics and themes. They enable document similarity and clustering. They work well for information retrieval.

Methods include averaging sentence embeddings, training document encoders, and using transformer models with pooling. Document encoders preserve document structure. Transformers capture long-range dependencies.

# Document Embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
documents = [
    "Machine learning is a subset of artificial intelligence...",
    "Deep learning uses neural networks with multiple layers...",
    "The weather forecast predicts rain tomorrow..."
]
doc_embeddings = model.encode(documents)
# Find similar documents
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(doc_embeddings)
print("Document similarity matrix:")
print(similarity_matrix)

Document embeddings enable semantic document search. They find documents with similar topics. They work for large document collections.
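
For long documents, one of the methods listed above is averaging sentence embeddings. A minimal sketch, assuming a naive period-based sentence split; a real pipeline would use a proper sentence tokenizer.

# Document embedding by averaging sentence embeddings
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
def embed_document(text):
    # Naive sentence split on periods (illustrative only)
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    if not sentences:
        return np.zeros(model.get_sentence_embedding_dimension())
    return model.encode(sentences).mean(axis=0)
long_document = (
    "Machine learning is a subset of artificial intelligence. "
    "Deep learning uses neural networks with multiple layers. "
    "Both learn patterns from data."
)
doc_vector = embed_document(long_document)
print("Document embedding shape: " + str(doc_vector.shape))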

Embedding Similarity and Distance

Similarity measures compare embeddings. Cosine similarity measures angle between vectors. Euclidean distance measures straight-line distance. Dot product measures alignment. Each suits different use cases.

Cosine similarity is cos(θ) = (A·B) / (||A|| × ||B||). It ranges from -1 to 1. Higher values mean more similar. It ignores vector magnitudes. It works well for embeddings.
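
As a quick check of the formula with toy vectors, computed directly before using the library helpers below:

# Cosine similarity computed directly from the formula
import numpy as np
A = np.array([1.0, 0.0, 0.0])
B = np.array([0.9, 0.1, 0.0])
cos_theta = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print("cos(theta) = " + str(cos_theta))  # Close to 1: nearly the same direction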

# Embedding Similarity
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0]
])
# Cosine similarity
cos_sim = cosine_similarity(embeddings)
print("Cosine similarity:")
print(cos_sim)
# Euclidean distance
euc_dist = euclidean_distances(embeddings)
print("Euclidean distance:")
print(euc_dist)

Choose similarity measures based on needs. Cosine similarity works well for embeddings. Euclidean distance works for spatial data.

Figure: Embedding Similarity

The diagram shows similarity computation. Vectors with small angles have high cosine similarity.

Embedding Arithmetic

Embedding arithmetic performs operations on meaning. King - Man + Woman approximates Queen. It demonstrates captured relationships. It enables analogy solving.

Arithmetic works because embeddings capture relationships. Vector differences encode relationships. Adding differences applies relationships. Results approximate semantic operations.

# Embedding Arithmetic
from gensim.models import KeyedVectors

# Load pre-trained embeddings (path to a local word2vec binary)
word_vectors = KeyedVectors.load_word2vec_format('word2vec.bin', binary=True)

# King - Man + Woman ≈ Queen
result = word_vectors['king'] - word_vectors['man'] + word_vectors['woman']
similar_words = word_vectors.similar_by_vector(result, topn=5)
print("King - Man + Woman: " + str(similar_words))  # 'queen' should rank highly
# most_similar performs a similar computation but excludes the input words from the results
print(word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))

# Paris - France + Italy ≈ Rome
result = word_vectors['paris'] - word_vectors['france'] + word_vectors['italy']
similar_words = word_vectors.similar_by_vector(result, topn=5)
print("Paris - France + Italy: " + str(similar_words))  # 'rome' should rank highly

Embedding arithmetic demonstrates learned relationships. It shows embeddings capture semantic structure. It enables analogy solving.

Figure: Embedding Arithmetic

The diagram shows embedding arithmetic. Vector operations approximate semantic relationships.

Pre-trained Embeddings

Pre-trained embeddings are trained on large corpora. They capture general language patterns. They work well for many tasks. They save training time and data.

Common pre-trained embeddings include Word2Vec, GloVe, FastText, and transformer embeddings. Word2Vec and GloVe are word-level. FastText handles subwords. Transformers provide contextual embeddings.

# Using Pre-trained Embeddings
import gensim.downloader as api
# Load pre-trained Word2Vec
word_vectors = api.load("word2vec-google-news-300")
# Use embeddings
similar = word_vectors.most_similar('computer', topn=5)
print("Similar to 'computer': " + str(similar))

Pre-trained embeddings provide strong baselines. They work well without fine-tuning. They enable quick prototyping.
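
Other pre-trained families are available through the same gensim downloader. A hedged sketch: the GloVe dataset name below is assumed from gensim-data, so check api.info() for what is actually available.

# Exploring other pre-trained embeddings via the gensim downloader
import gensim.downloader as api
# List some available pre-trained models
print(list(api.info()['models'].keys())[:10])
# GloVe vectors (name assumed; verify with api.info())
glove_vectors = api.load("glove-wiki-gigaword-100")
print(glove_vectors.most_similar('computer', topn=3))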

Fine-tuning Embeddings

Fine-tuning adapts pre-trained embeddings to specific tasks. It improves performance on domain data. It requires task-specific training data. It balances general and specific knowledge.

Fine-tuning updates embedding weights. It preserves general knowledge. It learns task-specific patterns. It improves performance on target tasks.

# Fine-tuning Embeddings
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('all-MiniLM-L6-v2')

# Task-specific examples; CosineSimilarityLoss expects a target similarity label per pair
examples = [
    InputExample(texts=['query about machine learning', 'document about ML'], label=0.9),
    InputExample(texts=['query about weather', 'weather forecast document'], label=0.9)
]
dataloader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)

# Fine-tune
model.fit(train_objectives=[(dataloader, loss)], epochs=1)

Fine-tuning improves task performance. It adapts general embeddings to specific needs. It requires labeled task data.

Summary

Embeddings represent data as dense vectors. Word embeddings capture word meaning. Sentence embeddings capture sentence meaning. Document embeddings capture document topics. Similarity measures compare embeddings. Cosine similarity works well for embeddings. Embedding arithmetic performs operations on meaning. Pre-trained embeddings provide strong baselines. Fine-tuning adapts embeddings to tasks. Embeddings enable semantic search and similarity operations.
