One-hot embeddings are the simplest way to turn categories or tokens into numbers a model can use: you build a fixed vocabulary, give each item an index, and represent any item as a vector of length |vocab| with a single 1 at its index and 0 everywhere else.
Vocabulary    One-Hot Embedding
cat           1 0 0 0
dog           0 1 0 0
fox           0 0 1 0
banana        0 0 0 1
unknown       0 0 0 0
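The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the helper name one_hot is hypothetical, and sending out-of-vocabulary items to the all-zero vector is an assumption taken from the "unknown" row.

```python
def one_hot(item, vocab):
    """Return a list of len(vocab) with a 1 at item's index.

    Assumption: items not in vocab map to the all-zero vector,
    mirroring the "unknown" row of the table.
    """
    return [1 if item == entry else 0 for entry in vocab]


vocab = ["cat", "dog", "fox", "banana"]
for word in vocab + ["unknown"]:
    print(word, one_hot(word, vocab))
```

The vector length always equals the vocabulary size, which is why large vocabularies lead to very wide, mostly zero vectors.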
Because every pair of distinct categories is equally far apart, one-hot encodes no notion of similarity (e.g., “cat” isn’t closer to “dog” than to “banana”). It’s easy to create and perfectly preserves identity, which is why it’s common for small categorical features (colors, days of the week, etc.). The trade-offs are high dimensionality when vocabularies are large and sparsity (almost all zeros), which can waste memory and hinder generalization.
In nearest-neighbor setups with purely one-hot features, Euclidean and cosine distance reduce to a simple match vs. non-match test (distance 0 for the same category, one constant value for any two different categories): great for crisp lookups, poor for capturing nuanced relationships. For richer meaning and compactness, models usually replace or augment one-hot with learned dense embeddings that place related items close together in a continuous space.
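To make the match vs. non-match behavior concrete, here is a small sketch that computes both distances between a few one-hot vectors. The function names euclidean and cosine_distance are illustrative; cosine distance is taken as 1 minus cosine similarity, and it is undefined for an all-zero vector, so only in-vocabulary items are compared.

```python
import math


def euclidean(a, b):
    # Straight-line distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


cat = [1, 0, 0, 0]
dog = [0, 1, 0, 0]
banana = [0, 0, 0, 1]

# Any two different categories are sqrt(2) apart (Euclidean) or 1.0 apart (cosine).
print(euclidean(cat, dog), euclidean(cat, banana))
print(cosine_distance(cat, dog), cosine_distance(cat, banana))
# Identical categories are at distance 0 under both measures.
print(euclidean(cat, cat), cosine_distance(cat, cat))
```

Since only two distance values are possible, a nearest-neighbor search over pure one-hot vectors can tell "same category" from "different category" and nothing more.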
Martian Pet Parade
On Mars, the Pet Parade console decides whether a pet is calm or friendly. Judges look at two things - the pet’s age (years) and the species - then let the console’s 3-nearest neighbors cast the vote.
You are given a fixed species vocabulary, a set of preexisting entries, and some new pets. Build each pet's vector as follows: scale the age with min–max scaling fitted on the preexisting entries only (range [0,1]; a flat age column becomes 0.0), one-hot encode the species in the order the species appear in the vocabulary, and concatenate the scaled age with the one-hot species. Use Euclidean distance on the resulting vectors to find the 3 nearest neighbors of each new pet among the preexisting entries. Predict by majority vote among those 3 entries; if there’s a label tie, print the alphabetically smallest label. All species in the data are guaranteed to be in the provided vocabulary.
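The vector-building step can be sketched as follows. The helper name build_vectors and its parameter names are hypothetical. The statement does not say whether a new pet's age outside the preexisting range should be clipped, so this sketch leaves such values unclipped, which is an assumption.

```python
def build_vectors(ages, species, train_ages, vocab):
    """Return [scaled_age, one-hot species...] for each (age, species) pair.

    Min-max statistics come from train_ages (the preexisting entries) only.
    Assumption: ages outside the training range are not clipped.
    """
    lo, hi = min(train_ages), max(train_ages)
    span = hi - lo
    vectors = []
    for age, sp in zip(ages, species):
        # A flat age column (span == 0) scales to 0.0, as the statement requires.
        scaled = (age - lo) / span if span > 0 else 0.0
        one_hot = [1.0 if sp == entry else 0.0 for entry in vocab]
        vectors.append([scaled] + one_hot)
    return vectors


# Ages 1..5 among the preexisting entries, so age 3 scales to (3 - 1) / (5 - 1) = 0.5.
print(build_vectors([3.0], ["dog"], [1, 2, 3, 5, 4, 2], ["cat", "dog", "rabbit"]))
# prints [[0.5, 0.0, 1.0, 0.0]]
```

Because a species mismatch contributes 2 to the squared distance while the scaled age difference contributes at most 1 within the training range, same-species entries always rank ahead of other species for in-range ages.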
The first line of the input contains a single integer v - the number of species in the vocabulary. The second line contains v space-separated tokens - the species vocabulary. The next line contains two integers n d, where n is the number of preexisting entries and d is the number of numeric features; here d = 1 (there is only the age feature). Each of the next n lines contains an age (floating-point number), a species, and a label - age species label. The next line contains a single integer q - how many new pets to classify. Each of the next q lines contains age species for a new pet.
The program should print q lines. Each line should contain the predicted label - calm or friendly.
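One possible end-to-end solution is sketched below; it is one way to meet the specification, not a reference implementation. The function name solve is hypothetical. The statement does not say how to order preexisting entries that are equally distant from a new pet, so the sketch keeps them in input order (Python's sort is stable), which is an assumption.

```python
import math
from collections import Counter


def solve(text, k=3):
    """Parse the whole input as whitespace-separated tokens; return the output text."""
    it = iter(text.split())
    v = int(next(it))
    vocab = [next(it) for _ in range(v)]
    n, d = int(next(it)), int(next(it))  # d is always 1 here (age only)
    train = [(float(next(it)), next(it), next(it)) for _ in range(n)]
    q = int(next(it))
    queries = [(float(next(it)), next(it)) for _ in range(q)]

    # Min-max statistics come from the preexisting entries only.
    ages = [age for age, _, _ in train]
    lo, span = min(ages), max(ages) - min(ages)

    def vec(age, species):
        scaled = (age - lo) / span if span > 0 else 0.0
        return [scaled] + [1.0 if species == s else 0.0 for s in vocab]

    train_vecs = [(vec(age, sp), label) for age, sp, label in train]
    out = []
    for age, sp in queries:
        qv = vec(age, sp)
        # Assumption: equally distant entries keep their input order (stable sort).
        nearest = sorted(train_vecs, key=lambda tv: math.dist(qv, tv[0]))[:k]
        votes = Counter(label for _, label in nearest)
        top = max(votes.values())
        # Label tie: choose the alphabetically smallest label.
        out.append(min(label for label, count in votes.items() if count == top))
    return "\n".join(out)


sample = """3
cat dog rabbit
6 1
1 cat calm
2 dog friendly
3 dog friendly
5 rabbit calm
4 cat calm
2 rabbit calm
2
3 dog
5 rabbit
"""
print(solve(sample))
# prints:
# friendly
# calm
```

To run it as a submission, the same function could be fed sys.stdin.read() instead of the inline sample. math.dist requires Python 3.8 or newer.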
Example 1
Input:
3
cat dog rabbit
6 1
1 cat calm
2 dog friendly
3 dog friendly
5 rabbit calm
4 cat calm
2 rabbit calm
2
3 dog
5 rabbit
Output:
friendly
calm

Example 2
Input:
2
cat dog
5 1
1 cat calm
4 cat calm
2 dog friendly
3 dog friendly
5 dog friendly
2
2 cat
5 dog
Output:
calm
friendly

Example 3
Input:
4
cat dog rabbit lizard
7 1
1 cat calm
2 cat calm
2 dog friendly
3 lizard friendly
4 rabbit calm
5 rabbit calm
4 dog friendly
2
3 dog
4 rabbit