Filling Missing Values

Real-world datasets have gaps: sensors fail, forms get half-filled, and numbers go missing. Many models (especially distance-based ones like KNN) need complete numeric features to compare examples fairly. Filling missing values (imputation) preserves data you’d otherwise throw away, keeps neighborhoods intact, and reduces the bias that comes from dropping rows wholesale. A strong baseline is to replace each missing numeric entry with its feature’s mean, computed from the data you already have; this is simple, deterministic, and avoids information leakage from future data. It won’t capture complex structure and can shrink variance if data aren’t missing at random, but it’s often good enough to get a model moving.
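
A minimal sketch of mean imputation in plain Python (the function names `column_means` and `impute` are illustrative, not from any library; missing entries are represented as `None`):

```python
def column_means(rows):
    """Per-column mean over observed (non-None) entries; 0.0 if a column is all missing."""
    d = len(rows[0])
    means = []
    for j in range(d):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means

def impute(rows, means):
    """Replace each None with the corresponding column mean."""
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]
```

For example, with `rows = [[1.0, None], [3.0, 4.0]]`, the column means are `[2.0, 4.0]`, and imputation fills the gap to give `[[1.0, 4.0], [3.0, 4.0]]`.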

Depending on your problem, you might instead drop rows or columns with too many gaps, or tame extreme values by dropping/clipping outliers; more advanced options include median or k-NN imputation, regression-based fills, or multiple imputation, which we will cover later in the course.

Greenhouse Gaps: Patch, Then Predict

Your plant log has some missing data points (sensors blinked, papers smudged), and KNN can’t measure distances without numbers. Dropping rows would shrink neighborhoods and bias results. A simple fix is to fill missing numeric features using means computed from the preexisting log only, then let nearby plants vote with their values.

You are asked to read a log of n plants with d numeric features and a numeric target. For each feature, compute the mean on the preexisting data only (ignoring NA entries). Then fill the NA values in both the preexisting rows and the new rows with those means, and predict each new row’s target using k-NN regression with k=2 and Euclidean distance on the filled features. If a column has no observed numbers in the preexisting data, treat its mean as 0. If several candidates tie at the cutoff distance for the 2 nearest neighbors, choose the one that appears earlier in the input.
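
The prediction step can be sketched as follows, assuming the features are already filled (the name `knn_predict` is mine). Python’s `sorted` is stable, so sorting by distance alone keeps earlier rows ahead of later ones at equal distance, which matches the tie-breaking rule above:

```python
import math

def knn_predict(train_x, train_y, query, k=2):
    """Average target of the k nearest training rows (Euclidean distance).

    sorted() is stable, so on distance ties the row that appears
    earlier in the input wins, as the statement requires.
    """
    dists = [math.dist(x, query) for x in train_x]
    order = sorted(range(len(train_x)), key=lambda i: dists[i])
    return sum(train_y[i] for i in order[:k]) / k
```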

The first line of the input contains two integers n d representing how many preexisting rows there are and how many numeric features each row has. Each of the next n lines contains d tokens (each a floating-point number or the word NA) followed by a floating-point target.

The next line contains a single integer q for how many new rows to predict. Each of the next q lines contains d tokens (each a floating-point number or NA).

The program should print q lines; each line should contain a single floating-point prediction.

Input
3 2
NA 2 5
3 NA 7
10 10 9
2
0 2
10 NA

Output
6
7

Input
5 3
1 NA 2 10
NA 4 2 8
1 5 NA 6
2 6 3 12
3 7 4 14
2
2 5 NA
NA 6 3

Output
9
9

Input
4 2
5 0 10
7 0 8
9 0 6
NA 0 12
2
6 0
NA 0

Output
9
10

Constraints

Time limit: 2 seconds

Memory limit: 512 MB

Output limit: 1 MB
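
One possible end-to-end sketch (not the official solution) that ties imputation and prediction together. It takes the input as a string and returns the output as a string, so the same logic can be wrapped around stdin/stdout:

```python
import math

def solve(text: str) -> str:
    toks = text.split()
    pos = 0
    def nxt():
        nonlocal pos
        pos += 1
        return toks[pos - 1]

    n, d = int(nxt()), int(nxt())
    rows, targets = [], []
    for _ in range(n):
        rows.append([None if (t := nxt()) == "NA" else float(t)
                     for _ in range(d)])
        targets.append(float(nxt()))
    q = int(nxt())
    queries = [[None if (t := nxt()) == "NA" else float(t) for _ in range(d)]
               for _ in range(q)]

    # Column means over observed preexisting entries only; 0.0 if none observed.
    means = []
    for j in range(d):
        obs = [r[j] for r in rows if r[j] is not None]
        means.append(sum(obs) / len(obs) if obs else 0.0)
    fill = lambda r: [means[j] if v is None else v for j, v in enumerate(r)]
    rows = [fill(r) for r in rows]

    out = []
    for query in queries:
        qf = fill(query)
        # Stable sort: distance ties are broken by original input order.
        order = sorted(range(n), key=lambda i: math.dist(rows[i], qf))
        pred = sum(targets[i] for i in order[:2]) / 2
        out.append(f"{pred:g}")
    return "\n".join(out)
```

On the first sample above, `solve("3 2\nNA 2 5\n3 NA 7\n10 10 9\n2\n0 2\n10 NA\n")` returns `"6\n7"`.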
