Handling of Missing Data

Purpose

This document describes how missing data is handled in user queries, specifically after the requested data is found, but before it is sorted and before values requested in the query are extracted from the data.

Summary

Missing data is interpreted as (i) present and (ii) having unset values.

Specification

Below, we refer to object’s “properties”. A property can be viewed as a key:value pair. Keys are strings. Values have various types. Objects can have multiple properties. For the purpose of explaining the policy, it is instructive to imagine that object’s properties are stored in an associative container; property value and can be extract using a lookup by property key using a string.

Below, we use words ‘missing’ and ‘absent’ interchangeably.

The policy for handling missing data is as follows:

  • missing property values are interpreted as unset, which correspond to null in Json and None in Python;
  • missing property keys are interpreted as present, and the corresponding values as unset;
  • when comparing property values during sorting, missing values compare to be less than non-missing; and
  • property keys that are missing are never inserted, this avoiding path-dependency (or hysteresis) in query results.

As a technical remark, we note that no distinction is drawn between property keys that are (a) absent in some objects but present in other objects, and (b) absent from the database.

Example 1

Let us use an unrelated field for sorting the results. We look for images of apples, list some of their properties (as key:value pairs) and sort the images by intelligence. In case IQ field is a sort field and is displayed, we will see

Result_Long = [{"color": "green", "type": "granny_smith", "IQ": null},
               {"color": "red_yellow_stripe", "type": "gala", "IQ": null}]

which is correctly sorted by IQ, because

Result_Long[0]["IQ"] == Result_Long[1]["IQ"] == null

Whether using IQ as a sorting field is a mistake or not, is left for the users to decide. If IQ==null is not what they expect, then this is how the users know they made a mistake. In case the users do not output the IQ field, the result will be

Result_Short = [{"color": "green", "type": "granny_smith"},
                {"color": "red_yellow_stripe", "type": "gala"}]

which is still correctly sorted by IQ. We recommend including the sort field in the output during development phase to facilitate spot-checking for application-level errors, such as misspelling or mis-capitalizing property names.

Example 2

Let us misspell or mis-capitalize a filed name. We look for images of apples, list two properties, “type” and “COLOR”, and request sorting by “COLOR”. Note that “COLOR”, in uppercase, is not an existing property, while lowercase “color” is. Executing the query would result in the following:

Result_Misspell = [{"COLOR": null, "type": "granny_smith"},
                   {"COLOR": null, "type": "gala"}]

The result show that the value of “COLOR” attribute, in uppercase, is always null, which suggests a possible application-level mistake.

In case the user do not care about color attribute and do not ask to display it, the output will be:

Result_Misspell = [{"type": "granny_smith"},
                   {"type": "gala"}]

which is still correctly sorted by uppercase “COLOR” field.