Detecting filtered data

It’s not always easy to know whether your data has been manipulated before you received it.

Some control systems depend on feedback from measurement data. It’s not unusual for that data to be collected by a device that applies unknown filters, and these filters may introduce changes that affect the design of the feedback control system. This post is about a simple way of detecting whether a data stream has been altered in some way after the raw data was collected.

Almost all measurement systems use some kind of analog-to-digital (A/D) converter at some point during the measurement process. These converters have a specific sampling resolution that limits the number of bits they can generate to represent values. For example, an 8-bit A/D converter can only generate integer values between 0 and 255, one for each of the possible arrangements of 8 binary digits. In theory, the maximum number of distinct values present in a data set is $2^N$, where $N$ is the number of bits used to represent values in the data set. So for any given data set, we expect to observe at most $2^N$ distinct or unique values. If we observe any more than that, then we might conclude that the data set has somehow been manipulated after the raw data was collected from the A/D converter.
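
The bound is easy to check directly at the Python prompt; for an 8-bit converter:

>>> 2**8    # maximum number of distinct codes an ideal 8-bit converter can emit
256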

So how does one evaluate a data set to determine whether it contains an unexpected number of distinct values?

We use the concept of entropy, which has its origin in information theory, and compare the expected entropy of a data set against the entropy we actually observe. The two quantities are evaluated as follows:

  • Actual entropy: This is the entropy we actually observe in a data set, $\sigma = \log(len(unique(data))) / \log(2)$, where $unique(\cdot)$ is the set of distinct values in the data and $len(\cdot)$ is the count of values in that set. In other words, $\sigma$ is the base-2 logarithm of the number of distinct values observed.

  • Expected entropy: This is the entropy we expect to observe given the measurement system employed. The expected entropy is simply $\bar\sigma=N$, where $N$ is the number of bits of the A/D converter. If the measurement system employs the full dynamic range of the A/D, then the resulting data sets should have actual entropies that approach, but do not exceed, the expected entropy. (A small code sketch of the actual-entropy calculation follows this list.)
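
Here is a minimal sketch of the actual-entropy calculation in Python, assuming NumPy is available; the function name actual_entropy is mine, chosen for illustration:

>>> import numpy as np
>>> def actual_entropy(data):
...     """Base-2 log of the number of distinct values observed in the data.
...     The expected entropy is simply N, the bit count of the A/D converter."""
...     return np.log2(len(np.unique(data)))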

If a low-resolution data set has been altered in some way, then it may have an actual entropy that is higher than the expected entropy of the measurement system. Similarly, if a high-resolution data set has been mis-sampled or subjected to lossy compression, then it may have an actual entropy that is significantly lower than expected. This comparison can be used to detect whether a data set has been filtered, had noise added, been lossily compressed, or was simply sampled using a dynamic range that is too low.
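
A simple check built on this comparison might look like the sketch below. The name entropy_check and the slack parameter tol are my own inventions, and the lower threshold in particular is a heuristic that depends on how much of the converter's dynamic range the measurement actually uses:

>>> def entropy_check(data, n_bits, tol=0.5):
...     """Compare a data set's actual entropy against the expected entropy
...     of an n_bits A/D converter. tol is an arbitrary slack allowing for
...     data that does not quite fill the full dynamic range."""
...     sigma = np.log2(len(np.unique(data)))
...     if sigma > n_bits:
...         return 'altered: more distinct values than the A/D can produce'
...     elif sigma < n_bits - tol:
...         return 'suspicious: far fewer distinct values than expected'
...     return 'consistent with raw A/D output'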

Here’s an example of how this works using randomly generated data that simulates a measurement from a hypothetical 8-bit A/D that outputs only the lowest 200 of the 256 values in the converter’s full dynamic range:

>>> import numpy as np
>>> from math import log
>>> x = np.random.randint(0,200,1000000)
>>> log(len(set(x)))/log(2)
7.643856189774724

Notice that the entropy of the data is less than 8, as expected.

Now we apply a simple smoothing filter (a two-point moving average) and evaluate the entropy of the resulting data set:

>>> y = (x[:-1]+x[1:])/2.0
>>> log(len(set(y)))/log(2)
8.640244936222347

Notice that the entropy is now greater than the expected value of 8. The averaging introduces half-integer values that the A/D could never have produced, nearly doubling the number of distinct values (from 200 to 399). We conclude that the data has been filtered after it was collected.
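
The opposite signature is just as easy to produce. Coarsely quantizing the same raw data, as a crude stand-in for lossy compression, collapses the 200 original levels down to 25 and drives the actual entropy well below the expected value:

>>> z = (x // 8) * 8          # coarse quantization: only 25 distinct levels survive
>>> round(log(len(set(z)))/log(2), 2)
4.64

With the entropy_check sketch from above, both altered data sets are flagged while the raw data passes:

>>> entropy_check(x, 8)
'consistent with raw A/D output'
>>> entropy_check(y, 8)
'altered: more distinct values than the A/D can produce'
>>> entropy_check(z, 8)
'suspicious: far fewer distinct values than expected'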

Written on July 26, 2020