Algorithms are the molecules of all forms of Artificial Intelligence (A.I.). An algorithm can be described as a formula: a finite series of instructions that converts input data (for example, data generated by search engine queries, mouse clicks and visits to web pages) into "output", a certain result. To enable profiling of certain categories of people or phenomena, algorithms must be trained with data sets. This training, known as Supervised Machine Learning, is still a human affair. The value that the researcher or client assigns to the data set influences the outcome of the data process.
Like the input of the data set, the outcome depends on prejudices. If the usual method of profiling is continued, a bias in the data sets, and therefore in the profile or risk assessment, is inevitable. Bias-driven profiling increases the risk of false positives, a risk that grows further when algorithms are trained with outdated (personal) data. Moreover, the application of algorithms in the data analysis process has a weakness: algorithms are unable to describe causality, the relationship between cause and effect. They can only expose correlations between phenomena. That makes predictive algorithms unsuitable for testing hypotheses.
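To make the link between biased data sets and false positives concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the two neighbourhoods "A" and "B", the numbers, the threshold and the function names are hypothetical and do not describe any real profiling system. The toy model simply learns how often people from each neighbourhood were flagged in the past.

```python
from collections import defaultdict

# Hypothetical training records: (neighbourhood, was_flagged_by_investigator).
# Assumption: neighbourhood "A" was investigated far more often, so it carries
# more fraud labels even though the true fraud rate is the same in both.
training = (
    [("A", True)] * 30 + [("A", False)] * 70 +
    [("B", True)] * 5 + [("B", False)] * 95
)

def train_flag_rates(records):
    """Learn the share of flagged records per neighbourhood."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for group, label in records:
        total[group] += 1
        flagged[group] += int(label)
    return {g: flagged[g] / total[g] for g in total}

def predict(rates, group, threshold=0.2):
    """Mark someone as 'high risk' when the learned rate exceeds the threshold."""
    return rates.get(group, 0.0) > threshold

def false_positive_rate(rates, people):
    """people: list of (group, truly_fraudulent). Returns the rate per group."""
    false_positives = defaultdict(int)
    innocents = defaultdict(int)
    for group, truly_fraudulent in people:
        if not truly_fraudulent:
            innocents[group] += 1
            false_positives[group] += int(predict(rates, group))
    return {g: false_positives[g] / innocents[g] for g in innocents}

rates = train_flag_rates(training)

# Hypothetical population with an identical true fraud rate (10%) in both groups.
population = (
    [("A", True)] * 10 + [("A", False)] * 90 +
    [("B", True)] * 10 + [("B", False)] * 90
)
fpr = false_positive_rate(rates, population)
print(rates)  # learned flag rate per neighbourhood
print(fpr)    # share of innocent people wrongly marked as high risk
```

In this sketch every innocent resident of "A" is flagged and no innocent resident of "B" is, although both groups behave identically. The difference comes entirely from the labels the model was trained on.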
I will now elaborate on the weaknesses of the Data Analysis process and of Supervised and Unsupervised Machine Learning. I will explain why these weaknesses make the instruments unsuitable as methods for profiling and risk assessment (for example in the context of social security/social insurance fraud). Moreover, I will explain why the outcome of the automated decision making process will be biased, since algorithms are inherently biased.
(Big) Data analysis: 'garbage out' is necessary to prevent algorithms from being trained incorrectly
An analytical instrument must be used to separate random or unstructured data from relevant data. The amount of data must not only be reduced, but also refined to arrive at a more specific result. A so-called "Warehouse", a digital collection of data from various sources, can be used for this purpose. To prevent predictive algorithms from being trained with outdated data and to reduce the risk of false positives, the data must be refreshed and limited in size, i.e. "garbage out". Using this reference point, correlations between data can be discovered. Nothing is known about how long personal data is stored in a Warehouse; on the basis of the aforementioned, it will probably be stored for an indefinite period of time, apart from regular refreshment. Data controllers should be cautious during this early phase of the data analysis process: if inaccurate data is stored and applied, the person under investigation may be incorrectly designated as a 'suspect' or 'potential fraudster'.
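The "garbage out" step described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the record fields (person_id, income, last_updated), the function name garbage_out and the retention period of 365 days are hypothetical, since nothing is known about the actual storage period.

```python
from datetime import date, timedelta

def garbage_out(records, today, max_age_days=365):
    """Keep only records that are complete and not older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    kept = []
    for record in records:
        updated = record.get("last_updated")
        if updated is None or updated < cutoff:
            continue  # undated or outdated: do not train on it
        if record.get("income") is None:
            continue  # incomplete: do not train on it
        kept.append(record)
    return kept

# Hypothetical Warehouse content.
warehouse = [
    {"person_id": 1, "income": 21000, "last_updated": date(2024, 3, 1)},
    {"person_id": 2, "income": 18000, "last_updated": date(2019, 6, 15)},
    {"person_id": 3, "income": None, "last_updated": date(2024, 1, 10)},
]

fresh = garbage_out(warehouse, today=date(2024, 6, 1))
print([record["person_id"] for record in fresh])
```

Only the first record survives: the second is outdated and the third is incomplete. A filter like this reduces and refreshes the data, but it cannot tell whether the remaining records are accurate.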
Data mining, Supervised Machine Learning and Artificial Intelligence
An important step in discovering correlations between datasets is "Knowledge Discovery in Databases", or "data mining." Statistical Analysis System (SAS) defines data mining as "the process of searching for anomalies, patterns and correlations, in order to predict a certain outcome." Data mining builds on "machine learning", a technique that involves training algorithms based on statistical data. Formulas are entered to develop algorithms, training sets of data are given as "input" and the desired result is provided as "output". Algorithms are instructed to make the connection between input and output and to evaluate themselves. The result of this feedback is used to identify patterns. This form of "supervised machine learning" is ideally suited to classifying data: algorithms categorize data on the basis of pre-provided, labeled data sets and learn to "label" data, that is, to assign a certain characteristic. If a picture of a blackjack is given as input and the title "blackjack" is provided as the output, the algorithms learn to classify pictures of blackjacks.
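The supervised approach described above can be illustrated with a very small classifier. This is a sketch under assumptions, not a description of any real system: instead of pictures it uses two hypothetical measurements per object (length in centimetres and weight in grams), and it uses a nearest-centroid rule, one of the simplest ways to learn from labeled examples.

```python
from math import dist

def train_centroids(examples):
    """examples: list of (features, label). Returns the mean features per label."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def classify(centroids, features):
    """Assign the label whose centroid lies closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical labeled training set: the labels were assigned by a human.
labeled = [
    ((22.0, 400.0), "blackjack"),
    ((25.0, 450.0), "blackjack"),
    ((24.0, 420.0), "blackjack"),
    ((14.0, 20.0), "pen"),
    ((15.0, 25.0), "pen"),
    ((13.0, 18.0), "pen"),
]

centroids = train_centroids(labeled)
prediction = classify(centroids, (23.0, 430.0))
print(prediction)
```

The classifier can only ever return a label that a human supplied in the training set. If those labels are wrong or prejudiced, the classifier reproduces them.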
Unsupervised Machine Learning, unsuitable for profiling
Another form of Machine Learning, "Unsupervised Machine Learning", lacks the example of labeled data sets. Algorithms discover patterns in unstructured data. Unsupervised Machine Learning is used to cluster unstructured data, in order to categorize these data without using labels. The algorithms, for example, assign all kinds of pictures of blackjacks to one cluster, but do not know what these weapons are called. This makes Unsupervised Machine Learning unsuitable for the process of profiling, in which not only relationships have to be generated, but names and classifications (for example "fraud!") also have to be linked to a certain result, the outcome of the profiling process.
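The clustering described above can be sketched with a plain k-means routine. The data points are hypothetical measurements (length in centimetres, weight in grams) standing in for pictures, and taking the first k points as starting centres is a simplification chosen to keep the example deterministic.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Plain k-means; the first k points serve as the initial centres."""
    centres = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        assignment = [
            min(range(k), key=lambda c: dist(centres[c], p)) for p in points
        ]
        # Move every centre to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical unlabeled measurements: no names are attached to the objects.
unlabeled = [
    (22.0, 400.0), (14.0, 20.0), (25.0, 450.0),
    (15.0, 25.0), (24.0, 420.0), (13.0, 18.0),
]

clusters = k_means(unlabeled, k=2)
print(clusters)  # cluster numbers only, no names
```

The result is a list of cluster numbers. The heavy objects end up together and the light objects end up together, but nothing in the output says "blackjack" or "fraud"; a human still has to attach that meaning.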
A subform of Machine Learning is Deep Learning, the discovery of complex patterns in large amounts of data through a layered neural structure. The distinguishing feature of deep learning is the need for substantial "computational power" to perform a complex task; one neural layer can consist of up to four hundred processors. Machine Learning and Deep Learning can be regarded as subsets of Artificial Intelligence (A.I.), but note that A.I. is not synonymous with either Machine Learning or Deep Learning. Artificial Intelligence studies the ability of computers to perform complex tasks autonomously and to solve problems.
Conclusion: algorithms are trained by human prejudice and therefore judgmental by design
Profiling in the current form is characterized by 'supervised machine learning', the training of algorithms with pre-entered data sets. These data sets, therefore algorithms, express human judgments by design. Algorithms used for profiling are not intelligent in the sense that they can discover patterns themselves. A weakness in profiling is that (predictive) algorithms are not used to test hypotheses, but only to present correlations between data and phenomena. Investigating the causality of an event remains a human matter. 'Mining' large amounts of data is also not a suitable method for testing hypotheses.
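The difference between correlation and causality can be illustrated with a small simulation. It is a sketch under assumptions: the three variables are hypothetical, and the data are generated so that a hidden factor drives two observed values that have no direct influence on each other.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical hidden factor that the analyst does not observe.
hidden = [rng.gauss(0, 1) for _ in range(1000)]

# Two observed values that both depend on the hidden factor,
# but not on each other.
observed_a = [h + rng.gauss(0, 0.1) for h in hidden]
observed_b = [h + rng.gauss(0, 0.1) for h in hidden]

r = pearson(observed_a, observed_b)
print(round(r, 3))
```

The coefficient comes out close to 1, yet by construction neither observed value causes the other. An algorithm that only sees the two observed columns reports a strong relationship; establishing what actually causes it remains a human matter.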
No significance may be attached to the results of profiling and data mining until they have been reviewed through human intervention by an appropriate method. "False positives" should be trained out of programs that use predictive algorithms. Since the data sets that train predictive algorithms are biased, the outcome of the automated decision making process or risk assessment will inherently be biased as well.
Mercedes Bouter LL.M.