THIS POST IS ARCHIVED

Tuesday, July 19, 2022

Tic-Tac-Toe: Knowledge Extraction From Imperative Code

Applications implemented in imperative programming languages encode vast amounts of human knowledge at their core, but the essence of the application logic is often obscured by the imperative nature of the implementation language. Imperative programming languages require the programmer to make detailed decisions on how the program is executed, and it can be hard to distill the actual logic from these codebases.

We explore the possibility of analyzing imperative code and extracting some of the essential application logic, eliminating irrelevant implementation details. This process is useful to document and verify the application logic, but in addition, legacy projects (for example, mainframe applications) can be converted semi-automatically to higher-level languages in this way. Being able to have direct access to the application encoded knowledge opens up possibilities for representing the application essence in a knowledge graph. This in turn enables all the benefits of such representation, including making it broadly available to developers or being able to issue complex queries against the accumulated knowledge.

To illustrate our knowledge extraction scheme, we use the game Tic-Tac-Toe, implemented in Java. The Java program manages a game between two players. It creates a board representation as a 2D character array and then, in a loop, reads each player’s moves until a winner is found or a draw is reached. The application logic that we would like to identify and extract contains the following:

the representation of the board itself,
the rules that determine whether there is a winner vs a draw, and
the constraints imposed on each move: only “O” or “X” are allowed, and you can only play on a previously unoccupied cell of the board.

Relational Models for Programs

First we are going to import the whole Java program as a set of relations that represent each Java statement. For this process, we build upon the Doop framework which is a state-of-the art static program analysis framework. Our analysis applies some of the general (analysis independent) facilities offered by Doop, to eventually represent a Java program in a relational model.

For instance, when we have an assignment x = y, this can be represented by the relation def assign = ("x", "y"). Similarly, storing a value to an array cell like array[i] = x can be represented by the relation def array_store = ("array", "i", "x").

A more intricate example is storing values to a 2D array. In Java, the assignment array[i][j] = x compiles to a sequence of simpler expressions. All but the last dimension access (j) are reading from memory locations to find the right array position to change. The last access is the actual write operation.

The initial expression boils down to first reading index i and storing the result to some new variable temp1 = array[i] and then storing x at index j of that temp location, as in temp1[j] = x. This is relationally represented by def array_load = ("temp1", "array", "i") and def array_store = ("temp1", "j", "x"). Conversions from Java to a relational model are common in state-of-the-art program analysis tools.

Logic Modeling

Now we can write logic to analyze the program and extract useful information depending on the features we want. For this we need a few different sets of logic rules.

Data (or Value)-Flow

This is logic that tracks where each program variable flows into. For example, in y = x followed by z = y, the value of variable x flows indirectly into variable z. Subsequently, when the program reads from variable z it is important to be able to keep track of this value flow. The following is a relational model of this reasoning:

def flows(fromVar, toVar) =
    assign(toVar, fromVar) // fromVar flows into toVar

def flows(fromVar, toVar) =
    assign(toVar, middleVar) and
    flows(fromVar, middleVar) // fromVar flows into toVar

Constant Folding

Constant folding is the logic for propagating and combining constants found in the program text. A simple example is n = 10; size = n + 3; new int[size] where we need to infer that the allocated memory for the array is 13 integers. Usually, constant folding logic needs to work together with data-flow logic for better results. An example of this reasoning logic follows:

def var_value(toVar, const) =
    assign_constant(toVar, const)

def var_value(toVar, newConst) =
    add(toVar, fromVar, const1) and
    var_value(fromVar, const2) and
    combine(const1, const2, newConst)

Symbolic Analysis

We need logic for symbolic analysis alongside a basic solver in order to create symbolic expressions for various program statements and potentially simplify them. For instance, we might create a symbolic expression for a condition x > y && y == 10 and later simplify it to x > 10 when analyzing the body of that branch.

Handling Imprecision

We need logic to handle imprecision on code parts that we don’t need to or cannot analyze more. Program analysis has tackled this in the past by introducing various kinds of abstractions. In our setting, if for instance, in a loop we add to a sum variable the constant value 1 an unknown number of times, we use the abstraction sum +* 1 to represent the value of the result.

Loop Intent Reasoning

Loops in imperative programming languages have various details that obscure the logic intent of the source code. We aim to infer the original intent from each looping construct. For example, we have logic that:

Infers induction variables (the index being incremented/decremented in the loop).
Infers the condition variable of the loop. That is the variable that determines whether or not the loop will terminate.
Tracks the initial value for such condition variables, before the loop.
Handles complex loop conditions (for example, A && B || C).

We use the aforementioned logic (alongside the array handling logic mentioned below) to infer for instance that a variable is used to iterate a dimension of an array. This holds when:

(a) a variable starts from 0,

(b) it is incremented by 1 in each iteration,

(d) it is used to read/write to the array.

Similar reasoning can be applied to variables that iterate an array in the opposite direction (from the end towards the start).

Array Reasoning

When dealing with arrays, we need logic to handle a variety of features:

Infer the number of dimensions in an array and their sizes.
Record constant values stored in the array (for example, initialization values).
Record expressions that write to some unknown array index (for example, the index’s value depends on user input) – and handle that as potentially writing to any cell in the array.

Testing the Waters With Tic-Tac-Toe

We implement an analysis that incorporates all the aforementioned logic and apply it to the Tic-Tac-Toe example. The results of the analysis are then used to semi-automatically generate a relational distilment of the application logic.

Our approach is not fully automatic, and that is a conscious choice. By specifically marking a few key code parts, we help our analysis avoid modeling parts of the application that are “noise” to the core application logic. In more detail:

For every array of interest (which we mark by hand), our model includes two relations. One to model the array’s dimensions, and another to model the actual values stored in each array cell. This Java snippet from our Tic-Tac-Toe example allocates the 2D array that stores the game board.

char[][] board = new char[3][3]

Our analysis infers the following relational model that corresponds to the previous code.

def board:dims = (3, 3)
def board:values(dim1, dim2, val)

The next essential step in the original application is the board array initialization.

//Initialize board with dashes
for (int i = 0; i < 3; i++) {
	for (int j = 0; j < 3; j++)
		board[i][j] = '-';

Its respective inferred model capturing the essence of the nested loops is the following.

def board:init(i, j, "-") =
   board:dims(dim1, dim2) and
   range[0, dim1 - 1, 1](i) and
   range[0, dim2 - 1, 1](j)
   from dim1, dim2

Handling each player’s moves is a bit trickier. The original code is expressive enough to allow a turn-by-turn play, reading players’ moves from the program’s input.

while (!gameEnded) {
      row = in.nextInt();
      col = in.nextInt();
      ...
      board[row][col] = c;

Our analysis currently infers an imprecise simulation of the original intent, by accepting moves as data in an external CSV file.

def board:data(d1, d2, v) =
   board:csv(:row, _, d1) and 
   board:csv(:col, _, d2) and
   board:csv(:val, _, v)

Finally, we combine information regarding the board state in a core relation.

def board(d1, d2, v) = board:init(d1, d2, v)
def board(d1, d2, v) = board:data(d1, d2, v)

In addition to all the previous logic, we analyze specific methods of interest (also marked by hand) as methods that correspond to (potentially multiple definitions of) relations on a high level. Such methods might, for instance, participate in loop conditions. One such method is playerHasWon, which determines whether we have a winner in the current game. The original Java implementation is quite straightforward, checking for a winner on each row, column and the diagonals.

//Check each row
for (int i = 0; i < 3; i++)
   if (board[i][0] == board[i][1] &&
      board[i][1] == board[i][2] &&
      board[i][0] != '-')
      return board[i][0];
//Check each column
for (int i = 0; i < 3; i++)
   if (board[i][0] == board[i][1] &&
      board[i][1] == board[i][2] &&
      board[i][0] != '-')
      return board[i][0];
//Check the diagonals
if (board[0][0] == board[1][1] &&
   board[1][1] == board[2][2] &&
   board[0][0] != '-')
   return board[0][0];
if (board[2][0] == board[1][1] && 
   board[1][1] == board[0][2] && 
   board[2][0] != '-')
   return board[2][0];
//Otherwise nobody has won yet
return ' ';

Our relational model captures the essence of the imperative code.

def playerHasWon(v) =
   board(i, 0, v) and board(i, 1, v) and board(i, 2, v) and
   v != "-" from i

def playerHasWon(v) =
   board(0, i, v) and board(1, i, v) and board(2, i, v) and
   v != "-" from i

def playerHasWon(v) =
   board(0, 0, v) and board(1, 1, v) and board(2, 2, v) and
   v != "-"
def playerHasWon(v) =
   board(2, 0, v) and board(1, 1, v) and board(0, 2, v) and
   v != "-"

def playerHasWon:default(" ") =
   not playerHasWon(_)

Similar modeling is done for the method that determines whether the board is full or not.

To wrap up our experiment, we test our generated relational model of our Tic-Tac-Toe application by providing a CSV file with the moves for each player (in order). Feeding the following data into our logic program will correctly infer that the winner of the game is the player with the “O”s.

row,col,val
0,0,X
1,1,O
0,2,X
0,1,O
2,0,X
2,1,O

Moving Forward

For our experiment, we used a Java implementation of the classic game of Tic-Tac-Toe. The original application is in the order of a couple hundred lines of code. Our distilled relational representation of the essential knowledge encoded in the application is just a fraction of that. Our generated relational model is a positive first step into extracting the essential application logic from a given imperative program.

Next steps towards augmenting our capabilities to recover application logic from imperative source code is to improve handling of state and temporal features. In a real Tic-Tac-Toe game, players take turns and the board changes step by step. This temporal aspect is not yet included in the analysis and we infer only the win/lose status, alongside some integrity constraints from a given set of unordered steps.

We expect that adding temporal features is quite feasible, but it requires altering our static abstraction of the source program. We could simulate that in our modeling by introducing a temporal dimension in our relations. This will reflect that each move happens at a specific point in time and it will lead into an incremental program.

We continue to experiment with the usefulness of human guidance in semi-guided application logic inference. There is a clear trade-off in the level of detail for the guidance. Too much, and the person providing the guidance has to deal with the overwhelming details of the actual program implementation. Too little, and the analysis might treat irrelevant parts of the code as equally important to the essence of the application logic.

There are vast amounts of human knowledge encoded in procedural/imperative code. Our work is a step towards extracting that knowledge, allowing it to be represented in a knowledge graph enabling broader access and easier querying.