Posts Tagged conversion

Converting R code to Python: A Test

Background

I’ve been using R as my statistics programming environment for decades. In fact, I started out using its predecessor, an earlier (non-open source) program called S. In the last 10 or so years, the role of data scientist has appeared, and in the main, data scientists use the Python programming language, which comes with its own many statistical libraries. In terms of statistics, there’s a huge overlap in the capabilities of R and Python — pretty much anything you can do in R, you can do in Python, and vice-versa. Of course, Python is also used everywhere these days in places you wouldn’t consider using R, e.g. building apps or websites.

In 2020, I completed a few Python modules at DataCamp, but haven’t needed to use it much since. However, Python is pretty much the #1 requirement when it comes to applying for data science positions, so I need to get comfortable with it.

Where I have my own (fairly extensive) library of R functions for various modelling, credit risk and maths purposes, it’d be nice to port all that code to Python — and even nicer if it could be done with as little effort as possible on my part! 😀 So let’s see how we get on with a small-ish function.

The function in R

In the 90s, my MSc dissertation was concerned with exact tests in contingency tables: see the wiki entry Fisher’s exact test for more details. In the PhD thesis of Dr Danius Michaelides, Exact Tests via Complete Enumeration: A Distributed Computing Approach, he describes an algorithm that enumerates integer partitions, subject to constraints (see Section 2.3.2 and Figure 2.2). This algorithm is used in the enumeration of contingency tables with fixed margins.

I’ve recreated this algorithm from his description. The function partitions_with_constraints takes two parameters: v, an array of maximum values, and N, the integer to be partitioned; and returns a matrix of all the possible partitions.

partitions_with_constraints <- function(v, N)
{
  vl = length(v);
  stopifnot( vl >= 2 );
  stopifnot( all(v > 0) );
  stopifnot( N <= sum(v) );

  cells = rep(NA,vl);

  res = c();

  i = 1;

  while (i > 0)
  {
    allocated = if(i==1) 0 else sum(cells[1:(i-1)]);
    remaining = N - allocated;
    
    if (i == vl) {

      # We're at the last cell,
      # which can only be the amount remaining.
      # Add the cells array to our results.

      cells[i] = remaining;
      res = rbind(res, cells);
      i = i - 1

    } else {

      if (is.na(cells[i])) {
        # Set the cell to the minimum possible
        cells[i] = max(0, remaining - sum(v[(i+1):vl]));
      } else {
        # Otherwise, increment
        cells[i] = cells[i] + 1
      }
      cell.max = min(remaining, v[i]);

      if (cells[i] > cell.max) {
        # We've hit the maximum; blank the cell, and move back one
        cells[i] = NA;
        i = i - 1;
      } else {
        i = i + 1;
      }
    }
  }

  # Restore the indices on the rows:
  res = matrix(res, nrow=dim(res)[1], byrow=F);
  return(res);
}

If we run it using the example from the thesis, we get:

partitions_with_constraints(c(3,1,3),4);
     [,1] [,2] [,3]
[1,]    0    1    3
[2,]    1    0    3
[3,]    1    1    2
[4,]    2    0    2
[5,]    2    1    1
[6,]    3    0    1
[7,]    3    1    0

To explain the output in words: there are 7 ways in which we can partition the number 4 into sums of three numbers, where the first number is between 0 and 3, the second number is between 0 and 1, and the third number is between 0 and 3.

Converting the code with Copilot

For the conversion, I used Microsoft Copilot. (There is a ‘pro’ version, but I’m using the free website version.) If you google Copilot, it seems to be referred to mainly as a ‘chatbot’, but that term feels rather inadequate — so I asked Copilot how to describe itself! It said “Microsoft Copilot is an AI companion designed to assist users by providing information, answering questions, and engaging in conversation.” An ‘AI companion’ is a better description, I think.

Into the ‘Ask me anything’ box, I wrote “Please convert this R function into Python:” and pasted in the code from above. Within a few seconds, it returned the Python code you see below.

The function in Python

Here’s the code that Copilot returned:

import numpy as np

def partitions_with_constraints(v, N):
    vl = len(v)
    assert vl >= 2
    assert all(i > 0 for i in v)
    assert N <= sum(v)

    cells = [None]*vl
    res = []

    i = 1

    while i > 0:
        allocated = 0 if i == 1 else sum(cells[0:i-1])
        remaining = N - allocated

        if i == vl:
            cells[i-1] = remaining
            res.append(cells[:])
            i -= 1
        else:
            if cells[i-1] is None:
                cells[i-1] = max(0, remaining - sum(v[i:vl]))
            else:
                cells[i-1] += 1
            cell_max = min(remaining, v[i-1])

            if cells[i-1] > cell_max:
                cells[i-1] = None
                i -= 1
            else:
                i += 1

    return np.array(res)

It also output some additional text, describing how the primitive function calls are translated: This Python function should behave similarly to your R function. Please note that Python uses 0-based indexing, unlike R which uses 1-based indexing. This means that some adjustments were needed in the translation. Also, Python does not have a direct equivalent to R’s NA, so None is used instead. The rbind function in R is replaced with list append and conversion to a numpy array in Python. The stopifnot function in R is replaced with assert in Python. The all and sum functions in Python behave similarly to their counterparts in R. The length function in R is replaced with len in Python. The rep function in R is replaced with list multiplication in Python. The max and min functions in Python behave similarly to their counterparts in R.

For me, the most impressive thing is the conversion of arrays from 1-based (i.e. arrays are indexed starting from 1 in R) to 0-based (as in Python, like e.g. C/C++); that wasn’t specified anywhere in the instructions, it just ‘knew’ to do it.

I often use one of the following online sites to run Python (if the code isn’t too big):

I used the first one, pasting the code in, plus the call to the function with the same parameters as in the R version:

print(partitions_with_constraints([3,1,3],4))
[[0 1 3]
 [1 0 3]
 [1 1 2]
 [2 0 2]
 [2 1 1]
 [3 0 1]
 [3 1 0]]

which you can check is the same data as the R version, only the output formatting is different.

Full disclosure: the code didn’t run first time — I’d made a tiny mistake in one of my assertions, that Copilot just translated without picking up on it. And when it did run, there was no output, because I’d forgotten the print(…) around the function call. But these were very minor issues, fixed in seconds.


Copilot generating Python from a description

I want to present another instance of Copilot helping with coding; in this case, not a conversion, but coding an algorithm from a description of what the algorithm should do.

As I’ve mentioned before (in the post Practical queries using binary operations), I occasionally toy around with a Sudoku solver I’ve written in SQL. There’s one piece of cell-checking code I haven’t written yet, so could Copilot write it for me?

My prompt:“I am writing a Sudoku solver. For each remaining cell, I have a list of possible values for that cell. For example, I have a row where four cells have been filled in: 1,4,6 and 8. The remaining possible cell values are (3,7),(2,3,5),(2,3,9),(5,9),(5,7,9) , where parentheses delimit cells. What algorithm can I use to find all possible combinations of cell values? E.g. for my example here, 3,2,9,5,7 is a possibility.”

Copilot wrote: You can use a recursive backtracking algorithm to find all possible combinations of cell values. Here’s a Python function that does this:

def solve_sudoku(possible_values, index=0, solution=[]):
    if index == len(possible_values):
        print(solution)
        return

    for value in possible_values[index]:
        if value not in solution:
            solve_sudoku(possible_values, index + 1, solution + [value])

possible_values = [(3,7),(2,3,5),(2,3,9),(5,9),(5,7,9)]
solve_sudoku(possible_values)

Looks promising… here’s the output when I run the code at Online Python:

[3, 2, 9, 5, 7]
[3, 5, 2, 9, 7]
[7, 2, 3, 5, 9]
[7, 2, 3, 9, 5]
[7, 3, 2, 5, 9]
[7, 3, 2, 9, 5]

It worked, first time.

Then, I asked if it could convert the code to SQL:

SQL is not designed for this type of recursive computation... That being said, you can use recursive Common Table Expressions (CTEs) in SQL for some types of recursive problems. However, the problem you’re trying to solve involves backtracking, which is not straightforward to implement in SQL.

Oh well, you can’t win them all! [I’m not entirely sure how I’m going to convert it to SQL, but the Python code has given me some ideas, so that’s better than nothing.]


Final thoughts

This all worked much better than I thought it would, and the odds are very good that I can convert all my R code to Python without too much work. Sure, the algorithms above aren’t that complicated, but none of my R code is that complicated; most of the functions are wrappers around calls to R libraries. If direct conversion doesn’t work, I can try asking Copilot to recreate the code from a text description of the input/output.

, , , , , , , ,

Leave a comment