My favourite days are the ones where I get to solve a seemingly difficult everyday problem with mathematics. A few weeks ago, my friend Andrei came to me via IRC with a question, which I have heavily paraphrased:

I've got a list of the beers in my cellar, and each beer has a price associated with it. I'd like to figure out how to generate a list of fair trades between myself and another person who has their own list of beers and prices, for any combination of bottles between the two of us.

Or, more generally: given two lists of items, each item with a fixed value, how may we calculate the possible combinations of “equal” groups of one or more items for each list when comparing to the other list in question?[1]

A Computational Complexity Problem

As it turns out, the above question may be neatly encapsulated by the subset sum problem. And, as with most interesting problems in computability, solving SUBSET-SUM exactly is NP-Complete.

We are in luck, however. SUBSET-SUM may be solved approximately via a deterministic, fully polynomial-time approximation scheme, or FPTAS[2]. For NP-Complete problems that don't require exact solutions, that's about as good as it gets.

First, let's describe how we can arrive at the approximate subset sum algorithm[3], and then we'll look at some Python code that will generate a set of results for us.

The Decision Problem

The decision problem for subset sum is to determine if there exists any subset of a set that adds up exactly to a given value. The naïve, brute-force method of arriving at a solution would be to generate the powerset of our given item list, and then scan through all generated subsets and their respective sums until we find a match (if any). Generating the powerset of a set of N items has a time complexity of O(2^N), which indicates that this naïve method will quickly become unreasonable for modern processors once we go above ~35 input items.

from itertools import chain, combinations

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))

print(list(powerset([1, 2, 3])))
# >>> [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
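To make the decision problem concrete, here is a small brute-force sketch built on the same powerset idea (the `has_subset_sum` helper name is ours, introduced only for illustration):

```python
from itertools import chain, combinations

def has_subset_sum(items, target):
    """Brute-force decision: does any subset of `items` sum exactly to `target`?"""
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))
    return any(sum(subset) == target for subset in subsets)

print(has_subset_sum([650, 1200, 1350], 1850))  # True: 650 + 1200
print(has_subset_sum([650, 1200, 1350], 1000))  # False
```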

The Optimization Problem

A somewhat obvious improvement to the above Python code is to calculate the partial sum of each subset as it is generated. If the partial sum of a subset exceeds our target total at any point, we may discard that subset (and any extension of it).

This does not change the worst-case time complexity for the given decision problem, but it does provide us with a framework on which we can build the associated optimization problem[4] of finding a subset sum that is within a given tolerance of our target total.
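As a sketch of that pruning idea (the `pruned_subsets` name is ours, not part of the original discussion): each pass extends the surviving subsets by the next item, discarding any extension whose partial sum exceeds the target. It is still exponential in the worst case.

```python
def pruned_subsets(items, target):
    """Enumerate (subset, partial_sum) pairs, discarding any branch
    whose partial sum exceeds `target` as soon as it is created."""
    acc = [((), 0)]  # start from the empty subset
    for x in items:
        # Extend every surviving subset by x, keeping only sums <= target.
        acc += [(subset + (x,), total + x)
                for subset, total in acc
                if total + x <= target]
    return acc

for subset, total in pruned_subsets([650, 1200, 1350], 2000):
    print(subset, total)
```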

Pseudocode

The next observation is that we don't really care about how the partial sums are constructed (we will later, but for the purposes of exposition we'll forget about that for the moment). Thus, we don't need to keep track of all the elements within each subset, only the relevant partial sum. This leads us to the EXACT-SUBSET-SUM algorithm, which simply computes the list of sums of all subsets that do not exceed a specified target total for a given input list:

    EXACT-SUBSET-SUM(S, t)
        n = LENGTH(S)
        L_0 = [0]
        for i = 1 to n
            L_i = MERGE-LISTS(L_{i-1}, L_{i-1} + x_i)
            for j = 1 to LENGTH(L_i)
                if L_i[j] > t then DELETE L_i[j]

        return MAX(L_n)

See [5] for an explanation of MERGE-LISTS; x_i denotes the i-th element of S.
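A direct Python transliteration might look like this (the `exact_subset_sum` name is ours; `sorted(set(...))` stands in for MERGE-LISTS, since it both merges and removes duplicates):

```python
def exact_subset_sum(items, target):
    """Compute the largest subset sum not exceeding `target`,
    tracking partial sums only (no subset membership)."""
    sums = [0]
    for x in items:
        # MERGE-LISTS(L, L + x), then delete sums greater than the target.
        shifted = {s + x for s in sums}
        sums = sorted(s for s in set(sums) | shifted if s <= target)
    return max(sums)

print(exact_subset_sum([650, 1200, 1350, 450], 3000))  # 3000: 1200 + 1350 + 450
```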

This does not improve the worst-case time complexity when compared to the brute-force generation of the entire powerset, but it does set us up more readily for the approximation algorithm, where we introduce a TRIM procedure that scans the intermediate lists for values that are within δ of each other and combines them into a single value. With a little bit of work[6], this can be shown to be an FPTAS:

    APPROXIMATE-SUBSET-SUM(S, t, ε)
        n = LENGTH(S)
        L_0 = [0]
        for i = 1 to n
            L_i = MERGE-LISTS(L_{i-1}, L_{i-1} + x_i)
            L_i = TRIM(L_i, ε/2n)
            for j = 1 to LENGTH(L_i)
                if L_i[j] > t then DELETE L_i[j]

        return MAX(L_n)

The Implementation, in Python

Since the impetus for this discussion was finding a computable solution to a real-world problem, we would be remiss not to include something a bit more tangible than pseudocode. We've also made a slight modification to the classical solution: internally, we use a list of hashes instead of a simple list of integers corresponding to the partial sums.

These hashes allow us to track both the current partial sum for that list entry in addition to tracking which list elements were used to construct that particular partial sum. In the context of computing possible groupings of beers to trade, knowing which beers belong in a group is just as essential as their combined total value.

The hashes (or dictionaries, as they're called in Python) are composed of two keys, where value tracks the current partial sum total and partials keeps track of which values were used to arrive at said value:

{'value': <integer>, 'partials': [<integer>, <integer>, ...]}

Note: This is by no means the only way of tracking which list item elements are used to construct a partial sum, nor is it the most efficient (e.g. we trade memory for a slightly smaller constant in the time complexity due to the caching of the partial sum total instead of computing the total on each iteration).

First, we implement the TRIM procedure, which merges list elements when they are within a given δ of each other. Note that for the below implementation to work, we must ensure that the input data list is sorted:

def trim(data, delta):
    """Trims elements within `delta` of other elements in the list."""

    if not data:
        return []

    # Always keep the first element; in particular, this preserves the
    # trivial zero-valued solution on each pass, so subsets that exclude
    # the first input item are still explored.
    output = [data[0]]
    last = data[0]['value']

    for element in data[1:]:
        if element['value'] > last * (1 + delta):
            output.append(element)
            last = element['value']

    return output

Next, we implement MERGE-LISTS, which combines two lists and sorts the result by partial-sum value, with a little help from the itertools and operator libraries:

import itertools
import operator

def merge_lists(m, n):
    """
    Merges two lists into one.

    We do *not* remove duplicates, since we'd like to see all possible
    item combinations for the given approximate subset sum instead of simply
    confirming that there exists a subset that satisfies the given conditions.

    """
    merged = itertools.chain(m, n)
    return sorted(merged, key=operator.itemgetter('value'))

And finally, we combine TRIM and MERGE-LISTS as described in our APPROXIMATE-SUBSET-SUM pseudocode, ensuring that we properly track the computed partial sums and their corresponding elements:

def approximate_subset_sum(data, target, epsilon):
    """
    Calculates the approximate subset sum total in addition
    to the items that were used to construct the subset sum.

    Modified to track the elements that make up the partial
    sums to then identify which subset items were chosen
    for the solution.

    """

    # Initialize our accumulator with the trivial solution
    acc = [{'value': 0, 'partials': [0]}]

    count = len(data)

    # Prep data by turning it into a list of hashes
    data = [{'value': d, 'partials': [d]} for d in data]

    for element in data:
        augmented_list = [{
            'value': element['value'] + a['value'],
            'partials': a['partials'] + [element['value']]
        } for a in acc]

        acc = merge_lists(acc, augmented_list)
        acc = trim(acc, delta=float(epsilon) / (2 * count))
        acc = [val for val in acc if val['value'] <= target]

    # The resulting list is in ascending order of partial sums; the
    # best subset will be the last one in the list.
    return acc[-1]

The above must be invoked with a list of integers for data, a target total, and an approximation parameter epsilon, which must be between 0 and 1 (e.g. epsilon=0.2 indicates that the returned total will be within 20% of the best achievable total that does not exceed the target).

A Gist of the code has been provided for convenience.

A Test Run

Let's test this out with a bit of real-world data. Assume that I have a cellar consisting of the following beers and their values[7]:

  • Bottle A: $6.50
  • Bottle B: $12.00
  • Bottle C: $13.50
  • Bottle D: $4.50
  • Bottle E: $28.75
  • Bottle F: $16.25
  • Bottle G: $15.00
  • Bottle H: $18.75

I've decided that I'd like to make a trade with my friend Andrei for some of his beers, and we've both agreed a priori that we'd like to keep the trade at less than or equal to $34.50 in value. Since SUBSET-SUM and APPROXIMATE-SUBSET-SUM are integer algorithms (think for a moment as to why this is), we multiply all dollar amounts by 100 and put them in a list. Additionally, Andrei and I have determined that a 20% epsilon on the target total is acceptable (we are friends, after all):
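(A quick way to do that conversion in code; the `round` call guards against any floating-point representation error before truncating to an integer:)

```python
prices = [6.50, 12.00, 13.50, 4.50, 28.75, 16.25, 15.00, 18.75]

# Convert dollars to integer cents.
data = [int(round(price * 100)) for price in prices]
print(data)
# [650, 1200, 1350, 450, 2875, 1625, 1500, 1875]
```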

data = [650, 1200, 1350, 450, 2875, 1625, 1500, 1875]
target = 3450
epsilon = 0.2

print(approximate_subset_sum(data, target, epsilon=epsilon))
# >>> {'value': 3425, 'partials': [0, 1350, 450, 1625]}

We can discard the trivial 0 item that is an artifact of the above implementation (which could easily be removed in the approximate_subset_sum function itself), and our solution of bottles [C, D, F][8], at $34.25 total, is well within 20% of the target total. Now all Andrei needs to do is run the same script with the same target but using his own inventory as input data, and we have ourselves a trade.
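For instance, a tiny helper (ours, not part of the implementation above) that strips the artifact from a returned result hash:

```python
def strip_trivial(result):
    """Remove the 0 artifact of the trivial starting solution
    from a result hash's `partials` list."""
    result['partials'] = [p for p in result['partials'] if p != 0]
    return result

print(strip_trivial({'value': 1850, 'partials': [0, 650, 1200]}))
# {'value': 1850, 'partials': [650, 1200]}
```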

Or he could just do what normal people do, and pick out a few bottles that he thinks I would enjoy. But where's the fun in that?

  1. Incidentally, this kind of question is one that people have been asking themselves for much of recorded human history in the context of bartering: I've got things that you want, you've got things that I want. Let's figure out a fair trade of items based on their perceived value.
  2. The difference between an FPTAS and a PTAS is subtle but important: a PTAS is an approximation scheme whose time complexity is polynomial in the input size. An FPTAS guarantees that the time complexity is also polynomial in \(\frac{1}{\epsilon}\). Practically, this means that the running time of a PTAS may increase dramatically as the approximation parameter \(\epsilon\) decreases, while that of an FPTAS will not.
  3. For a more in-depth discussion of the approximation algorithm for subset sum optimization problems, please refer to Cormen et al., 3rd edition, pp. 1128-1134.
  4. Informally, a decision problem is one where we only wish to determine if a solution exists (a yes/no question). An optimization problem is one where we wish to find the best solution.
  5. MERGE-LISTS returns a sorted merge of both input lists with duplicates removed. A Python code example is provided later on in the post.
  6. Cormen et al., pp. 1132-1133.
  7. Note that the "value" of the items might have nothing to do with their actual cost, as is often the case with beer, wine and other aged spirits. As long as both parties agree on the abstract value assigned to all items, then all is well.
  8. One small enhancement to our Python implementation would be to allow the tracking of additional item metadata other than the value, for the sake of convenience.