My favourite days are the ones where I get to solve a seemingly difficult everyday problem with mathematics. A few weeks ago, my friend Andrei came to me via IRC with a question, which I have heavily paraphrased:
I've got a list of the beers in my cellar, and each beer has a price associated with it. I'd like to figure out how to generate a list of fair trades between myself and another person who has their own list of beers and prices, for any combination of bottles between the two of us.
Or, more generally: given two lists of items, each item with a fixed value, how may we calculate the possible combinations of “equal” groups of one or more items for each list when comparing to the other list in question?1
A Computational Complexity Problem
As it turns out, the above question may be neatly encapsulated by the subset sum problem.
And, as with most interesting problems in computability, exact solutions to
SUBSET-SUM
are NP-Complete.
We are in luck, however. SUBSET-SUM
may be solved approximately via a
deterministic, fully polynomial-time approximation scheme, or FPTAS2. For
NP-Complete problems that don't require exact solutions, that's about as
good as it gets.
First, let's describe how we can arrive at the approximate subset sum algorithm3, and then we'll look at some python code that will generate a set of results for us.
The Decision Problem
The decision problem
for subset sum is to determine if there exists any subset of a set that
adds up exactly to a given value. The naïve, brute-force method of
arriving at a solution would be to generate the powerset of our given item
list, and then scan through all generated subsets and their respective sums
until we find a match (if any). Generating the powerset of a set of N
items
has a time complexity of O(2N), which indicates that this
naïve method will quickly become unreasonable for modern processors once we go
above ~35 input items.
def powerset(iterable):
s = list(iterable)
return chain.from_iterable(
combinations(s, r) for r in range(len(s)+1))
print(powerset([1,2,3]))
# >>> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)
The Optimization Problem
A somewhat obvious improvement to the above Python code is that, when generating the subsets for the powerset, we calculate the partial sums of each as they are created. If the partial sum of a subset goes above our target total at any point, we may discard that subset.
This does not change the worst-case time complexity for the given decision problem, but it does provide us with a framework on which we can build the associated optimization problem4 of finding a subset sum that is within a given tolerance of our target total.
Pseudocode
The next observation is that we don't really care about the how the partial
sums are constructed (we will later, but for the purposes of exposition
we'll forget about that for the moment). Thus, we don't really need to
keep track of all the elements within the subset, only the relevant partial sum
for each. This leads us to the EXACT-SUBSET-SUM
algorithm, which simply
computes the list of sums of all subsets that do not exceed a specified target
total for a given input list:
EXACT-SUBSET-SUM(S, t)
L = [0]
for i = 1 to LENGTH(S)
Li = MERGE-LISTS(Li-1, Li-1 + xi)
for j = 1 to LENGTH(L)
if Lj > t then DELETE Lj
return MAX(L)
See 5 for an explanation of MERGE-LISTS.
This does not improve the worst-case time complexity when compared to the
brute-force generation of the entire powerset, but it does set us up more
readily for the approximation algorithm, where we introduce a TRIM
procedure
that scans the intermediate lists for values that are within δ of each
other and combines them into a single value. With a little bit of work6, this
can be shown to be an FPTAS:
APPROXIMATE-SUBSET-SUM(S, t, ε)
L = [0]
n = LENGTH(S)
for i = 1 to n
Li = MERGE-LISTS(Li-1, Li-1 + xi)
Li = TRIM(Li, ε/2n)
for j = 1 to LENGTH(L)
if Lj > t then DELETE Lj
return MAX(L)
The Implementation, in Python
Since the impetus for this discussion was finding a computable solution to a real-world problem, we would be remiss to not include something a bit more tangible than pseudocode. We've also made some slight modifications to the classical solution, in that internally we use a list of hashes instead of simply a list of integers that correspond to the partial sums.
These hashes allow us to track both the current partial sum for that list entry in addition to tracking which list elements were used to construct that particular partial sum. In the context of computing possible groupings of beers to trade, knowing which beers belong in a group is just as essential as their combined total value.
The hashes (or dictionaries, as they're called in Python) are composed of
two keys, where value
tracks the current partial sum total and partials
keeps track of which values were used to arrive at said value
:
{'value': <integer>, 'partials': [<integer>, <integer>, ...]}
Note: This is by no means the only way of tracking which list item elements are used to construct a partial sum, nor is it the most efficient (e.g. we trade memory for a slightly smaller constant in the time complexity due to the caching of the partial sum total instead of computing the total on each iteration)
First, we implement the TRIM
procedure, which merges list elements when they
are within a given δ of each other. Note that for the below implementation
to work, we must ensure that the input data
list is sorted:
def trim(data, delta):
"""Trims elements within `delta` of other elements in the list."""
output = []
last = 0
for element in data:
if element['value'] > last * (1 + delta):
output.append(element)
last = element['value']
return output
Next, we implement MERGE-LISTS
to combine two lists and sort the result using
the most excellent itertools
library that leverages generators:
import itertools
import operator
def merge_lists(m, n):
"""
Merges two lists into one.
We do *not* remove duplicates, since we'd like to see all possible
item combinations for the given approximate subset sum instead of simply
confirming that there exists a subset that satisfies the given conditions.
"""
merged = itertools.chain(m, n)
return sorted(merged, key=operator.itemgetter('value'))
And finally, we combine TRIM
and MERGE-LISTS
as described in our
APPROXIMATE-SUBSET-SUM
algorithm pseudocode, ensuring that we properly track
the computed partial sums and their corresponding elements.
def approximate_subset_sum(data, target, epsilon):
"""
Calculates the approximate subset sum total in addition
to the items that were used to construct the subset sum.
Modified to track the elements that make up the partial
sums to then identify which subset items were chosen
for the solution.
"""
# Intialize our accumulator with the trivial solution
acc = [{'value': 0, 'partials': [0]}]
count = len(data)
# Prep data by turning it into a list of hashes
data = [{'value': d, 'partials': [d]} for d in data]
for key, element in enumerate(data, start=1):
augmented_list = [{
'value': element['value'] + a['value'],
'partials': a['partials'] + [element['value']]
} for a in acc]
acc = merge_lists(acc, augmented_list)
acc = trim(acc, delta=float(epsilon) / (2 * count))
acc = [val for val in acc if val['value'] <= target]
# The resulting list is in ascending order of partial sums; the
# best subset will be the last one in the list.
return acc[-1]
The above must be invoked with a list of integers for data
, a target
total,
and an approximation parameter epsilon
which must be between 0 and 1 (e.g.
epsilon=0.2
indicates that the target total must be within 20% of the optimal
answer).
A Gist of the code has been provided convenience.
A Test Run
Let's test this out with a bit of real-world data. Assume that I have a cellar consisting of the following beers and their value7:
- Bottle A: $6.50
- Bottle B: $12.00
- Bottle C: $13.50
- Bottle D: $4.50
- Bottle E: $28.75
- Bottle F: $16.25
- Bottle G: $15.00
- Bottle H: $18.75
I've decided that I'd like to make a trade with my friend Andrei for
some of his beers, and we've both agreed a priori that we'd like to keep
the trade at less than or equal to $34.50 in value. Since SUBSET-SUM
and
APPROXIMATE-SUBSET-SUM
are integer algorithms (think for a moment as to why
this is), we multiply all dollar amounts by 100 and put them in a list.
Additionally, Andrei and I have determined that a 20% delta on the target total
is acceptable (we are friends, after all):
data = [650, 1200, 1350, 450, 2875, 1625, 1500, 1875]
target = 3450
epsilon = 0.2
print approximate_subset_sum(data, target, epsilon=epsilon)
# >>> {'partials': [0, 650, 1200, 1500], 'value': 3350}
We can discard the trivial 0
item that is an artifact of the above
implementation (which could easily be removed in the approximate_subset_sum
function itself), and our solution of bottles [A, B, G]
8 is well within 20%
of the target total. Now all Andrei needs to do is run the same script with the
same target but using his own inventory as input data, and we have ourselves a
trade.
Or he could just do what normal people do, and pick out a few bottles that he thinks I would enjoy. But where's the fun in that?
Incidentally, this kind of question is one that people have been asking themselves for much of recorded human history in the context of bartering: I've got things that you want, you've got things that I want. Let's figure out a fair trade of items based on their perceived value.
The difference between a FPTAS and a PTAS is subtle, but an important one: A PTAS is an approximation whose time complexity is polynomial in the input size. An FPTAS guarantees that the time complexity is polynomial in (\frac )Practically, this means that the time complexity for a PTAS may increase dramatically as the approximation parameter (\epsilon) decreases, while an FPTAS will not.
For a more in-depth discussion on the approximation algorithm for subset sum optimization problems, please refer to Cormen et. al, 3rd Edition, p.1128-1134
Informally, a decision problem is one where we only wish to
determine if a solution exists (a yes/no
question). An
optimization problem is one where we wish to find the best
solution.
MERGE-LISTS returns sorted merge of both input lists with duplicates removed. A Python code example is provided later on in the post.
Cormen et. al., p.1132-1133
Note that the "value" of the items might have nothing to do with their actual cost, as is often the case with beer, wine and other aged spirits. As long as both parties agree on the abstract value assigned to all items, then all is well.
One small enhancement to our Python implementation would be to allow the tracking of additional item metadata other than the value, for the sake of convenience.