At Polarr, we view our A.I. techniques as tools to aid the creative process. We saw a great opportunity to use reinforcement learning (RL) to support our users in the complex and sometimes arduous task of editing photos and videos. This is what we term Reinforced Creativity. Obtaining labeled data for such tasks requires complex, time-consuming, and costly human work, so we formulated a proxy reward function that lets us learn the desired editing behavior much more efficiently than traditional methods. The resulting model outputs are interpretable and tunable, and we can now use them to improve editing processes.

Furthermore, RL is well suited to problems where the optimal solution is unknown but we have some measure of what is considered good. This lets us tackle problems for which assembling a custom dataset is too expensive. RL also lets us focus our model on promising regions of the search space, which makes training more sample efficient and allows us to use a smaller model with fewer parameters.

Some context.

For the A.I. research we do at Polarr, we often face problems where labeled datasets are unavailable and collecting a large dataset ourselves is prohibitively costly. This is especially true when the task involves creative work by a human expert and admits a subjective range of high-quality outcomes, as in professional photo retouching. While there is enormous potential for supporting human creativity with machine learning, obtaining suitable training data turns out to be particularly difficult in this creative context.

Computer vision has made impressive leaps in the last decade, not only because of more advanced model architectures and increasing processing power but, to a large extent, because of the release of massive labeled datasets for tasks such as object detection and image segmentation. While they still rely on a considerable amount of human work, these datasets are often created semi-automatically by scraping suitable sources of data on the internet. Supervised learning in particular has benefited enormously from this ever-growing abundance of labeled data.

ML dataset sizes over time: this chart shows the size of public image datasets over time, indicating a clear trend toward larger collections, from 60k MNIST examples in 1998 to more than 14M examples in ImageNet, introduced in 2009. Data Reference

Automatic Photo Adjustments.

Recently at Polarr, we became interested in understanding, and in training a machine learning model to understand, how expert photo editors get from raw material to refined, aesthetically pleasing and harmonious photographs. To this end, we had access to several thousand before-and-after pairs of photos edited by professionals (kindly provided by an event photography company). The key missing information, to be learned by our machine learning model, was the exact sequence of steps the experts took to get from raw originals to stunning pieces of art.

One approach to identifying these steps, often found in the literature, is to train a generative model end to end so that it directly manipulates any input image to look similar to a professionally retouched version of itself. This often combines a fidelity loss (making sure that the image is still recognizable) with one or several quality losses (making sure the image looks aesthetically pleasing and similar to the expert ground truth). The output of the model at test time can then be either an edited photo or a mask that is applied to an input picture to change properties such as lighting and color.

This kind of approach can be trained either supervised (with pairs of images before/after editing) or unsupervised (only with examples of edited images) using a GAN-like architecture.

As an example of a recently published end-to-end adjustment model, the CVPR 2019 paper "Underexposed Photo Enhancement using Deep Illumination Estimation" calculates an enhanced exposure map for a given input image. It is trained on examples of originals and edited images, using a combination of different losses for fidelity, colors, and smoothness.

Recent example of an end-to-end architecture for photo retouching. In this case, the model changes the exposure of the input image and is trained on pairs of originals and expert edits. Based on: Underexposed Photo Enhancement using Deep Illumination Estimation.

However, this approach comes with a major disadvantage: the output is a set of changed pixels, whereas what we originally cared about was finding a more understandable and tunable representation of how experts arrive at the retouched version of any given photo. While it’s possible to interpolate between the original and the generated output, we cannot easily change the strength of individual parts of the solution, such as keeping exposure changes the same while decreasing the saturation of the output. End-to-end models such as the ones mentioned above are not very interpretable, and they do not easily allow a user to modify the output slightly with a few meaningful steps.

To obtain a more interpretable representation, one can instead keep the same end-to-end loss functions as before but incorporate explicit high-level light and color adjustments into the machine learning graph. This means we create an intermediate layer in the neural network that outputs a few adjustment values, which serve as input for fully differentiable operations such as changing the exposure of the entire image. These adjustment values tell us much more about how experts edit images. This is a case of differentiable programming: we turn our photo editing algorithms into differentiable and interpretable code.

Through the incorporation of global adjustment operations (shifts in contrast, saturation, brightness, etc.) as differentiable components in the machine learning graph, we can now back-propagate our losses to learn a better predictor of adjustment values given some input image (differentiable programming).
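
A minimal sketch of this idea, assuming a hypothetical single exposure adjustment that scales pixel values by 2^ev and a plain mean-squared-error loss. The gradient is derived by hand here, standing in for the automatic differentiation that an ML framework would provide; the function names and the learning rate are illustrative assumptions, not our production code.

```python
import numpy as np

def apply_exposure(image, ev):
    """Hypothetical differentiable adjustment: scale pixel values by 2**ev."""
    return image * 2.0 ** ev

def loss_and_grad(image, target, ev):
    """Mean squared error to the target and its derivative with respect to ev."""
    edited = apply_exposure(image, ev)
    diff = edited - target
    loss = np.mean(diff ** 2)
    # d(edited)/d(ev) = edited * ln(2); the chain rule gives the loss gradient.
    grad = np.mean(2.0 * diff * edited * np.log(2.0))
    return loss, grad

rng = np.random.default_rng(0)
original = rng.uniform(0.05, 0.4, size=(8, 8, 3))  # toy under-exposed image
target = apply_exposure(original, 1.0)             # pretend the expert added one stop

ev = 0.0
for _ in range(200):
    loss, grad = loss_and_grad(original, target, ev)
    ev -= 2.0 * grad  # plain gradient descent on the single adjustment value

recovered_ev = ev
final_loss, _ = loss_and_grad(original, target, recovered_ev)
```

Because the adjustment is differentiable, the loss can be pushed back to the adjustment value itself, and the recovered value is directly interpretable as "raise exposure by about one stop".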

This approach of adding differentiable adjustment components allows us to extract a few meaningful values from the model, which helps us understand the steps needed to enhance a given input image. Still, this setup turns out to be quite impractical. Firstly, it limits us to differentiable operations, which rules out supporting the full range of adjustments available in our photo editor. Secondly, translating photo adjustments into a differentiable graph leads to a heavy and slow network that is arduous to train due to its long inference time. Finally, whenever we change how our adjustment operations are implemented in the app, we would have to find and update the corresponding parts of the graph to keep the two consistent, so the setup does not adapt well to change.

Our Approach

Instead of requiring adjustments to be part of our differentiable ML graph, we reached deep into our bag of tricks and turned to the field of reinforcement learning. With reinforcement learning, the output of our network is passed to an environment that acts as a ‘black box’ and is by default not differentiable. In this case, we can no longer back-propagate a supervised loss; instead, we make use of a reward signal that simply informs us about the quality of the different courses of action taken in the editing process.

There are many possible reward signals that we could have chosen as a proxy for quality. For this project, we opted for a relatively simple metric based on the CIEDE2000 distance, a color-difference metric defined on the L*a*b* color space that aims to capture the nonlinear human perception of color similarity. What we ultimately care about is how close our results look to the expert's ground truth in the eyes of human users, which calls for a human-centered distance metric. Our initial experiments showed that this reward signal is very well aligned with our subjective estimates of the resulting quality. Although this reward function does not perfectly capture the objective we ultimately care about, it hits a sweet spot between accuracy and computational efficiency.

Depicted above, we see the non-linearity of human perception of color similarity in a MacAdam ellipse. In our reward function, we use the CIEDE2000 metric to account for this disparity when evaluating the effect of proposed adjustment values. Reference
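
To make the idea of a distance in L*a*b* space concrete, here is a small sketch that converts sRGB images to L*a*b* (D65 white point) and averages the per-pixel Euclidean distance. Note that this is the much simpler CIE76 formula, used here only as a stand-in: CIEDE2000 adds lightness, chroma, and hue corrections on top of the same L*a*b* values. The function names are illustrative assumptions.

```python
import numpy as np

# Standard linear-sRGB to XYZ matrix and D65 reference white.
RGB_TO_XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                       [0.2126729, 0.7151522, 0.0721750],
                       [0.0193339, 0.1191920, 0.9503041]])
D65_WHITE = np.array([0.95047, 1.0, 1.08883])

def srgb_to_lab(rgb):
    """Convert sRGB values in [0, 1] with shape (..., 3) to CIE L*a*b*."""
    rgb = np.asarray(rgb, dtype=float)
    # Undo the sRGB gamma curve to get linear light.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    t = (linear @ RGB_TO_XYZ.T) / D65_WHITE
    delta = 6.0 / 29.0
    f = np.where(t > delta ** 3, np.cbrt(t), t / (3.0 * delta ** 2) + 4.0 / 29.0)
    lightness = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([lightness, a, b], axis=-1)

def mean_delta_e76(image_a, image_b):
    """Mean per-pixel CIE76 color difference (a simplification of CIEDE2000)."""
    diff = srgb_to_lab(image_a) - srgb_to_lab(image_b)
    return float(np.mean(np.linalg.norm(diff, axis=-1)))
```

For the full formula, libraries such as scikit-image provide a ready-made `skimage.color.deltaE_ciede2000` that operates on L*a*b* arrays.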

Our implementation.

Our model is provided with an input image ("state" in RL speak) and outputs a vector of adjustment values ("actions"). The image and adjustments are then passed to our photo editor ("environment"), where an edited output image is computed. The reward is based on comparing two color distances: the distance between the output image and the expert's edit, and the distance between the original image and the expert's edit. If the edit decreased the distance, we return a positive reward; if it increased the distance, we return a negative reward. In this way, we can guide our RL model to avoid edits that decrease the aesthetic quality of the image and to focus on adjustments that produce expert-like outputs and come with a strong positive reward.
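
The state/action/environment/reward loop can be sketched as follows. Everything here is a simplified assumption for illustration: `toy_editor` stands in for the real photo editor with a single exposure adjustment, `color_distance` is a plain RGB distance instead of the CIEDE2000-based metric, and the reward is taken to be the raw difference of the two distances (the exact scaling we use is not shown here).

```python
import numpy as np

def color_distance(image_a, image_b):
    """Placeholder distance: mean per-pixel Euclidean distance in RGB."""
    return float(np.mean(np.linalg.norm(image_a - image_b, axis=-1)))

def toy_editor(image, adjustments):
    """Hypothetical black-box environment with one exposure adjustment."""
    exposure = adjustments[0]
    return np.clip(image * 2.0 ** exposure, 0.0, 1.0)

def reward(original, expert_edit, adjustments, editor=toy_editor):
    """Positive if the edit moved the image closer to the expert's version."""
    edited = editor(original, adjustments)
    before = color_distance(original, expert_edit)
    after = color_distance(edited, expert_edit)
    return before - after

original = np.full((4, 4, 3), 0.2)
expert_edit = toy_editor(original, [1.0])          # pretend the expert added one stop
good = reward(original, expert_edit, [1.0])        # moves toward the expert edit
bad = reward(original, expert_edit, [-1.0])        # moves away from it
```

Note that the reward only needs the editor's output image, so the editor itself never has to be differentiable.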

At this point, we needed a way to train our neural network based on this reward function. We used a common actor-critic training setup: for any state and action, a critic network tries to predict the reward returned by the environment. As the critic becomes more and more accurate in judging different actions, it can be used to train the actor. The actor's objective is then simply to maximize the reward predicted by the critic.

Actor-Critic setup of our training architecture. The critic is trained to most accurately predict the reward for any given image and its adjustments. The actor is trained to maximize the predicted reward provided by the critic.
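
The following toy sketch illustrates this training loop under heavy simplifying assumptions: the state is a single brightness value, the action is a single brightness shift, the actor is linear, and the critic is a quadratic model refit by least squares on each batch (a shortcut that only works because this toy reward is exactly quadratic). Our actual actor and critic are neural networks; all names and constants below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = 0.5  # brightness the hypothetical expert always aims for

def environment(state, action):
    """Black-box reward: how much closer the edit moved brightness to TARGET."""
    before = (state - TARGET) ** 2
    after = (state + action - TARGET) ** 2
    return before - after

def critic_features(state, action):
    """Quadratic features of (state, action) used by the toy critic."""
    return np.stack([np.ones_like(state), state, action,
                     state ** 2, state * action, action ** 2], axis=1)

actor_w, actor_b = 0.0, 0.0   # actor: action = actor_w * state + actor_b
critic_theta = np.zeros(6)    # critic: predicted reward = features @ critic_theta

for step in range(500):
    state = rng.uniform(0.1, 0.9, size=64)
    # Act with the current actor plus Gaussian exploration noise.
    action = actor_w * state + actor_b + rng.normal(0.0, 0.1, size=64)
    observed_reward = environment(state, action)

    # Critic update: regress the observed rewards on (state, action) features.
    critic_theta, *_ = np.linalg.lstsq(
        critic_features(state, action), observed_reward, rcond=None)

    # Actor update: gradient ascent on the critic's prediction, never on the
    # environment itself. dq_da is the critic's derivative w.r.t. the action.
    proposed = actor_w * state + actor_b
    dq_da = critic_theta[2] + critic_theta[4] * state + 2.0 * critic_theta[5] * proposed
    actor_w += 0.5 * np.mean(dq_da * state)
    actor_b += 0.5 * np.mean(dq_da)
```

In this toy problem the best edit is `action = TARGET - state`, so the actor should approach `actor_w = -1` and `actor_b = 0.5` even though it never sees the target directly.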


It is helpful to take a step back and look at the advantages provided by this setup:

  • We no longer need to know the ground truth of our desired outputs (remember we only know what our target image looks like, but are interested in uncovering its underlying adjustments). We don't know what actions were taken by the experts in question, but can still learn the actions that can yield similar resultant images.
  • There is no need to translate all operations taken in our photo editor into differentiable programming. We can easily deal with having non-differentiable parts of our ML graph by moving them into this new black box environment.

While these advantages allow us to efficiently learn the model outputs, RL confronts us with the exploration-exploitation dilemma; at each step, we have to decide between acting according to our best guess of what a good action is, or exploring and picking another action to learn more about the problem. We use a standard ε-greedy policy with linearly decaying ε to tackle this issue. At each step, we pick a random action with probability ε, and slowly decrease this chance over time as we become more sure about what constitutes a good action.
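
A minimal sketch of such a policy, assuming adjustment values normalized to [-1, 1] and illustrative start and end values for ε (the schedule we actually used is not specified here). For a continuous action vector, "random action" is interpreted as a uniformly sampled adjustment vector.

```python
import numpy as np

def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Linearly interpolate epsilon from eps_start to eps_end, then hold it."""
    fraction = min(step / total_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def epsilon_greedy_action(actor_action, epsilon, rng, low=-1.0, high=1.0):
    """With probability epsilon explore with a random adjustment vector,
    otherwise exploit the actor's own proposal."""
    if rng.random() < epsilon:
        return rng.uniform(low, high, size=np.shape(actor_action))
    return np.asarray(actor_action)

rng = np.random.default_rng(0)
proposal = np.array([0.2, -0.1, 0.4])  # hypothetical actor output
for step in (0, 5000, 10000):
    eps = linear_epsilon(step, total_steps=10000)
    action = epsilon_greedy_action(proposal, eps, rng)
```

Early in training almost every action is exploratory; by the end the actor's proposal is used most of the time, with a small residual chance of exploration.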

There is another advantage to our RL setup that is a bit less obvious. Even for a relatively small model, the search space of possible adjustment combinations grows quickly with the number of operations we allow. With only the 12 global adjustments for light and color (temperature, saturation, exposure, ...), each with 200 possible values, there are already 200^12 (roughly 4 × 10^27) possible combinations of adjustments. Most of these combinations produce terrible outputs, so training on randomly sampled actions would spend a lot of time learning how to compare very bad options.
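
For a sense of scale, the size of this search space can be computed directly. The evaluation rate below is a purely hypothetical figure, chosen only to show that exhaustive search is out of reach.

```python
levels_per_adjustment = 200
num_adjustments = 12

combinations = levels_per_adjustment ** num_adjustments
print(f"{combinations:.3e}")  # 4.096e+27

# Hypothetical throughput: one million editor evaluations per second.
evaluations_per_second = 1_000_000
seconds_per_year = 60 * 60 * 24 * 365
years_to_enumerate = combinations / evaluations_per_second / seconds_per_year
```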

With reinforcement learning, we can sample actions from our actor, which lets us focus more and more on high-quality actions as training progresses. While some exploration is still needed to make sure the critic remembers what constitutes a ‘bad action’, we can approximate the important areas of the loss landscape much more quickly than by sampling adjustments uniformly from all possible combinations. As we are not aiming to accurately predict the reward of very bad actions, this approach is more sample efficient and allows us to use a smaller critic model with fewer parameters.


With this RL-based training procedure, we were able to train an actor model that almost always predicted high-quality adjustments for our photo editor, within the limited scope of indoor event photography. A big boost in results came from augmenting the unedited originals with a few random adjustments (e.g. adding under-exposed examples to the dataset). In this way, the actor became more robust to low-quality inputs that were not necessarily taken by professional photographers.
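
As an illustration of this kind of augmentation, the sketch below darkens an original by a random number of exposure stops while the expert edit stays the training target. The exposure model (scaling by a power of two), the range of two stops, and the function name are assumptions for the example only.

```python
import numpy as np

def under_expose(image, rng, max_stops=2.0):
    """Return a randomly darkened copy of `image` (floats in [0, 1])
    together with the exposure shift that was applied."""
    stops = rng.uniform(-max_stops, 0.0)  # negative values darken the image
    degraded = np.clip(image * 2.0 ** stops, 0.0, 1.0)
    return degraded, stops

rng = np.random.default_rng(0)
original = rng.uniform(0.0, 1.0, size=(4, 4, 3))  # stand-in for an unedited photo
degraded, applied_stops = under_expose(original, rng)
# The pair (degraded, expert edit of `original`) is then added to the training set.
```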

For Polarr, the best next steps for this exciting project are to explore the possibilities of personalization, where each editor gets their own unique model according to their taste. We also plan to look into more elaborate reward functions, for example using an attention model to penalize color differences in important regions (such as faces) more strongly.

Excited about reinforced creativity? We're looking for engineers! Check out our careers page, or simply drop us a line.