I worked with Google’s Nano Banana a bit more over the past few days, and I think I understand what it is doing under the hood.

“Regular” image models predict pixels: you give a prompt, the prompt gets translated into a series of tokens, and the model predicts the best-matching pixels for that token input. The result is a flat “soup of pixels.” Because of that, it is hard to make small adjustments to an image, editing one particular aspect while leaving everything else as is.
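To make that concrete, here is a minimal sketch of the “flat pixel soup” idea. It is purely illustrative: the function names are my own and random noise stands in for a real model’s output. The point is that when the whole image is one undifferentiated array, an “edit” is effectively a full re-generation.

```python
import numpy as np

# Illustrative only: a "flat" generator with hypothetical function names.
# Random noise stands in for whatever a real model would predict.

def generate_flat(prompt_tokens: list[int], height: int = 256, width: int = 256) -> np.ndarray:
    """Produce one undifferentiated pixel array for the whole prompt."""
    rng = np.random.default_rng(abs(hash(tuple(prompt_tokens))) % (2**32))
    return rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)

def edit_flat(image: np.ndarray, edit_tokens: list[int]) -> np.ndarray:
    """With no notion of separate objects, an edit is a full re-generation:
    every pixel may change, so untouched regions are not guaranteed to stay put."""
    height, width = image.shape[:2]
    return generate_flat(edit_tokens, height, width)

original = generate_flat([101, 7, 42])          # toy tokens for some prompt
edited = edit_flat(original, [101, 7, 42, 99])  # a "small" edit still touches every pixel
```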

I suspect Nano Banana works with layers. The model seems to work out which part of the image forms the bottom of the pile (the background) and which elements go on top. As a result, it can make very precise edits to individual objects in the overall composition of the image.
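A rough way to picture such a layer-based representation: a background canvas plus a stack of named objects composited on top, where editing one layer leaves everything else untouched. The Layer and Scene classes below are hypothetical, a sketch of the idea rather than anything Nano Banana actually exposes.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Layer:
    name: str
    pixels: np.ndarray        # RGBA; alpha marks where the object actually is
    offset: tuple[int, int]   # (row, col) placement on the canvas

@dataclass
class Scene:
    background: np.ndarray             # RGB canvas, the "bottom of the pile"
    layers: list[Layer] = field(default_factory=list)

    def composite(self) -> np.ndarray:
        """Paint layers over the background, bottom to top (layers must fit the canvas)."""
        out = self.background.copy()
        for layer in self.layers:
            r, c = layer.offset
            h, w = layer.pixels.shape[:2]
            rgb = layer.pixels[..., :3].astype(float)
            alpha = layer.pixels[..., 3:].astype(float) / 255.0
            region = out[r:r+h, c:c+w].astype(float)
            out[r:r+h, c:c+w] = (alpha * rgb + (1 - alpha) * region).astype(np.uint8)
        return out

    def edit_layer(self, name: str, new_pixels: np.ndarray) -> None:
        """Replace one object; the background and all other layers stay exactly as they were."""
        for layer in self.layers:
            if layer.name == name:
                layer.pixels = new_pixels
```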

To produce a coherent image, the model needs a good understanding of the 3D perspective of the background and of all the objects on top of it. Take the Porsche in a Dutch town from my previous post: the car gets rotated, then pasted back into the background image with the correct vanishing point in mind.
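As an illustration of the geometry involved (not a claim about Nano Banana’s internals), here is a toy pinhole-camera sketch: if an object is rotated in 3D about its own center and re-projected through the same camera as the background, it lands back in the image with perspective that stays consistent with the scene’s vanishing points.

```python
import numpy as np

def rotate_y_about_center(points: np.ndarray, angle_rad: float) -> np.ndarray:
    """Rotate 3D points (N, 3) around the vertical axis through their centroid."""
    center = points.mean(axis=0)
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return (points - center) @ rot.T + center

def project(points: np.ndarray, focal: float = 800.0,
            center: tuple[float, float] = (512.0, 384.0)) -> np.ndarray:
    """Pinhole projection of 3D points (N, 3) to 2D pixel coordinates (N, 2)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    u = focal * x / z + center[0]
    v = focal * y / z + center[1]
    return np.stack([u, v], axis=1)

# A box-shaped stand-in for the car, a few meters in front of the camera.
car = np.array([[x, y, z] for x in (-1.0, 1.0) for y in (0.0, 0.8) for z in (5.0, 7.0)])

# Rotate the object in 3D, then project with the *same* camera as the background,
# so the pasted-back object shares the background's perspective.
reprojected = project(rotate_y_about_center(car, np.deg2rad(30)))
```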

Vanishing point is preserved when making edits to the image

What the model cannot do is change the camera position and view the entire image from a completely different angle. Zooming in and out works. An example is the cover image of this post, where I took an image of my son’s band (Project71) and put them on a big stage. I could not get the model to produce a view from the audience based on the image it had already generated. (Starting from scratch with an explicit prompt for an audience view would have worked, of course.)

Note the small glitch in the keyboard of the synth

This is a limitation I can work with for the moment though.

PS. I work with Nano Banana via Google AI Studio, not via its own website.
