Sunday, December 31, 2017

CapsNet, capsules, vision as 3D-reconstruction and re-rendering, and mainstream approval of the ideas and insights of Boris Kazachenko and Todor Arnaudov

First impressions of Hinton et al.'s "Capsules"/CapsNet update to the convolutional NN (CNN), which became popular recently with their latest paper on dynamic routing.

1. Hinton endorses Boris Kazachenko's old claim and criticism of ANNs from his Cognitive Algorithm (CogAlg) writings: that the coordinates of the input should be preserved, and that discarding them is one of the CNN/ANN design faults.

2. "Dynamic routing" sounds to me like their way to generate "new syntax" in CogAlg terms, i.e. different ways of evaluating the input. Boris disagreed, though: he corrected me that it maps to his "skipping" (of levels).

3. The intended focus on particular smaller-region features per "capsule"/"group of neurons" ~ (mini-)columns reminds me of Numenta/Jeff Hawkins' approach, i.e.: a) a cortical algorithm: a structure of functional modules, not just "neurons"; b) higher modularity.

All of the above seem like steps toward finer granularity of the patterns that the systems would model.

4. Besides, if I understand correctly, Hinton agrees with my claim/observation from early 2012 that vision (object recognition) is ultimately 3D-reconstruction* and comparison of normalized 3D-models at various levels of detail: "inverse graphics".

My view* is that "understanding" is the ability of the system to re-render what it sees with adjusted or changed parameters, which in their terms seems to map to keeping the "equivariance" (or "match" in CogAlg terms), or, as I see it, to simulating/traversing the pattern in the space of its possible states.

That's according to: "Does the brain do Inverse graphics", published on YouTube on 25.08.2015, a recording of a lecture at a "Graduate summer school", Toronto, 12.7.2012, from:

Slides by Kyuhwan Jung, 9/11/2017: ...p.8: "...We need equivariance, not invariance"
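
The difference between the two terms can be shown on a toy 1D example. This is only a minimal sketch, not anything from the capsules papers: the signal, the kernel and the helper names (`shift`, `correlate`) are made up for illustration, and a circular shift stands in for a change of viewpoint. The feature map moves together with the input (equivariance), while a global maximum over it stays the same and no longer says where the feature is (invariance).

```python
def shift(signal, k):
    """Circularly shift a list to the right by k positions."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def correlate(signal, kernel):
    """Circular cross-correlation: one feature value per input position."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


signal = [0, 0, 1, 2, 1, 0, 0, 0]   # hypothetical input with one "bump"
kernel = [1, 2, 1]                  # hypothetical bump detector

features = correlate(signal, kernel)
shifted_features = correlate(shift(signal, 3), kernel)

# Equivariance: shifting the input shifts the feature map by the same amount.
equivariant = shifted_features == shift(features, 3)

# Invariance: the pooled value is unchanged, but the position is discarded.
invariant = max(shifted_features) == max(features)

# The position is still recoverable from the equivariant feature map.
where_before = features.index(max(features))
where_after = shifted_features.index(max(shifted_features))

print(equivariant, invariant, where_before, where_after)
```

The pooled value alone cannot be "re-rendered" at the new position, while the full feature map can.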

* To me it's supposed to be obvious; I think it's obvious to cognitive psychologists (Hinton mentions the mental rotation tests), to artists, to researchers, and to those who study human vision and optical illusions.

Another earlier article of mine from 1.1.2012:

 Colour Optical Illusions are the Effect of the 3D-Reconstruction and Compensation of the Light Source Coordinates and Light Intensity in an Assumed 2D Projection of a 3D Scene


However, it wasn't obvious, for example, in the AGI community (see below), or to anyone doing messy ANNs where there's no reconstruction, only "weights" and "convolutions". Everybody was talking about "invariance".

** Boris' comment on capsules on his site:

"Actually, recently introduced “capsules” also output multivariate vectors, similar to my patterns. But their core input is a probability estimate from unrelated method: CNN, while all variables in my patterns are derived by incrementally complex comparison. In a truly general method, the same principles must apply on all stages of processing. And additional variables in capsules are only positional, while my patterns also add differences between input variables. That can’t be done in capsules because differences are not computed by CNN."


Archive from the AGI List from the year 2012

At that time "invariance" was a buzzword on the AGI email list. See more below in the digest I prepared from 4 threads from that era, back in 2012. I haven't visited that place for a long time; the emails should be there if it's still active.

1. Generalization – Food and Buildings, 1/2012
2. General Algorithms or General Programs, 4/2012
3. Generalization - Chairs and Stools, 10/2012
4. Caricatures, 5/2012

Read in:  Chairs, Caricatures and Object Recognition as 3D-reconstruction (2012)

The 4th email from the "General algorithms..." thread:

Todor Arnaudov Fri, Apr 27, 2012 at 1:12 AM

I don't know if anyone in this discussion has realized that "invariance" in vision is actually just a

- 3D-reconstruction of the scene, including light source and the objects

- Also colours/shades and the textures (local/smaller higher resolution models) are available (for discrimination based on this, may be quicker/needed for objects which are otherwise geometrically matched)

[+ 16-7-2013 - conceptual “scene analysis”, “object recognition” involves some relatively arbitrary, or just flexible, selection criteria for the level of generalization for the usage of words to name the “items” in the scene. To Do: devise experiments with ambiguous objects/scenes, sequences. … see “top-down”, … emails 9, 14, 15]

If the normalized 3D-models (preferably in absolute dimensions), the lights and the recovered original textures/colours (taking into account light and reflection) are available, everything can be compared perfectly; this doesn't require anything special, no "probabilities" or the like. Most of the time the textures and the light don't even alter the essential information: the 3D-geometric structure.

"2D" is just a crippled 3D

"Invariants" in fully functional human vision are just those 3D-models (or their components, "voxels") built in a normalized space. The easiest approach for quick comparison is voxels; it might be something mixed with triangles, and of course textures and colours also participate.

Every 3D-model has a normalized position per its basis, and also some characteristic division of major planes and position between the major planes, and there are "intuitive" ways to set the basis --> gravity/the ground plane foundations, which is generalized to "bottom", i.e.:

-- The "bottom" of an object, which faces the ground, is the part of the image of the object which projects on the "bottom" of the scanlines of the retina, because that's inferred for the first objects, which always have stable touch with the "ground".

When generalizing or specializing, the resolution of the 3D-models to be compared is changed (see the thread where I gave an example of how the concept of a "building" is produced); at a particular stage every two 3D-models match, and eventually they converge to a cube or a plane.
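
A minimal sketch of that idea, under loud assumptions: the models are already normalized into a common 4x4x4 voxel grid, the toy `stool` and `chair` shapes and the helpers `downsample` and `overlap` are hypothetical, and the match measure is a plain Jaccard index, chosen only for brevity. Lowering the resolution makes the two different models coincide, and lowering it further collapses both to a single cube.

```python
def downsample(voxels, factor):
    """Lower the resolution: merge factor x factor x factor blocks of voxels."""
    return {(x // factor, y // factor, z // factor) for (x, y, z) in voxels}


def overlap(a, b):
    """Jaccard index of two voxel sets: 1.0 means a perfect match."""
    return len(a & b) / len(a | b)


def stool():
    """Toy model: a 4x4 seat at height 2 on four corner legs."""
    seat = {(x, y, 2) for x in range(4) for y in range(4)}
    legs = {(x, y, z) for x in (0, 3) for y in (0, 3) for z in (0, 1)}
    return seat | legs


def chair():
    """Toy model: the stool plus a low back along one edge of the seat."""
    back = {(x, 3, 3) for x in range(4)}
    return stool() | back


fine_match = overlap(stool(), chair())                                  # below 1.0
coarse_match = overlap(downsample(stool(), 2), downsample(chair(), 2))  # 1.0
single_cube = downsample(chair(), 4)                                    # one voxel

print(fine_match, coarse_match, single_cube)
```

At full resolution the back distinguishes the chair from the stool; at half resolution the two are the same model.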

IMO the brain is in fact not very good at further mental rotation of those models. Yes, we know those IQ tests, but humans do them very slowly, and the tests consist of very few crossing planes, because otherwise it gets too complex.

Con: "How can you say that it's "just" 3D-reconstruction? That's so complex!"

- Well, one may think so only if she is not familiar with triangulation (photogrammetry dates back to the 19th century) and/or the spectacular work of Marc Pollefeys.
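
The simplest case of triangulation fits in a few lines. The sketch below assumes an idealized setup that is not described in the email: two identical pinhole cameras looking along the same axis, separated only along x by a known baseline (a rectified stereo pair), with the focal length given in pixels and image coordinates measured from the principal point. The function names and the numbers are hypothetical.

```python
def project(point, focal_px, camera_x=0.0):
    """Pinhole projection of a 3D point into a camera placed at (camera_x, 0, 0)."""
    x, y, z = point
    return (focal_px * (x - camera_x) / z, focal_px * y / z)


def triangulate(x_left, x_right, y, focal_px, baseline):
    """Recover the 3D point from its two image positions (rectified stereo)."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("the point must be in front of both cameras")
    z = focal_px * baseline / disparity          # depth from similar triangles
    return (x_left * z / focal_px, y * z / focal_px, z)


# Hypothetical scene point and camera parameters.
point = (0.5, 0.2, 4.0)
focal_px, baseline = 800.0, 0.1

x_left, y = project(point, focal_px)
x_right, _ = project(point, focal_px, camera_x=baseline)
recovered = triangulate(x_left, x_right, y, focal_px, baseline)

print(recovered)  # close to the original point, up to floating-point error
```

Real reconstruction also has to find the matching points and calibrate the cameras, which this sketch takes as given.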

"How do you recognize that this is your chair, if it's upside down and you haven't seen it before"

As with the mistakes about generalization: a "chair" is a generalized concept, not a pixel-by-pixel image; rough 3D-models are compared to find a match. A match means that the size relations of the boxes and planes, the colour (after light correction) and the texture agree to a high degree, and in more respects, with those of the chair from the previous day than with the boxes, planes, etc. of "chairs" found elsewhere, or of any other "objects".

A "chair" [a stool] generally is just:

-- A plane which is perpendicular to the "ground" direction vector, which is a vector parallel to "gravity", i.e. the direction in which objects go when left without support;
- "support" is a vector consisting of a "solid" connection (of forces, impacts) to the "ground" which, when it exists, prevents objects from getting closer to the "ground" (falling);
- the "ground" is a plane where objects stop their motion (changes of coordinates between subsequent samples) if left without support or impact by other moving "things", etc.
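
The first condition above can be written as a small geometric test. This is a sketch only: it assumes the candidate seat plane and the gravity direction have already been recovered as 3D vectors, it covers just the "perpendicular to gravity" part and not "support" or "ground", and the function names and the 10-degree tolerance are arbitrary choices for illustration.

```python
import math


def angle_to_gravity_deg(normal, gravity=(0.0, 0.0, -1.0)):
    """Angle in degrees between a plane's normal and the gravity axis."""
    dot = sum(n * g for n, g in zip(normal, gravity))
    norms = math.sqrt(sum(n * n for n in normal)) * math.sqrt(
        sum(g * g for g in gravity)
    )
    # abs(): the normal may point up or down; min(): guard against rounding.
    cos_angle = min(1.0, abs(dot) / norms)
    return math.degrees(math.acos(cos_angle))


def is_seat_like(normal, gravity=(0.0, 0.0, -1.0), tolerance_deg=10.0):
    """A plane is perpendicular to gravity when its normal is parallel to it."""
    return angle_to_gravity_deg(normal, gravity) <= tolerance_deg


print(is_seat_like((0.0, 0.0, 1.0)))   # flat seat
print(is_seat_like((0.0, 0.1, 1.0)))   # slightly tilted seat
print(is_seat_like((1.0, 0.0, 0.0)))   # vertical surface, e.g. a back
```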

Most chairs can be reduced to a few solids and still be recognizable.

AGI is way simpler than it seems.