by Haytham ElFadeel - [email protected]
2018
A recurring failure mode in AI research is task isolation: we optimize narrowly scoped problems (language, vision, planning, or control) as if these faculties were separable in the real world. This produces systems that are statistically competent on benchmarks yet brittle under distribution shift, weak at commonsense inference, and poor at connecting perception to action.
A more scientifically grounded path toward generalist AI is to treat intelligence as an integrated perception–action–language loop: learning representations that preserve the structure of the environment, support counterfactual/causal reasoning, and ground symbols in sensorimotor experience.
Many ML pipelines begin by factorizing reality into academically convenient subproblems (language, vision, planning, control) and then training models inside those boundaries.
This decomposition is productive for engineering (it yields tractable benchmarks and modular systems), but it is scientifically risky if the long-term goal is robust, general intelligence. In natural cognition, semantics, perception, and action are mutually constraining: language refers to objects/events in the world; perception is shaped by physics and affordances; and planning depends on causal dynamics and embodied constraints. Treating these as independent encourages systems that learn surface correlations rather than useful structure.
A technical way to express “holism” is to model an agent interacting with an environment, e.g., as a (PO)MDP. The research question becomes:
Can we learn a representation $z_t = f_\theta(o_{\le t}, a_{< t})$ that (i) is predictive of future observations under interventions, (ii) is suitable for planning/control, and (iii) supports grounded semantics for communication?
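As a minimal, purely illustrative sketch (PyTorch assumed; the class name, dimensions, recurrent encoder, and bag-of-words grounding head are placeholder choices, not a proposed architecture), the fragment below folds the history $(o_{\le t}, a_{< t})$ into a single state $z_t$ and attaches one head per requirement: action-conditioned prediction, a value estimate for planning, and a language-grounding score.

```python
import torch
import torch.nn as nn


class IntegratedRepresentation(nn.Module):
    """z_t = f_theta(o_<=t, a_<t) with prediction, planning, and grounding heads."""

    def __init__(self, obs_dim: int, act_dim: int, vocab_size: int, z_dim: int = 128):
        super().__init__()
        # f_theta: fold the observation/action history into a single state vector.
        self.encoder = nn.GRUCell(obs_dim + act_dim, z_dim)
        # (i) action-conditioned prediction of the next observation ("what if I do a?").
        self.dynamics_head = nn.Sequential(
            nn.Linear(z_dim + act_dim, z_dim), nn.ReLU(), nn.Linear(z_dim, obs_dim)
        )
        # (ii) scalar value estimate a planner/controller can query.
        self.value_head = nn.Linear(z_dim, 1)
        # (iii) grounding: score how well a bag-of-words utterance describes z_t.
        self.language_head = nn.Linear(z_dim, vocab_size)

    def step(self, z, obs, prev_act):
        # One recurrent update: z_t = encoder([o_t, a_{t-1}], z_{t-1}).
        return self.encoder(torch.cat([obs, prev_act], dim=-1), z)

    def predict_next_obs(self, z, act):
        # Interventional query: predicted o_{t+1} under a chosen action a_t.
        return self.dynamics_head(torch.cat([z, act], dim=-1))

    def value(self, z):
        return self.value_head(z)

    def ground(self, z, utterance_bow):
        # Log-likelihood of a bag-of-words utterance given the current state.
        log_probs = torch.log_softmax(self.language_head(z), dim=-1)
        return (utterance_bow * log_probs).sum(dim=-1)


# One rollout step, with random tensors standing in for real observations/actions.
model = IntegratedRepresentation(obs_dim=16, act_dim=4, vocab_size=100)
z = torch.zeros(1, 128)                      # initial state z_0
obs, act = torch.randn(1, 16), torch.randn(1, 4)
z = model.step(z, obs, act)                  # z_1 = f_theta(o_1, a_0)
predicted = model.predict_next_obs(z, act)   # (i) prediction under intervention
v = model.value(z)                           # (ii) input to a planner
score = model.ground(z, torch.rand(1, 100))  # (iii) grounded-language score
```

The only point of the sketch is that all three heads read the same $z_t$; dropping any one of them recovers a siloed pipeline.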
This framing makes the cost of siloing explicit:
if you learn language without grounding, or vision without dynamics, the resulting representation is insufficient for planning and causal inference.