From Bits to Atoms: Synthetic Intelligence in Physical Space
By Harrison Dahme, Hack VC Partner and CTO
Artificial intelligence has made astonishing progress over the past decade, driven in large part by the era of scaling - larger datasets, more parameters, and higher energy draw. But when you put AI into a body - a robot, a drone, a warehouse arm - the problem changes. Now it isn’t just “What’s in this image?” or “What word comes next?” It’s: What is happening in the physical world right now, and what should I do in the next 5 milliseconds so I don’t crash, break, or hurt someone?
We’re in the early stages of synthetic intelligence moving from bits to atoms. We’re no longer just manipulating symbols. We’re asking silicon to interface with the continuous fabric of time, space, matter, and causality. To turn math into motion, safely and repeatably, in a messy universe.
The dominant architectural paradigm for these embodied models is the Vision-Language-Action (VLA) model – transformers extended into the physical world, grounding perception and language in robot behavior.
However, new approaches are emerging. Perhaps the farthest along are Liquid Neural Networks (LNNs) – a newer family of continuous-time architectures designed from the ground up for control, dynamics, and adaptation.
Each solves a different piece of the puzzle. Together, they define how machines will perceive, reason, and act in the real world. For investors, this split hints at where value will accrue: data markets, simulation, middleware, and efficient edge compute.
This piece walks through each architecture, compares their strengths and limitations, and explains why I think the future of robotics will be hybrid rather than dominated by a single model type.
First, Transformers: The Foundation of Modern AI
Transformers are the dominant architecture in modern AI. Introduced in 2017 by Vaswani et al., they became the backbone of large language models (LLMs) and large multimodal models. Conceptually, a transformer takes in a sequence of tokens (words, image patches, sensor readings) and learns relationships between them via attention. Attention lets the model selectively focus on different parts of the input when making predictions.
You can think of a transformer as a very powerful pattern recognizer over sequences - excellent at reading the “subtitles” of reality, even if it doesn’t fully feel the forces underneath. The architecture is also highly conducive to scaling.
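To make “attention over sequences” concrete, here is a minimal, framework-free sketch of single-head scaled dot-product attention in NumPy. The function and variable names are mine, for illustration only, not taken from any specific library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: each query softly selects which
    parts of the sequence to focus on when building its output."""
    d_k = Q.shape[-1]
    # Similarity between every query and every key, scaled for numerical stability.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns similarities into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of the value vectors.
    return weights @ V

# Toy sequence of 4 tokens (words, image patches, or sensor readings),
# each represented as an 8-dimensional embedding.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention
print(out.shape)  # (4, 8): one contextualized vector per input token
```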
That design brings two huge advantages:
- Strong perception and reasoning: Transformers excel at recognizing objects, classifying images, interpreting text, and answering complex questions. They are very good at building high-level representations of the world.
- Massive scale and mature tooling: Data centers, GPUs, compilers, and libraries have all been optimized around attention-based models. This means extremely high training and inference throughput in the cloud.
But transformers have two structural limitations when you try to use them for real-world control on compute-constrained hardware:
- Discrete time, weak dynamics: Transformers see the world as a sequence of snapshots. They can predict the next element, but they do not have a native notion of continuous time or derivatives. They infer motion indirectly from past frames rather than modeling velocity, acceleration, and higher-order dynamics as first-class objects.
- Discrete outputs: Transformers naturally output tokens - discrete symbols. You can make them produce continuous values, but it’s bolted on rather than baked in. For many robotics tasks, you care about smooth, low-level control signals: joint torques, motor voltages, subtle changes in grip force.
For perception, these limitations don’t matter much. For control, they matter a lot.
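To make the “bolted on” point concrete, here is a toy, entirely hypothetical example of how a continuous gripper command can be quantized into discrete action tokens, roughly what many token-based policies do today. The ranges and bin count are illustrative, not taken from any specific system.

```python
import numpy as np

# Hypothetical continuous command: gripper force in newtons.
FORCE_MIN, FORCE_MAX, NUM_BINS = 0.0, 20.0, 256  # illustrative values only

def force_to_token(force_newtons: float) -> int:
    """Quantize a continuous force into one of NUM_BINS discrete action tokens."""
    clipped = np.clip(force_newtons, FORCE_MIN, FORCE_MAX)
    return int((clipped - FORCE_MIN) / (FORCE_MAX - FORCE_MIN) * (NUM_BINS - 1))

def token_to_force(token: int) -> float:
    """Decode a token back to a force; anything between bin centers is lost."""
    return FORCE_MIN + token / (NUM_BINS - 1) * (FORCE_MAX - FORCE_MIN)

print(force_to_token(7.33))                   # 93
print(token_to_force(force_to_token(7.33)))   # ~7.29 N: quantization error
```

The round trip loses resolution, which is tolerable for coarse pick-and-place but a real constraint for subtle grip-force or torque control.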
Vision-Language-Action Models: Extending Transformers Into the Physical World
Vision-Language-Action (VLA) models take the transformer and bolt on the missing pieces needed for robots.
They typically have three parts:
- Vision module: processes camera images (and sometimes depth, LiDAR, or other sensors) to perceive the environment.
- Language module: interprets natural-language instructions and task descriptions.
- Action module: turns the internal representation into low-level control signals for the robot.
In practice, most VLAs today use a transformer backbone to fuse vision and language, then add a learned policy head that outputs actions. The result: a robot that can do things like “pick up the blue mug, not the red one” or “open the drawer halfway,” using the same family of models that power chatbots and image generators.
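As a structural sketch only (not a description of any particular published VLA), the skeleton below shows how a transformer backbone might fuse image patches and instruction tokens before a small policy head emits low-level actions. The dimensions, vocabulary size, and module choices are placeholders.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Illustrative VLA skeleton: fuse vision and language tokens with a
    transformer backbone, then decode actions with a small policy head."""
    def __init__(self, d_model=256, num_actions=7):
        super().__init__()
        self.patch_proj = nn.Linear(768, d_model)        # image patches -> tokens
        self.text_embed = nn.Embedding(32_000, d_model)  # instruction tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.policy_head = nn.Linear(d_model, num_actions)  # e.g. 7-DoF pose deltas

    def forward(self, image_patches, instruction_ids):
        vis = self.patch_proj(image_patches)     # (batch, patches, d_model)
        txt = self.text_embed(instruction_ids)   # (batch, words, d_model)
        fused = self.backbone(torch.cat([vis, txt], dim=1))
        # Pool the fused sequence and map it to a single low-level action.
        return self.policy_head(fused.mean(dim=1))

model = ToyVLA()
actions = model(torch.randn(1, 196, 768), torch.randint(0, 32_000, (1, 12)))
print(actions.shape)  # torch.Size([1, 7])
```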
VLAs bring several important benefits:
- Grounded perception: They tie high-level semantics (“mug,” “drawer,” “halfway”) to specific pixels and sensor readings in the real world.
- Instruction following: They can map natural language instructions directly to sequences of actions, leveraging the vast language understanding packed into LLMs.
- Ecosystem leverage: Because they build on transformers, VLAs inherit mature tooling, pre-trained backbones, and well-understood training recipes.
But they also inherit most of the transformer’s weaknesses for control:
- Implicit physics: VLAs learn trajectories from data, but they do not have a native representation of dynamics. They approximate motion by replaying patterns they’ve seen, not by explicitly tracking how velocity and acceleration evolve in continuous time.
- Discrete or coarse control: Many VLA policies work over discretized time steps and action spaces (e.g., pick from a set of candidate waypoints or pose changes). This can be brittle for fine-grained tasks like precise assembly, surgical robotics, or high-speed flight. The planning policy also needs a built-in sense of frequency: how often to revisit the high-level goal versus make a fine motor adjustment.
- Hardware coupling: Because the action head is trained end-to-end on a specific robot’s data, the resulting policy can be tightly coupled to that chassis: its friction, payload, joint limits, and failure modes. Porting the same model to a different robot is often non-trivial. This can be mitigated somewhat by robust simulation engines and fine-tuning, but even then, the model has no notion of hardware wear and tear.
VLAs, in other words, are excellent brains for seeing and understanding, but they are not yet the ideal substrate for continuous, safety-critical control.
Enter Liquid Neural Networks: A Continuous-Time Model for Dynamics and Control
Liquid Neural Networks (LNNs) start from a different set of assumptions. Rather than treating the world as a sequence of discrete tokens, they come out of the signal-processing and continuous-time systems tradition. The modern work here is heavily influenced by the MIT CSAIL team and Liquid AI’s research.
Instead of just predicting the next value in a sequence, an LNN learns to evolve an internal state over time in response to incoming signals. Critically, the math allows it to approximate not just values, but derivatives:
- Position
- Velocity (first derivative)
- Acceleration (second derivative)
- Jerk (third derivative)
- And other higher-order dynamics
This gives LNNs a native sense of motion and change. They don’t just remember what frames they’ve seen; they maintain an internal dynamical system that tracks how the world is changing right now.
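To give a flavor of this (a heavily simplified sketch, not Liquid AI’s actual formulation), here is a toy continuous-time cell whose hidden state follows an input-dependent differential equation and is integrated with whatever time step the sensor stream actually delivers:

```python
import numpy as np

class LiquidCell:
    """Toy continuous-time cell, loosely inspired by liquid time-constant
    networks. The hidden state x obeys an ODE whose behavior depends on the
    current input, so the cell can be stepped with an arbitrary dt."""
    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
        self.U = rng.normal(scale=0.5, size=(n_hidden, n_inputs))
        self.b = np.zeros(n_hidden)
        self.tau = 1.0               # base time constant, seconds
        self.A = np.ones(n_hidden)   # equilibrium target
        self.x = np.zeros(n_hidden)  # hidden state

    def step(self, u, dt):
        """Advance the hidden state by dt seconds (explicit Euler integration)."""
        gate = np.tanh(self.W @ self.x + self.U @ u + self.b)
        # Input-dependent dynamics: dx/dt = -x / tau + gate * (A - x)
        dx = -self.x / self.tau + gate * (self.A - self.x)
        self.x = self.x + dt * dx
        return self.x

cell = LiquidCell(n_inputs=3, n_hidden=8)
# Irregularly sampled sensor stream: the cell simply integrates over each gap.
for u, dt in [(np.array([0.1, 0.0, 0.2]), 0.010),
              (np.array([0.3, 0.1, 0.0]), 0.004),
              (np.array([0.2, 0.2, 0.1]), 0.020)]:
    state = cell.step(u, dt)
print(state[:4])
```

Because each update integrates over an explicit dt rather than assuming a fixed frame rate, the same cell copes with jittery or variable-rate sensors, which is part of what makes this family attractive for control.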
If transformers are incredible at reading the subtitles of reality, LNNs edge closer to feeling the underlying equations—the continuous curves that describe how things actually move.
That leads to several practical advantages for robotics and other embodied systems:
- Strong modeling of motion and dynamics: LNNs excel in tasks where the exact way something moves matters—flight control, legged locomotion, manipulation under changing loads, and wearables that must adapt to human gait.
- Continuous-time behavior: Because they are built around differential equations, LNNs can naturally handle variable time steps, sensor jitter, and long-horizon stability in a way that sequence models struggle with.
- Edge efficiency: LNNs tend to use far fewer parameters and require less memory movement than transformer decoders. That maps better to small, low-power chips embedded in robots, drones, vehicles, and wearables.
The tradeoffs are real:
- Younger ecosystem: Tooling, libraries, and hardware acceleration for LNNs are years behind transformers. The open-source ecosystem is early.
- Weaker high-level perception: Today’s LNNs do not match the multimodal perception and language capabilities of large transformer-based VLAs.
But where the rubber meets the road—literally—LNNs shine. They handle long-horizon stability, real-world drift, and on-the-fly adaptation far better than transformer-style control policies. If friction changes, a joint starts to wear, or the robot takes on a new payload, an LNN controller can often adapt in a live environment rather than requiring a full re-training loop.
Comparative Snapshot
- Transformers: state-of-the-art perception and reasoning with mature tooling and massive scale, but discrete time, discrete outputs, and no native model of dynamics.
- VLAs: grounded perception and instruction following built on the transformer ecosystem, but implicit physics, coarse control, and tight coupling to specific hardware.
- LNNs: continuous-time dynamics, on-the-fly adaptation, and edge efficiency, but a younger ecosystem and weaker high-level perception and language.
The Emerging Hybrid Stack: VLAs for Perception, LNNs for Control
To me, the most likely near-term architecture is clear:
- Use transformers and VLAs for what they’re great at: perception, language understanding, high-level task planning, and semantic grounding.
- Use LNNs for what they’re great at: continuous-time control, stability, and fast adaptation on the edge.
In this hybrid setup:
- A VLA interprets the scene and the instruction: “Pick up the blue cup on the left, then place it gently on the top shelf.”
- It produces a high-level plan or set of waypoints: where the gripper should go, in what order, and under what constraints.
- An LNN-based controller then turns those waypoints into smooth, safe trajectories that respect joint limits, avoid collisions, and adapt in real time to small shifts in mass, friction, or pose. (A minimal sketch of this handoff follows the list.)
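Here is that sketch, assuming hypothetical interfaces; the VLA and LNN are replaced by stand-ins (a fixed plan and a clamped proportional step) purely to show how the slow planning loop and the fast control loop separate.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Waypoint:
    """High-level target emitted by the perception/planning (VLA) layer."""
    position: np.ndarray   # desired gripper position, meters
    max_speed: float       # planner-imposed constraint, m/s

def vla_plan(scene_image, instruction):
    """Stand-in for the VLA: returns a coarse sequence of waypoints."""
    return [Waypoint(np.array([0.4, 0.1, 0.3]), max_speed=0.2),
            Waypoint(np.array([0.4, 0.1, 0.6]), max_speed=0.1)]

def controller_track(current_pos, waypoint, dt):
    """Stand-in for the continuous-time controller: one smooth, bounded step
    toward the waypoint that respects the planner's speed constraint."""
    error = waypoint.position - current_pos
    step = np.clip(error, -waypoint.max_speed * dt, waypoint.max_speed * dt)
    return current_pos + step

pos = np.array([0.0, 0.0, 0.2])
for wp in vla_plan(scene_image=None, instruction="place the cup on the shelf"):
    # The inner loop runs at control rate (here 200 Hz); the planner runs far slower.
    for _ in range(700):
        pos = controller_track(pos, wp, dt=0.005)
print(np.round(pos, 3))  # ends near the final waypoint: [0.4 0.1 0.6]
```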
This division of labor has several advantages:
- Safety: The continuous-time controller can enforce hard constraints (no joint overextension, no collisions, stable contact forces) even if the high-level model proposes an aggressive or slightly off-target move.
- Portability: You can swap out hardware—different arms, grippers, mobile bases—while reusing much of the perception stack, because the LNN layer handles the specifics of dynamics.
- Scalability: Large batched perception workloads stay in the data center (where transformers shine), while latency-sensitive control stays on-device (where LNNs shine).
This is how many complex engineered systems evolve over time: not toward a single monolithic model that does everything, but toward specialized components orchestrated in a pipeline.
Five Implications for Crypto × AI Investing
At Hack VC, we invest at the intersection of crypto and AI. The divergence between transformers, VLAs, and LNNs isn’t just an academic debate about architectures; it points to where value is likely to accrue over the next decade.
All of this is unfolding while the economic ground is shifting under our feet - knowledge work cycles compressing, capital and computation compounding, and massive reindustrialization efforts applying the same scaling laws to atoms that we’ve applied to bits.
Here are five concrete implications we’re tracking.
1. Data Becomes More Valuable - and More Fragmented
- VLAs hunger for multimodal, task-labeled perception data: video of robots operating in real environments, aligned with language instructions and success/failure outcomes.
- LNNs hunger for high-resolution dynamics data: joint torques, sensor streams, contact events, failures, and long-run telemetry under varied conditions.
This creates:
- Specialized data providers: robotics companies, simulation engines, and telemetry networks that own unique datasets about physical interaction.
- Incentive-aligned data markets: decentralized data networks, tokenized telemetry, and programmable royalties become natural ways to crowdsource and price this data, especially when data must be shared without ceding full control.
There is a lack of high-fidelity, scalable kinetic model data. Some firms are sidestepping this by training on video feeds and supplementing with teleoperation. However, this requires massive hardware/software co-design and integration. There’s an analogy to Apple here – but for this purpose, we’re more interested in the Android equivalent: open, modular ecosystems.
2. Inference Moves to the Edge
Many real-world systems cannot afford a round trip to the cloud for safety-critical decisions. A drone dodging a tree, or a robot reacting to a human in its workspace, needs sub-10 ms reactions.
- LNNs line up directly with this shift: low-parameter controllers running on small, cheap chips at the edge, with limited power draw.
- Investable wedge: companies building efficient edge inference stacks, from custom silicon to runtime libraries tuned for continuous-time models.
Crypto primitives can complement this with verifiable logs and attestations for what ran on-device—useful for safety, insurance, and regulation.
3. Hybrid Architectures Create a Middleware Layer
If VLAs and LNNs are separate components, someone has to manage:
- How perception outputs (Keyframes? Waypoints? Goals? Constraints?) flow into control.
- How updates are rolled out safely across fleets.
- How data from edge devices is aggregated, filtered, potentially obfuscated for privacy and data residency concerns, and turned back into training signal.
This opens space for orchestration and middleware platforms that:
- Standardize interfaces between perception stacks and control stacks.
- Manage fleet-level updates, A/B tests, rollbacks, and safety gates – all with verifiability and monitoring for anomalous behavior (the CrowdStrike outage is a reminder of what a bad fleet-wide update can do).
- Offer APIs for plug-and-play modules (swap in a better LLM, a better VLA, or a new LNN controller without rewriting everything); a minimal interface sketch follows.
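Here is a minimal sketch of what such a standardized contract could look like, using hypothetical interface names; this is an assumption about the shape of the middleware layer, not an existing API.

```python
from typing import Protocol, Sequence
import numpy as np

class PerceptionModule(Protocol):
    """Anything that turns raw observations plus an instruction into goals."""
    def plan(self, observation: np.ndarray, instruction: str) -> Sequence[np.ndarray]: ...

class ControlModule(Protocol):
    """Anything that tracks a goal at control rate on the target hardware."""
    def act(self, state: np.ndarray, goal: np.ndarray, dt: float) -> np.ndarray: ...

def run_step(perception: PerceptionModule, controller: ControlModule,
             observation: np.ndarray, state: np.ndarray,
             instruction: str, dt: float) -> np.ndarray:
    """Middleware glue: the same loop works regardless of which VLA or LNN
    implementation is plugged in behind the interfaces."""
    goals = perception.plan(observation, instruction)
    return controller.act(state, goals[0], dt)
```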
4. Crypto-Native Coordination for Robot Fleets
As robots become networked economic actors, crypto primitives map surprisingly well to what’s needed:
- Identity and reputation for individual robots and operators.
- Payments and micropayments for completed tasks, shared data, and services. Today it’s pay-per-click; with agents, it becomes pay-per-crawl or pay-per-vector-lookup. We think x402 will dominate here.
- Verifiable logging and audit trails for actions taken, especially in regulated environments.
- Byzantine fault tolerant (BFT) security so autonomous swarms can reach consensus even when some members fail or misbehave.
Hybrid VLA + LNN systems will amplify this by increasing the number and diversity of autonomous agents. Decentralized coordination—across warehouses, cities, or even nations—starts to look much more like a crypto problem than a traditional SaaS problem.
5. Winning Teams Will Understand Both Models and Hardware
Many robotics teams today still default to “just use a simulation engine and a transformer for everything.” That will work for demos and narrow domains, but as LNNs and other continuous-time architectures mature, the edge will shift toward teams that:
- Know how to combine VLAs and LNNs.
- Understand hardware constraints—latency, power, friction, wear—and can design controllers around them.
- Build full-stack systems where perception, control, data, and deployment are co-designed from day one.
For investors, diligence will increasingly involve asking not just “What’s your model?” but “How does your architecture map onto the physics and hardware of your problem space?”
Conclusion
Transformers set the standard for perception and abstract reasoning. Vision-Language-Action models extend that power into embodied tasks, connecting what a robot sees and hears to what it does. Liquid Neural Networks deliver the continuous-time dynamics and adaptability that real-world control demands.
No single architecture will win outright. The most capable systems will be hybrids: transformers and VLAs for seeing and understanding, LNNs for moving and adapting.
At a deeper level, this is what it looks like when math and matter start to rhyme: one family of models learning the symbols of the world, another learning how those symbols touch physics.
That shift will create new categories of opportunity: robotics foundation models, dynamics-focused controllers, simulation engines, data networks, and edge intelligence platforms—plus the crypto infrastructure to coordinate them all.
At Hack VC, we believe the most important breakthroughs will come from teams that combine advanced perception with continuous, adaptive control, that treat security and regulation as first principles, and that design explicitly for real-world constraints. If you are building toward that future, we’d love to hear from you.
Disclaimer
The information herein is for general information purposes only and does not, and is not intended to, constitute investment advice and should not be used in the evaluation of any investment decision. Such information should not be relied upon for accounting, legal, tax, business, investment, or other relevant advice. You should consult your own advisers, including your own counsel, for accounting, legal, tax, business, investment, or other relevant advice, including with respect to anything discussed herein.
This post reflects the current opinions of the author(s) and is not made on behalf of Hack VC or its affiliates, including any funds managed by Hack VC, and does not necessarily reflect the opinions of Hack VC, its affiliates, including its general partner affiliates, or any other individuals associated with Hack VC. Certain information contained herein has been obtained from published sources and/or prepared by third parties and in certain cases has not been updated through the date hereof. While such sources are believed to be reliable, neither Hack VC, its affiliates, including its general partner affiliates, or any other individuals associated with Hack VC are making representations as to their accuracy or completeness, and they should not be relied on as such or be the basis for an accounting, legal, tax, business, investment, or other decision. The information herein does not purport to be complete and is subject to change and Hack VC does not have any obligation to update such information or make any notification if such information becomes inaccurate.
Past performance is not necessarily indicative of future results. Any forward-looking statements made herein are based on certain assumptions and analyses made by the author(s) in light of their experience and perception of historical trends, current conditions, and expected future developments, as well as other factors they believe are appropriate under the circumstances. Such statements are not guarantees of future performance and are subject to certain risks, uncertainties, and assumptions that are difficult to predict.