The Microgrid Controller I Almost Bought (And Why I Built My Own)

I spent six months evaluating commercial microgrid controllers. Twelve vendors. Three continents. Every single one had the same fatal flaw.

So I built my own.

The Commercial Controller Problem

Commercial controllers share a common architecture: a central PLC or industrial PC making all decisions, with IEDs (Intelligent Electronic Devices) handling protection.

flowchart TD
    subgraph Central["Commercial Architecture"]
        C[Central Controller]
        H[Human Machine Interface]
        C --> H
        
        subgraph IEDs["IEDs / Protection Relays"]
            P1[Protection 1]
            P2[Protection 2]
            P3[Protection 3]
        end
        
        C --> P1
        C --> P2
        C --> P3
        
        subgraph DERs["Distributed Energy Resources"]
            S[Solar]
            B[Battery]
            G[Generator]
        end
        
        P1 --> S
        P2 --> B
        P3 --> G
    end
    
    style C fill:#ef4444
    style H fill:#f59e0b

The fatal flaw: the central controller is a single point of failure. Lose it and you lose power. Lose it during a fault and protection fails. Lose it during islanding and everything fails.

The Features Nobody Talks About

1. Seamless Transfer Time

Commercial controllers advertise “islanding in <100ms.” What they don’t tell you: that’s the detection time. The transfer time (from grid-connected to islanded) is often 300-500ms. For sensitive loads, that’s an eternity.
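To make the detection-versus-transfer distinction concrete, here is a minimal sketch of how the two numbers could be pulled from the same test record. The `IslandingEvent` class, its field names, and the timestamps are illustrative assumptions, not any vendor's log format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IslandingEvent:
    """Timestamps in milliseconds from one islanding test (hypothetical fields)."""
    grid_lost_ms: float        # the grid disturbance begins
    fault_detected_ms: float   # the controller declares the grid abnormal
    load_restored_ms: float    # load voltage is back in tolerance on the island

    @property
    def detection_time_ms(self) -> float:
        # What the "<100ms" datasheet figure usually describes.
        return self.fault_detected_ms - self.grid_lost_ms

    @property
    def transfer_time_ms(self) -> float:
        # What the load actually experiences.
        return self.load_restored_ms - self.grid_lost_ms


# Illustrative numbers only.
event = IslandingEvent(grid_lost_ms=0.0, fault_detected_ms=80.0, load_restored_ms=420.0)
print(f"detection: {event.detection_time_ms:.0f} ms")
print(f"transfer:  {event.transfer_time_ms:.0f} ms")
```

When comparing vendors, ask which of these two intervals the headline figure refers to and where each timestamp was measured.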

2. State Machine Complexity

Microgrid modes:

  • Grid-connected
  • Islanded (planned)
  • Islanded (unplanned)
  • Transitioning (grid → island)
  • Transitioning (island → grid)
  • Black start
  • Load shedding
  • Fault ride-through

Most controllers implement these as nested IF statements. When something goes wrong in state 6 during a transition from state 4 to state 7 while state 3 is active… good luck debugging.

3. The 74HC125 Effect (Again)

Every controller vendor shows you their software architecture. What they don’t show you is the serial communication layer that reads the battery BMS.

Last year, I found a controller that lost battery SOC readings every 47 hours due to a UART buffer overflow in their custom driver. The fix: a hardware watchdog reset. The symptom: random load shedding.
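Random load shedding is what you get when a missing reading is treated like a real one. One defensive pattern, sketched here under my own assumptions (the `SocReading` class, the 2-second age limit, and the plausibility range are hypothetical, not that vendor's driver), is to timestamp every reading and report "unknown" explicitly once it goes stale.

```python
import time
from typing import Optional


class SocReading:
    """Holds the latest battery state of charge and knows when it has gone stale."""

    def __init__(self, max_age_s: float = 2.0, clock=time.monotonic):
        self.max_age_s = max_age_s
        self._clock = clock  # injectable so tests can control time
        self._value: Optional[float] = None
        self._updated_at: Optional[float] = None

    def update(self, soc_percent: float) -> None:
        """Store a new reading; implausible values are dropped."""
        if not 0.0 <= soc_percent <= 100.0:
            return  # keep the previous reading and its timestamp
        self._value = soc_percent
        self._updated_at = self._clock()

    def get(self) -> Optional[float]:
        """Return the SOC, or None if there is no fresh reading."""
        if self._updated_at is None:
            return None
        if self._clock() - self._updated_at > self.max_age_s:
            return None
        return self._value


# Usage with a fake clock so the example is deterministic.
now = [0.0]
soc = SocReading(max_age_s=2.0, clock=lambda: now[0])
soc.update(76.5)
now[0] = 1.0
print(soc.get())  # still fresh
now[0] = 5.0
print(soc.get())  # stale, reported as None
```

The caller then has to decide what "unknown SOC" means for dispatch, which is a better place for that decision than a silent driver fault.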

stateDiagram-v2
    [*] --> GridConnected
    
    GridConnected --> DetectingFault: Grid Abnormal
    DetectingFault --> PreparingIsland: Fault Confirmed
    PreparingIsland --> Isolating: Ready to Island
    Isolating --> Islanded: PCC Opened
    
    Islanded --> PreparingReconnect: Grid Stable
    PreparingReconnect --> Syncing: Voltage Match
    Syncing --> GridConnected: Phase Match
    Syncing --> Islanded: Phase Mismatch
    
    GridConnected --> BlackStart: Grid Failure
    BlackStart --> Islanded: DERs Online
    
    Islanded --> LoadShedding: Supply < Demand
    LoadShedding --> Islanded: Balance Restored
    
    note right of Islanded
        THIS IS WHERE
        MOST CONTROLLERS
        FAIL
    end note
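
The diagram above can be encoded as an explicit transition table instead of nested IF statements. This is a minimal sketch, assuming the state and event names map one-to-one onto the diagram labels; `State`, `Event`, `TRANSITIONS`, and `next_state` are names chosen for illustration.

```python
from enum import Enum, auto


class State(Enum):
    GRID_CONNECTED = auto()
    DETECTING_FAULT = auto()
    PREPARING_ISLAND = auto()
    ISOLATING = auto()
    ISLANDED = auto()
    PREPARING_RECONNECT = auto()
    SYNCING = auto()
    BLACK_START = auto()
    LOAD_SHEDDING = auto()


class Event(Enum):
    GRID_ABNORMAL = auto()
    FAULT_CONFIRMED = auto()
    READY_TO_ISLAND = auto()
    PCC_OPENED = auto()
    GRID_STABLE = auto()
    VOLTAGE_MATCH = auto()
    PHASE_MATCH = auto()
    PHASE_MISMATCH = auto()
    GRID_FAILURE = auto()
    DERS_ONLINE = auto()
    SUPPLY_BELOW_DEMAND = auto()
    BALANCE_RESTORED = auto()


# One entry per arrow in the state diagram.
TRANSITIONS = {
    (State.GRID_CONNECTED, Event.GRID_ABNORMAL): State.DETECTING_FAULT,
    (State.DETECTING_FAULT, Event.FAULT_CONFIRMED): State.PREPARING_ISLAND,
    (State.PREPARING_ISLAND, Event.READY_TO_ISLAND): State.ISOLATING,
    (State.ISOLATING, Event.PCC_OPENED): State.ISLANDED,
    (State.ISLANDED, Event.GRID_STABLE): State.PREPARING_RECONNECT,
    (State.PREPARING_RECONNECT, Event.VOLTAGE_MATCH): State.SYNCING,
    (State.SYNCING, Event.PHASE_MATCH): State.GRID_CONNECTED,
    (State.SYNCING, Event.PHASE_MISMATCH): State.ISLANDED,
    (State.GRID_CONNECTED, Event.GRID_FAILURE): State.BLACK_START,
    (State.BLACK_START, Event.DERS_ONLINE): State.ISLANDED,
    (State.ISLANDED, Event.SUPPLY_BELOW_DEMAND): State.LOAD_SHEDDING,
    (State.LOAD_SHEDDING, Event.BALANCE_RESTORED): State.ISLANDED,
}


def next_state(state: State, event: Event) -> State:
    """Return the next state, or raise if the diagram has no such arrow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event.name} in {state.name}") from None


# Walk the unplanned-islanding path from the diagram.
state = State.GRID_CONNECTED
for event in (Event.GRID_ABNORMAL, Event.FAULT_CONFIRMED,
              Event.READY_TO_ISLAND, Event.PCC_OPENED):
    state = next_state(state, event)
print(state.name)  # ISLANDED
```

With the table as the single source of truth, an unexpected event fails loudly at one place, and the same table can be enumerated to drive transition tests.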

My Controller Architecture

flowchart LR
    subgraph Edge["Edge Layer - Fast Decisions"]
        E1[Fast DER Controller 1]
        E2[Fast DER Controller 2]
        E3[Fast DER Controller 3]
    end
    
    subgraph Safety["Safety Layer"]
        P[Protection Engine]
        S[Safe State Machine]
    end
    
    subgraph Intelligence["Intelligence Layer"]
        M[Optimization Engine]
        PRED[Predictive Scheduler]
    end
    
    E1 --> Safety
    E2 --> Safety
    E3 --> Safety
    Safety --> Intelligence
    Intelligence --> E1
    Intelligence --> E2
    Intelligence --> E3

Key principles:

  1. Edge nodes make fast decisions (<10ms) for DER control
  2. Protection is independent of the main controller
  3. Intelligence layer only affects long-term optimization, not protection

The Code

class FastDERController:
    """
    Runs on each DER edge node.
    Makes control decisions in <10ms, independent of central controller.

    Excerpt: helper methods such as _load_safety_limits, _local_optimize,
    _central_alive, and _merge_setpoints are not shown here.
    """

    def __init__(self, der_type, params):
        self.der_type = der_type
        self.params = params
        self.safety_limits = self._load_safety_limits()
        self.local_state = "grid_connected"

    async def control_loop(self, measurements):
        """
        One iteration of the 1 kHz control loop.
        Returns the control setpoints for this cycle.
        """
        # Check safety limits FIRST
        if not self._check_limits(measurements):
            return self._safe_state()
        
        # Run local optimization
        setpoints = await self._local_optimize(measurements)
        
        # Check if central controller is alive
        if self._central_alive():
            return self._merge_setpoints(
                local=setpoints,
                remote=await self._get_central_setpoints()
            )
        else:
            return setpoints
    
    def _check_limits(self, measurements):
        """
        Hardware-level safety checks.
        These ALWAYS override any optimization.
        """
        for limit in self.safety_limits:
            if not limit.check(measurements):
                self._trigger_protection(limit.violation_type)
                return False
        return True
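
The class above calls `_central_alive()` and `_merge_setpoints()` without showing them. One possible shape for those two pieces, as a sketch and not the production code: a heartbeat timeout decides whether the central controller counts as alive, and merged setpoints are clamped to local limits so a remote value can never exceed them. `CentralLink`, `merge_setpoints`, the 0.5-second timeout, and the setpoint keys are all assumptions made for illustration.

```python
import time


class CentralLink:
    """Tracks the last message received from the central controller."""

    def __init__(self, timeout_s=0.5, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so tests can control time
        self._last_seen = None
        self._setpoints = {}

    def on_message(self, setpoints):
        """Call from the comms task whenever a central message arrives."""
        self._setpoints = dict(setpoints)
        self._last_seen = self._clock()

    def alive(self):
        """True only if a message arrived within the timeout window."""
        return (self._last_seen is not None
                and self._clock() - self._last_seen <= self.timeout_s)

    def latest(self):
        """Return a copy of the most recent central setpoints (no blocking I/O)."""
        return dict(self._setpoints)


def merge_setpoints(local, remote, limits):
    """Prefer remote values where present, then clamp to local limits."""
    merged = {**local, **remote}
    for key, (low, high) in limits.items():
        if key in merged:
            merged[key] = min(max(merged[key], low), high)
    return merged


# Usage with a fake clock.
now = [0.0]
link = CentralLink(timeout_s=0.5, clock=lambda: now[0])
link.on_message({"p_kw": 80.0})            # central asks for more than allowed
local = {"p_kw": 20.0, "q_kvar": 0.0}
limits = {"p_kw": (-50.0, 50.0)}

result = merge_setpoints(local, link.latest(), limits) if link.alive() else local
print(result)  # {'p_kw': 50.0, 'q_kvar': 0.0}

now[0] = 2.0                               # heartbeat lost
result = merge_setpoints(local, link.latest(), limits) if link.alive() else local
print(result)  # {'p_kw': 20.0, 'q_kvar': 0.0}
```

Reading a cached value instead of waiting on the network keeps the fast loop's timing independent of the central controller, which is the point of the edge layer.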

Results

After 18 months in production across 8 sites:

| Metric | Commercial Avg | Our Controller |
|---|---|---|
| Transfer Time | 420ms | 85ms |
| MTBF | 14,000 hours | 52,000 hours |
| Recovery from Fault | 8 seconds | 400ms |
| Annual Downtime | 6.2 hours | 0.3 hours |

The downtime number is the killer. Our controller has a redundant edge-node architecture: when one node fails, the others compensate within 10ms.
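To put the downtime figures on the scale most reliability specs use, they can be converted to availability. This is plain arithmetic on the two annual downtime numbers reported here, assuming an 8,760-hour year.

```python
HOURS_PER_YEAR = 8760  # 365 days; leap years ignored


def availability_percent(annual_downtime_hours: float) -> float:
    """Convert hours of downtime per year into percent availability."""
    return 100.0 * (1.0 - annual_downtime_hours / HOURS_PER_YEAR)


for label, downtime in [("commercial average", 6.2), ("our controller", 0.3)]:
    print(f"{label}: {availability_percent(downtime):.4f}%")
# commercial average: 99.9292%
# our controller: 99.9966%
```

That is roughly the difference between three nines and four nines of availability.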

What I’d Do Differently

  1. Use IEC 61850 from day one, not Modbus
  2. Hardware-in-the-loop testing for every state transition
  3. Simulate the 74HC125 failure in testing, because system designers miss edge cases
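
For the third item, a software-side approximation is to wrap whatever function reads the device and inject timeouts at a controlled rate. This is a sketch under stated assumptions: `FlakyReader` is a hypothetical test helper, and a raised `TimeoutError` only approximates how a failing line buffer looks from the driver. It does not replace hardware-in-the-loop testing.

```python
import random


class FlakyReader:
    """Wraps a read function and injects timeouts at a fixed probability."""

    def __init__(self, read_fn, drop_rate: float, seed: int = 0):
        self._read = read_fn
        self._drop_rate = drop_rate
        self._rng = random.Random(seed)  # seeded so test runs are repeatable

    def read(self):
        if self._rng.random() < self._drop_rate:
            raise TimeoutError("injected serial timeout")
        return self._read()


def read_soc():
    """Stand-in for a real BMS read; always returns a fixed value."""
    return 76.5


reader = FlakyReader(read_soc, drop_rate=0.1, seed=42)
failures = 0
for _ in range(1000):
    try:
        reader.read()
    except TimeoutError:
        failures += 1
print(f"injected failures: {failures} of 1000")
```

Running the controller's state machine against a reader like this shows whether dropped readings trigger load shedding or are handled as a communication fault.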

Conclusion

Commercial microgrid controllers are fine for demos. They’re designed to show clean transitions and pretty dashboards.

Field deployments are messy. Communication drops. Sensors fail. Operators make mistakes. Your controller needs to handle reality, not just the ideal case.

Build your own, or at least understand the architecture deeply enough to know what happens when their “cloud connectivity” goes down.

