Benchmarking and validation of prompting techniques for AI-assisted industrial PLC programming
Industrial automation in Industry 5.0 demands deterministic, safety-compliant PLC code across heterogeneous
vendor ecosystems. Prompt-engineered large language models (LLMs) offer a path forward but require reproducible methods and rigorous validation. This study introduces LLM-PLC-AS, a hybrid, prompt-invariant
framework for IEC 61131-3 PLC code generation addressing these needs. We benchmark 21 fixed prompting
techniques on 25 real-world use cases (simple, medium, complex), using a standardized dataset and workflow
spanning Siemens TIA Portal and Beckhoff TwinCAT. The quality of the generated code is evaluated through
a layered validation pipeline: Bilingual Evaluation Understudy (BLEU) for lexical similarity, LLM-in-the-Loop (LITL) for scalable semantic checks across four dimensions (functional correctness, readability, safety compliance, and modularity), and Human-in-the-Loop (HITL) for expert safety-critical review. DeepSeek and
Gemini 2.5 Pro generate ST/IL; syntax is cross-checked by ChatGPT-4o and Copilot Pro. The framework
achieved high accuracy on our scoring rubric: Structured Text (ST) programs reached near-perfect scores, and
Instruction List (IL) programs also performed strongly. Manual correction effort fell by nearly half relative
to ad-hoc prompting. Across tasks, the approach more than doubled Safety Compliance scores and significantly
improved Functional Correctness over unstructured baselines. A key finding is that prompt structure influences
determinism and correctness more than the choice of LLM. The
fixed-prompt reasoning combined with the BLEU/LITL/HITL validation stack provides a scalable, reproducible,
and safety-aware method for PLC code generation: BLEU handles rapid lexical triage and regression tracking,
LITL provides structured semantic verification, and HITL ensures final compliance. The framework
establishes a standardized basis for AI-assisted PLC programming and transparent benchmarking. Future work
will extend the pipeline to include graphical languages, such as Ladder Diagram (LAD) and Function Block
Diagram (FBD), using multimodal/graph-aware models, and will incorporate runtime validation to further close
the gap to real-world deployment. Safety verification in this study is limited to logical and semantic validation.
Real-time behavior, communication latency, and physical safety-fault recovery require Hardware-in-the-Loop
(HIL) simulation or deployment on industrial test benches, which is identified as future work.
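The lexical-triage stage described above can be illustrated with a minimal sentence-level BLEU score. This is a hedged sketch, not the paper's actual evaluation code: the tokenization (whitespace splitting of ST source), the smoothing constant, and the example program are illustrative assumptions.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Return the list of n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights and a brevity penalty.

    `candidate` and `reference` are token lists, e.g. whitespace-tokenized
    Structured Text (a simplifying assumption; real pipelines may use a
    language-aware tokenizer).
    """
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum((cand & ref).values())        # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    # Brevity penalty discourages overly short generated programs.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else exp(1 - r / max(c, 1))
    return bp * exp(sum(log(p) for p in precisions) / max_n)

# Example: compare generated ST against a reference implementation.
ref = "IF Start AND NOT Stop THEN Motor := TRUE ; END_IF".split()
gen = "IF Start AND NOT Stop THEN Motor := TRUE ; END_IF".split()
print(round(bleu(gen, ref), 3))  # identical code scores 1.0
```

A score near 1.0 flags lexically faithful output for fast regression tracking; semantic correctness still requires the LITL and HITL stages, since two lexically different programs can be functionally equivalent.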