image: Higher failure probability reduces system availability, jeopardizing reliability via prolonged service time. Admins can mitigate reliability risks using parallelism and sharing allocation. Parallelism increases unit power demands but reduces overall duration. Resource sharing lowers power consumption but may hinder performance. Thus, the performance-power trade-off hinges on unit power and service time, influenced by parallelism and sharing designs.
Credit: Shuyi MA, Jin LI, Jianping LI, Min XIE
Cloud services have become the backbone of digital infrastructure, underpinning everything from enterprise operations to cutting-edge innovations. Yet ensuring their reliability, performance, and energy efficiency remains a critical challenge. A multidisciplinary team led by Prof. Jin Li from Xi'an Jiaotong University, in collaboration with City University of Hong Kong and the University of Chinese Academy of Sciences, introduces a novel modeling approach to enhance cloud system efficiency while managing operational costs.
The study illuminates how virtualization technologies, while enabling efficient resource sharing (consolidating services on a node) and parallel execution (distributing tasks across multiple nodes), introduce inherent trade-offs. Random hardware and software failures can disrupt service continuity, while resource contention among co-located virtual machines and the complexities of parallel task coordination threaten performance and energy efficiency. "The challenge lies in harnessing the benefits of virtualization without compromising system stability," explains Prof. Li.
To tackle these challenges, the researchers developed a stochastic modeling framework rooted in transient-state analysis, a departure from traditional "steady-state" approaches that fail to capture the brevity of typical cloud service windows. The model simulates real-time system states under dynamic hardware and software failure and repair cycles, integrates service allocation strategies, and quantifies three system-level metrics. Reliability is measured by the rate of on-time service completion against strict deadlines; performance is evaluated through total execution time, accounting for parallel speedup and resource contention delays; and power consumption is calculated from dynamic node usage.
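To make the deadline-based reliability metric concrete, here is a minimal Monte Carlo sketch in Python. It is not the authors' model: it assumes a single node with exponentially distributed times to failure and repair, and it assumes work resumes where it stopped after each repair. The names `completion_time` and `on_time_probability`, and all parameter values, are hypothetical and chosen only for illustration.

```python
import random


def completion_time(work, failure_rate, repair_rate, rng):
    """Wall-clock time to finish `work` units of service on one node.

    Assumption (illustrative only): the node alternates between up periods,
    during which the service progresses, and down periods spent on repair.
    Progress is kept across failures.
    """
    elapsed = 0.0
    remaining = work
    while True:
        up = rng.expovariate(failure_rate)  # time until the next failure
        if up >= remaining:
            # The service finishes before the next failure occurs.
            return elapsed + remaining
        elapsed += up
        remaining -= up
        elapsed += rng.expovariate(repair_rate)  # time spent under repair


def on_time_probability(work, deadline, failure_rate, repair_rate,
                        runs=20000, seed=0):
    """Estimate the share of runs that complete no later than `deadline`."""
    rng = random.Random(seed)
    hits = sum(
        completion_time(work, failure_rate, repair_rate, rng) <= deadline
        for _ in range(runs)
    )
    return hits / runs


if __name__ == "__main__":
    # Hypothetical numbers: 10 time units of work, deadline of 12.
    print(on_time_probability(work=10.0, deadline=12.0,
                              failure_rate=0.1, repair_rate=1.0))
```

Because each run stops when the service completes rather than assuming long-run averages, the estimate reflects behavior within a finite service window, which is the kind of quantity a transient analysis targets.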
Using a multi-stage sampling method, the team conducted extensive simulations to uncover task allocation trade-offs. First, parallelization accelerates task completion but increases unit power demands, whereas resource sharing reduces power consumption at the potential cost of slower processing. Second, service parallelization increases exposure to hardware and software failures, while resource sharing amplifies the consequences of a single hardware failure. The study also found that workload levels shape the optimal strategy. Moderate workloads benefit from parallelization, which enhances both performance and energy efficiency. In contrast, high workloads require resource sharing to avoid capacity overloads, even at a slight cost in performance. In low-availability environments, cautious resource sharing is essential to mitigate fault risks and maintain service reliability.
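The direction of the first trade-off can be illustrated with a toy calculation. The sketch below is not taken from the study: it assumes an Amdahl-style speedup for parallel execution, a linear slowdown for co-located services, and a fixed power draw per node. The function names and default values (`serial_fraction`, `contention`, `node_power`) are hypothetical. It ignores failures and workload-dependent effects, so it says nothing about the study's findings on energy efficiency or optimal strategies.

```python
def parallel_plan(work, nodes, serial_fraction=0.1, node_power=100.0):
    """Duration and power draw when one service is spread over `nodes` nodes.

    Assumption: only the non-serial part of the work scales with node count
    (Amdahl-style), and each active node draws `node_power`.
    """
    duration = work * (serial_fraction + (1.0 - serial_fraction) / nodes)
    power = nodes * node_power  # more nodes, higher power draw
    return duration, power


def shared_plan(work, services, contention=0.15, node_power=100.0):
    """Duration and per-service power when `services` share one node.

    Assumption: each extra co-located service slows every service by a
    fixed `contention` fraction, and the node's power is split evenly.
    """
    duration = work * (1.0 + contention * (services - 1))
    power_per_service = node_power / services  # one node serves them all
    return duration, power_per_service


if __name__ == "__main__":
    for n in (1, 2, 4):
        print("parallel", n, parallel_plan(100.0, n))
    for k in (1, 2, 4):
        print("shared  ", k, shared_plan(100.0, k))
```

Under these assumptions, adding nodes shortens duration while raising power draw, and adding co-located services lowers per-service power while lengthening duration.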
"Our analysis treats reliability, performance, and power consumption as interconnected metrics, not competing objectives," explains Prof. Li. "By quantifying the effects of resource sharing and parallelization, CSPs can dynamically optimize allocations to meet service-level agreements while cutting costs."
Beyond immediate applications, the study inspires new directions for cloud operation, including integrating artificial intelligence for real-time fault prediction, designing hybrid architectures that balance scalability and efficiency, and advancing green cloud initiatives through intelligent service consolidation. By providing a unified framework to analyze and optimize cloud systems, the research equips stakeholders to build more resilient, efficient, and sustainable digital infrastructure for the future.
Journal
Frontiers of Engineering Management
Method of Research
Experimental study
Subject of Research
Not applicable
Article Title
Cloud-integrated cyber–physical systems: Reliability, performance and power consumption with shared-servers and parallelized services
Article Publication Date
23-Apr-2025