DISCES: Distributed SCE Scheduler Tools (SCE: Smart City/Cloud Engine), a DISIT tool for smart environments

A typical major requirement in a Smart City/Cloud environment is an engine for distributed task scheduling. For this purpose, DISIT Lab developed DISCES (Distributed SCE Scheduler), an efficient solution for smart management and scheduling. DISCES consists of a set of distributed instances of running agents performing concurrent tasks. The DISCES engine, with its cluster functionality, allows adding distributed nodes and defining jobs without service downtime.

Part of DISCES is released as open source on GitHub: SCE, Smart City/Cloud Engine interface

Smart City/Cloud Engine (example on Km4City, only locally accessible at DISIT Lab; book for a demo)

DISCES Cluster Monitoring (only locally accessible at DISIT Lab; book for a demo)

DISCES is a core component. It:

can be deployed on one or many Virtual Machines to create a distributed scheduler, in which the individual nodes automatically take their jobs independently, without any central scheduling service. The architecture is fully scalable and fault tolerant.
can be used for cloud management as in the ICARO solution and project, as a smart city engine as in Km4City and Sii-Mobility, and as a cultural content management system as in the SMST project.
can be connected to a knowledge base (RDF stores), MySQL databases, and NoSQL databases for gathering data to make decisions and for in/out processing.
can activate any internal or external/detached process, either as a direct executable on the operating system or through a REST invocation. Processes can be classical ETL, verification and validation processes, Hadoop management, SLA management, etc.
defines each scheduling job with a name and a related group, a fire instance id, a repeat count, a start and an end time, a job data map, a job status (i.e., fired, running, completed, success, failed), and some associated triggers with their metadata (i.e., name and group, period and priority of execution).
supports both concurrent and non-concurrent schemes for jobs, and allows direct monitoring of each job's activity with a push interface, reporting the current status of the job and the number of successes or failures in the last day or week, with relative percentages.
addresses the need for adaptive job scheduling that arises from the variety of the available hardware and of the jobs to be scheduled. For example, a reconfiguration process written for a particular CPU architecture should be bound to run on a certain set of scheduler nodes only; nodes with high CPU load could reject the execution of further tasks until their computation capacity is restored to acceptable levels; more generally, there could be the need to assign certain selected tasks only to nodes with a certain level of processing capacity.
provides mechanisms that allow scheduling tasks recursively, based on the results obtained in previous tasks. For example, a reconfiguration strategy consisting of various steps could require taking different actions on the basis of dynamic parameters evaluated at runtime.
allows adaptive job execution (e.g., based on the physical or the logical status of the host) and conditional job execution, supporting both system and REST calls. The user can build an arbitrary number of job conditions that must be satisfied in order to trigger a new job or a set of jobs, and can also specify multiple email recipients to be notified in case of a particular job result. By combining an arbitrary number of conditions, it is possible to define complex flow-chart job execution schemes for the management of different cloud scenarios. A trigger associated with a conditional job execution is created at runtime and deleted upon completion. It is possible to define physical or virtual constraints (e.g., CPU type, number of CPU cores, operating system name and version, system load average, committed virtual memory, total and free physical memory, free swap space, CPU load, IP address) that bind a job to a particular scheduler node. Smart cloud best practices require services and tools to collect and analyze huge amounts of data coming from different sources at periodic intervals. Virtual machines typically consist of hundreds of services and related metrics to be checked.
includes support for NoSQL, with the aim of allowing high performance in data retrieval and processing. SLAs often define bounds related to services or groups of services that consist of many applications, configurations, processing capacity or resource utilization. Collecting such a large amount of data could lead to unmanageable systems in a short period of time, even when adopting the best practices of DBMS management or clustering.
includes event reporting and logging services, for direct monitoring of the smart cloud infrastructure and the activity status of every cluster node, and notifications about the critical status of a system or service (e.g., sending of emails). Notifications can optionally be conditioned on the results of execution.
includes a web interface that allows monitoring the status of the cloud platform (i.e., hosts, virtual machines, applications, metrics, alerts and network interfaces), with details about the compliance of metrics with respect to the SLA, and a summary view of the global status of the cluster nodes (e.g., memory, disk, swap).
provides graphs of all the relevant metrics in order to perform deep data analysis.
performs SPARQL queries on the Knowledge Base to check the coherence of the services with respect to the SLA and, if needed, instructs the CM with a REST call to take reconfiguration actions (e.g., increment of storage, computational resources or bandwidth).
includes a logging service for registering every event related to the monitored services, and allows adjusting checking periods for each service.
allows defining policies to apply in case of misfired events (e.g., reschedule a job with existing or remaining job count), and allows producing detailed graphs for every metric (grouped per VM or not), with customizable time intervals.
reports for each metric the total number of times it was found to be out of scale, with respect to the total number of performed checks. Logged metrics report the list of SLA violations that occurred in the selected time period, with relevant data (e.g., the time at which the violation occurred, the name of the metric, the registered value, the threshold, and the related business configuration, virtual machine and SLA).
reports a global view of the cluster status and detailed views of each node. It is possible to monitor parameters such as last job execution time, number of jobs processed since the last restart, CPU load and utilization, time of last check, free physical memory and the total consumed computational capacity of the cluster (e.g., total CPU utilization, total capacity in terms of GHz and percentage of consumed capacity, total and free memory).
is presently released as open source for the web user interface part...
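
The node constraints and conditional job execution described in the list above can be illustrated with a minimal Python sketch. It is not taken from the DISCES code base: every name in it (NodeStatus, Job, node_accepts, next_jobs, the job names and the 0.8 load threshold) is a hypothetical choice made for illustration, and only a few of the listed constraint fields are modelled. The sketch assumes that a node accepts a job only when all of the job's constraints hold for the node's current status, and that the result of a job selects which jobs are triggered next.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class NodeStatus:
    """Snapshot of one scheduler node (hypothetical subset of the listed constraint fields)."""
    cpu_type: str
    cpu_cores: int
    os_name: str
    cpu_load: float        # fraction between 0.0 and 1.0
    free_memory_mb: int


# A constraint is a predicate evaluated against the current status of a node.
Constraint = Callable[[NodeStatus], bool]


@dataclass
class Job:
    """Hypothetical job record: name, group, data map, constraints, follow-up rules."""
    name: str
    group: str
    data_map: Dict[str, Any] = field(default_factory=dict)
    constraints: List[Constraint] = field(default_factory=list)
    # Maps a job result (e.g. "success", "failed") to the names of jobs to trigger next.
    on_result: Dict[str, List[str]] = field(default_factory=dict)


def node_accepts(job: Job, node: NodeStatus) -> bool:
    """Return True only if every constraint of the job holds on this node."""
    return all(check(node) for check in job.constraints)


def next_jobs(job: Job, result: str) -> List[str]:
    """Return the names of the jobs to trigger for the given result (empty if none)."""
    return job.on_result.get(result, [])


# Example: a reconfiguration job bound to x86_64 nodes that are not overloaded.
reconfigure = Job(
    name="reconfigure-storage",
    group="cloud",
    constraints=[
        lambda n: n.cpu_type == "x86_64",
        lambda n: n.cpu_load < 0.8,   # assumed threshold, for illustration only
    ],
    on_result={"success": ["verify-sla"], "failed": ["notify-admin"]},
)

idle = NodeStatus("x86_64", 8, "Linux", 0.25, 4096)
busy = NodeStatus("x86_64", 8, "Linux", 0.95, 4096)

print(node_accepts(reconfigure, idle))   # True
print(node_accepts(reconfigure, busy))   # False
print(next_jobs(reconfigure, "failed"))  # ['notify-admin']
```

Expressing constraints as plain predicates keeps the sketch short; a real deployment would more likely store them as data (field, operator, value) so that they can be edited from a web interface and persisted.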