AI and machine learning can reduce the number of false alerts reaching operations staff, speed up troubleshooting, and help developers and architects understand and manage rapidly changing, cloud-based IT environments.
But CIOs shouldn’t expect what some customers call “magic” results, such as automatically predicting and fixing any potential IT problem, or even just accepting and analyzing any log or event data without any data cleansing or normalization.
AIops is the use of artificial intelligence to manage, improve, and secure IT systems more quickly, efficiently, and effectively than manual operations. Market researcher Gartner estimates the AIops market at between $900 million and $1.5 billion in 2020, with a compound annual growth rate of around 15% between 2020 and 2025. Along with standalone AIops platforms, many IT monitoring and management tools are integrating with AIops platforms or adding AI capabilities to their products.
According to customers and analysts, AIops is best at quickly scanning vast amounts of data from hundreds or thousands of sources to filter out the most important alerts or identify key trends, as well as quickly discovering new items such as the APIs that connect apps: those “things that human intelligence can no longer handle,” says Sean Mack, CIO at Wiley, a global leader in research and education. It is ideal, he says, for providing insight into IT issues amid the “exponential growth in the complexity of our systems and services,” with virtual components that “may be there in one second and may not be there in another.”
But AIops efforts can fail if companies do not understand the technology’s limitations.
Where AIops excels
Identifying patterns. A common and successful use of AIops is to reduce “noise” from alerts that repeat other alerts, reflect normal changes in the IT infrastructure, or do not affect critical business operations, says Stephen Elliott, group vice president at market researcher IDC. It can also identify recurring issues, such as overloaded servers, so operations staff can apply a fix before those issues affect users. Linking multiple alerts to a single underlying issue, he says, can also reduce the burden on operations staff and speed up root cause analysis.
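The two ideas described above, dropping repeated alerts and linking related alerts to one underlying issue, can be illustrated with a minimal Python sketch. It is not how any vendor named in this article implements AIops: commercial platforms rely on machine learning and topology data, while this sketch swaps in two fixed rules. The `Alert` fields, the five-minute window, and the sample alerts are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All field names are hypothetical, chosen only for this sketch.
    source: str       # monitoring tool that raised the alert
    resource: str     # affected component, e.g. a server name
    message: str
    timestamp: float  # seconds since some reference point


def deduplicate(alerts):
    """Keep the earliest alert for each (resource, message) pair."""
    seen = set()
    unique = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.resource, alert.message)
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def correlate(alerts, window=300.0):
    """Group alerts on the same resource that fall within `window`
    seconds of the first alert in the group (a fixed-rule stand-in
    for the learned correlation a real platform would use)."""
    by_resource = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        by_resource[alert.resource].append(alert)

    incidents = []
    for items in by_resource.values():
        current = [items[0]]
        for alert in items[1:]:
            if alert.timestamp - current[0].timestamp <= window:
                current.append(alert)
            else:
                incidents.append(current)
                current = [alert]
        incidents.append(current)
    return incidents


# Made-up sample data: four raw alerts from two hypothetical tools.
raw_alerts = [
    Alert("apm", "db-01", "high latency", 0.0),
    Alert("infra", "db-01", "cpu overloaded", 30.0),
    Alert("apm", "db-01", "high latency", 60.0),   # repeat of the first
    Alert("infra", "web-02", "disk full", 1000.0),
]

unique_alerts = deduplicate(raw_alerts)
incidents = correlate(unique_alerts)

for incident in incidents:
    print(incident[0].resource, [a.message for a in incident])
```

In this sample, four raw alerts collapse into two incidents for an operator to review. Real environments need far richer signals than a shared resource name, which is one reason the data quality concerns raised later in this article matter.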
While still early in its AIops effort, drug distributor AmerisourceBergen, using New Relic’s monitoring platform, has seen a two-thirds reduction in needless alerts, allowing its engineers to focus on critical issues, better prioritize incidents, speed up root cause analysis, and increase application availability.
At Wiley, Mack’s team used Dynatrace AIops capabilities to reduce the number of false positives by more than 50 percent. When problems do occur, Wiley has reduced the average time to resolve them by more than 37 percent, which Mack calls “a huge, huge improvement.” All this allows his team, he says, to devote more time to improving the customer experience and offering new and innovative services.
Monitoring and tracking. AIops can also make it easier for operations personnel to track changes in their IT environment, monitor their performance, and cost-effectively manage larger environments. “We’re currently in the middle of a big acquisition,” Stewart says. “By leveraging AIops, we can take on an additional monitoring burden without significantly increasing staffing.”
Airport parking provider Park ‘N Fly uses the Dynatrace AIops platform to monitor its IT infrastructure as well as the APIs that provide information from partners, such as those that let customers track the location of their shuttle buses and purchase maintenance for their vehicles while they are traveling, says Senior Director of IT Ken Schirrmacher. Dynatrace also automatically detects new components, such as hosts Park ‘N Fly adds in the cloud, and “analyzes their behavior such as the data they access and other applications to which they send that data,” creating a map of the architecture that tracks how its IT components and infrastructure are integrated, he adds.
Mack says that one use of AIops at Wiley is to manage event logs not just for monitoring, but to understand the reasons behind the availability and reliability of its systems. “Monitoring has become obsolete,” he says. What he needs is “observability, i.e. the ability to ask questions and get answers. Monitoring may show you the response time (for systems) every second, but the question I want to ask is ‘Why is one user in Timbuktu having a problem?’”
Getting to the root causes. AIops is also useful for speeding up root cause analysis of problems, helping to determine “in which layer of the service map is (the problem): in the browser, in the database, in the code (or) a virtual network problem?” says Elliott. Wiley links data from all layers of the application stack, including database and application performance and how users experience its applications and services, and has used Dynatrace and other tools to achieve a 40% reduction in average time to resolve problems. “This means serious improvements in performance for our customers,” Mack says.
Many customers warn that AIops requires configuration and often will not deliver short-term cost reductions. “You won’t see savings up front” during the implementation phase, Schirrmacher says. “The benefit is largely down the road when you need fewer employees to manage your growing environment, to run it optimally, and you no longer need to schedule employees for late-night updates or to resolve outages, or to schedule updates on holidays.”
Where AIops falls short
Dealing with data deficiencies. The more high-quality data is available, the better a machine learning algorithm can understand and analyze the workings of a complex IT infrastructure. A lack of such data, or limits on the data an AIops platform can use, can limit the effectiveness of AIops, making sound data management critical to its success.
“Our early efforts at AIops suffered because vendors were unable to deliver on their promise to accept our ‘chaotic’ data and use it to identify anomalies and problems within the IT infrastructure,” says Vilius Ellikas, Head of Service Reliability and Observability at Danske Bank. Danske Bank sees “high potential” in its use of the StackState observability platform to “automatically aggregate, link, and tag data so our systems can see which infrastructure components support applications and services,” he says. This helps the bank “get the basics right before we get into the magic of machine learning.”
Notified, which uses a cloud-based infrastructure to provide connectivity and hosting for company events and communications, is running its first proof of concept for AIops using AIops capabilities in Splunk and New Relic, says CTO Thomas Squeo. While AIops is useful for accelerating root cause analysis and event aggregation, he says, Notified is still collecting the historical performance data needed to predict how many cloud resources it needs for large-scale events such as investor relations conferences.
Standardizing the required operational data about its infrastructure was important to AmerisourceBergen. “One of our biggest pain points was having siloed environments, each looking at its own toolset and the areas it supports rather than the big picture,” Stewart says. “Now that we have all the data in a central location, our AIops engine can link alerts from different sources,” allowing AmerisourceBergen team members to quickly focus on the underlying problem. “By connecting all the data in one place, we can begin to identify patterns that represent early warning signs of a brewing problem.”
Automated remediation. Fully automated remediation of security, performance, or other issues is another area where AIops can fall short of vendor promises, says Gregory Murray, senior research director at Gartner Inc.
He adds that some risks are difficult or impossible to predict, such as the exploitation of a previously unknown vulnerability. “It is also impossible for any AI system to evaluate all combinations of changes to the IT infrastructure and reliably predict the impact of those changes.”
Some IT organizations are starting to pick out the issues they are comfortable remediating automatically, Elliott says. In some cases, that means spinning up new services or new infrastructure to prevent performance degradation as transaction loads or needs increase, while in other cases services may be automatically migrated to a different AWS region or a different set of resources.
Notified currently automates remediation for only 20% to 25% of its application portfolio “…based on a risk profile,” says Squeo.
The culture shift ahead
For some, AIops is less a discipline in itself than another tool for IT and agile business processes. IDC calls it “IT Operations Analytics” and at Notified, “we don’t use the term AIops,” says Squeo. “We use the term ‘devsecops,’ which assumes good practices for monitoring, notification, events and utilization of AIops as part of the overall collaboration between development, operations and security.”
At Wiley, AIops is part of a broader move to give more responsibility for application and service quality to the teams that develop them. “We take a devops approach [in] our reliability and management,” Mack says. “Ultimately, the accountability is (with) the teams that build the systems,” who have the most at stake in how those systems perform in production.
Stewart anticipates that AIops will eventually facilitate a “team-wide cultural shift, where automation becomes the focus” rather than manually responding to a problem as it occurs. “As we mature, the focus will be on viewing the environment from a service perspective that will bring together application components, infrastructure, and business drivers.”