Though not originally designed to function in tandem, high-performance computing (HPC) and artificial intelligence (AI) have coalesced to become a cornerstone of the digital era, reshaping industry processes and pushing scientific exploration to new frontiers. The number-crunching prowess and scalability of HPC systems are fundamental enablers of modern AI-powered software. Such capabilities are particularly useful when it comes to demanding applications like planning intricate logistics networks or unravelling the mysteries of the cosmos. Meanwhile, AI similarly enables researchers and enterprises to do some clever workload processing, making the most out of their HPC systems.

"With the advent of powerful chips and sophisticated codes, AI has become nearly synonymous with HPC," said Professor Torsten Hoefler, Director of the Scalable Parallel Computing Laboratory at ETH Zurich.

A master of stringing various HPC components together, from hardware and software to education and cross-border collaborations, Hoefler has spent decades researching and developing parallel-computing systems. These systems enable multiple calculations to be carried out simultaneously, forming the very bedrock of today's AI capabilities. He is also the newly appointed Chief Architect for Machine Learning at the Swiss National Supercomputing Centre (CSCS), responsible for shaping the center's strategy related to advanced AI applications.

Collaboration is central to Hoefler's mission as a strong AI advocate. He has worked on many projects with various research institutions throughout the Asia-Pacific region, including the National Supercomputing Centre (NSCC) in Singapore, RIKEN in Japan, Tsinghua University in Beijing, and the National Computational Infrastructure in Australia, with research ranging from pioneering deep-learning applications on supercomputers to harnessing AI for climate modeling.

Beyond research, education is always at the top of Hoefler's mind. He believes in the early integration of complex concepts like parallel programming and AI processing systems into academic curricula. An emphasis on such education could ensure future generations become not just users, but innovative thinkers in computing technology.

"I'm specifically making an effort to bring these concepts to young students today so that they can better grasp and utilize these technologies in the future," added Hoefler. "We need to have an education mission; that's why I've chosen to be a professor instead of working in industry."

In his interview with Supercomputing Asia, Hoefler discussed his new role at CSCS, the interplay between HPC and AI, and his perspectives on the future of the field.

Q: Tell us about your work.

At CSCS, we're moving from a traditional supercomputing center to one that is more AI-focused, inspired by leading data center providers. One of the main things we plan to do is scale AI workloads for the upcoming "Alps" machine, poised to be one of Europe's, if not the world's, largest open-science AI-capable supercomputers. This machine will arrive early this year and will run traditional high-performance codes as well as large-scale machine learning for scientific purposes, including language modeling. My role involves assisting CSCS's senior architect Stefano Schuppli in architecting this system, enabling the training of large language models like LLaMA and foundation models for weather, climate or health applications.

I'm also working with several Asian and European research institutions on the "Earth Virtualization Engines" project. We hope to create a federated network of supercomputers running high-resolution climate simulations. This "digital twin" of Earth aims to project the long-term human impact on the planet, such as carbon dioxide emissions and the distribution of extreme events, which is particularly relevant for regions like Singapore and other Asian countries prone to natural disasters like typhoons.

The project's scale requires collaboration with many computing centers, and we hope Asian centers will join to run local simulations. A significant aspect of this work is integrating traditional physics-driven simulations, like solving the Navier-Stokes or Euler equations for weather and climate prediction, with data-driven deep learning methods. These methods leverage the vast amounts of sensor data about the Earth collected over decades.

In this project, we're targeting a kilometer-scale resolution, which is crucial for accurately resolving clouds, which are a key component of our climate system.

Q: What is parallel computing?

Parallel computing is both straightforward and fascinating. At its core, it involves using more than one processor to perform a task. Think of it like organizing a group effort among many people. Take, for instance, the task of sorting a thousand numbers. This task is challenging for one person but can be made easier by having 100 people sort 10 numbers each. Parallel computing operates on a similar principle, where you coordinate multiple execution units, like our human sorters, to complete a single task.
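To make the sorting analogy concrete, here is a minimal sketch in Python using the standard multiprocessing module: the input is split into chunks, each worker process sorts its own chunk, and the sorted chunks are merged at the end. The worker count and helper names are illustrative choices, not something from the interview.

```python
# Minimal parallel-sort sketch: split the input, sort the chunks in parallel
# worker processes, then merge the sorted chunks back together.
import heapq
import random
from multiprocessing import Pool

def sort_chunk(chunk):
    """Each worker sorts its own slice of the data, like one person sorting 10 numbers."""
    return sorted(chunk)

def parallel_sort(numbers, workers=4):
    # Split the data into roughly equal chunks, one per worker.
    chunk_size = (len(numbers) + workers - 1) // workers
    chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]
    with Pool(processes=workers) as pool:
        sorted_chunks = pool.map(sort_chunk, chunks)
    # Coordination step: merge the independently sorted chunks into one list.
    return list(heapq.merge(*sorted_chunks))

if __name__ == "__main__":
    data = [random.randint(0, 10_000) for _ in range(1_000)]
    assert parallel_sort(data) == sorted(data)
    print("parallel sort matches the sequential result")
```

The merge at the end is the coordination cost that any parallel approach has to pay; the interview's human sorters would face the same step.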

Essentially, you could say that deep learning is enabled by the availability of massively parallel devices that can train massively parallel models. Today, the workload of an AI system is extremely parallel, allowing it to be distributed across thousands, or even millions, of processing components.
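As a toy illustration of that kind of distribution, the sketch below (plain numpy, with "devices" simulated as data shards; the linear model and all numbers are illustrative assumptions) lets each shard compute a gradient on its own slice of the data and then averages the gradients, the basic data-parallel pattern behind distributed training.

```python
# Toy data parallelism: each simulated "device" computes a gradient on its own
# shard of the data, and the shard gradients are averaged before every update.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))            # inputs
y = X @ rng.normal(size=8)                # targets from a hidden linear model
w = np.zeros(8)                           # weights we want to learn

def shard_gradient(X_shard, y_shard, w):
    """Mean-squared-error gradient computed on one shard (one 'device')."""
    error = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ error / len(y_shard)

num_devices, lr = 4, 0.1
X_shards = np.array_split(X, num_devices)
y_shards = np.array_split(y, num_devices)

for step in range(200):
    grads = [shard_gradient(Xs, ys, w) for Xs, ys in zip(X_shards, y_shards)]
    w -= lr * np.mean(grads, axis=0)      # stand-in for an all-reduce across devices

best = np.linalg.lstsq(X, y, rcond=None)[0]
print("max weight error:", np.abs(w - best).max())
```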

Q: What are some key components for enabling, deploying and advancing AI applications?

The AI revolution we're seeing today is basically driven by three different components. First, the algorithmic component, which determines the training methods such as stochastic gradient descent. The second is data availability, crucial for feeding models. The third is the compute component, essential for number-crunching.
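To ground the algorithmic component, here is a minimal stochastic gradient descent loop for logistic regression on synthetic data; the model, learning rate and batch size are illustrative assumptions rather than anything discussed above. The random minibatch sampling is what makes the method "stochastic".

```python
# Minimal stochastic gradient descent for logistic regression on synthetic data.
# Each step draws a random minibatch and takes a small step along its gradient.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
true_w = rng.normal(size=5)
y = (X @ true_w + 0.1 * rng.normal(size=2000) > 0).astype(float)

w, lr, batch = np.zeros(5), 0.5, 64
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch)       # random minibatch of examples
    Xb, yb = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-(Xb @ w)))             # predicted probabilities
    grad = Xb.T @ (p - yb) / batch                  # logistic-loss gradient
    w -= lr * grad                                  # SGD update

accuracy = np.mean(((X @ w) > 0) == (y == 1))
print(f"training accuracy: {accuracy:.3f}")
```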

To build an effective system, we engage in a co-design process. This involves tailoring HPC hardware to fit the specific workload, algorithm and data requirements. One such component is the tensor core, a specialized matrix multiplication engine integral to deep learning. These cores perform matrix multiplications, a central deep-learning task, at blazingly fast speeds.
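The sketch below emulates the numerical recipe tensor cores are built around, low-precision inputs combined with higher-precision accumulation, using plain numpy on the CPU. It illustrates the precision trade-off only, not tensor-core speed, and the matrix sizes are arbitrary.

```python
# Emulating tensor-core-style numerics on the CPU: round the inputs to float16,
# multiply and accumulate in float32, then compare against a float64 reference.
import numpy as np

rng = np.random.default_rng(0)
A32 = rng.standard_normal((256, 256)).astype(np.float32)   # stand-in activations
B32 = rng.standard_normal((256, 256)).astype(np.float32)   # stand-in weights

ref = A32.astype(np.float64) @ B32.astype(np.float64)      # high-precision reference

A16, B16 = A32.astype(np.float16), B32.astype(np.float16)  # low-precision inputs
mixed = A16.astype(np.float32) @ B16.astype(np.float32)    # higher-precision accumulation

rel_err = np.abs(mixed - ref).max() / np.abs(ref).max()
print(f"relative error with float16 inputs and float32 accumulation: {rel_err:.2e}")
```

For typical deep-learning workloads this level of error is negligible, which is why the hardware can afford to shrink the inputs.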

Another crucial aspect is the use of specialized, small data types. Deep learning aims to emulate the brain, which is essentially a biological circuit. Our brain, this dark and mushy thing in our heads, is teeming with about 86 billion neurons, each with surprisingly low resolution.

Neuroscientists have shown that our brain differentiates around 24 voltage levels, equivalent to just a bit more than 4 bits. Considering that traditional HPC systems operate at 64 bits, that's overkill for AI. Today, most deep-learning systems train with 16 bits and can run with 8 bits, which is sufficient for AI, though not for scientific computing.
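As a rough illustration of what shrinking the bit width does to stored values, here is a toy uniform integer quantization round trip in numpy; real low-precision training formats such as FP16 or FP8 are floating-point rather than integer formats, so this is only a sketch of the general idea, and all values are synthetic. The log2 line is the arithmetic behind "a bit more than 4 bits".

```python
# Toy uniform quantization: map float32 values to low-bit integers and back,
# then check how much error the round trip introduces. Illustrative only.
import numpy as np

print("bits needed for 24 levels:", np.log2(24))     # ~4.58, "a bit more than 4 bits"

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)     # stand-in for trained weights

def quantize_dequantize(x, bits):
    """Symmetric uniform quantization to signed integers of the given bit width."""
    levels = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits
    scale = np.abs(x).max() / levels
    q = np.clip(np.round(x / scale), -levels, levels)
    return q.astype(np.float32) * scale

for bits in (16, 8, 4):
    err = np.abs(quantize_dequantize(w, bits) - w).max()
    print(f"{bits}-bit round trip, max absolute error: {err:.5f}")
```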

Lastly, we look at sparsity, another trait of biological circuits. In our brains, each neuron isn't connected to every other neuron. This sparse connectivity is mirrored in deep learning through sparse circuits. In NVIDIA hardware, for example, we see 2-to-4 sparsity, meaning out of every four elements, only two are connected. This approach leads to another level of computational speed-up.
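Below is a numpy sketch of the pruning pattern itself: in every group of four consecutive weights, the two largest in magnitude are kept and the other two are zeroed. Real 2:4-sparse hardware also stores the pattern compactly and skips the zeros during the multiplication, which this CPU sketch does not attempt; the weight matrix here is synthetic.

```python
# Sketch of 2:4 structured sparsity: in every group of four consecutive weights,
# keep the two with the largest magnitude and zero out the other two.
import numpy as np

def prune_2_of_4(weights):
    """Return a copy of `weights` with a two-out-of-four sparsity pattern."""
    w = weights.reshape(-1, 4).copy()                 # groups of four elements
    drop = np.argsort(np.abs(w), axis=1)[:, :2]       # two smallest-magnitude entries
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)
W_sparse = prune_2_of_4(W)
print("fraction of zeros:", np.mean(W_sparse == 0.0))  # 0.5 by construction
```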

Overall, these developments aim to improve computational efficiency, a crucial factor given that companies invest millions, if not billions, of dollars to train deep neural networks.

Q: What are some of the most exciting applications of AI?

One of the most exciting prospects is in the weather and climate sciences. Currently, some deep-learning models can predict weather at a cost 1,000 times lower than traditional simulations, with comparable accuracy. While these models are still in the research phase, several centers are moving toward production. I anticipate groundbreaking advancements in forecasting extreme events and long-term climate trends, for example, predicting the probability and intensity of typhoons hitting places like Singapore in the coming decades. This is vital for long-term planning, like deciding where to build along coastlines or whether stronger sea defenses are necessary.

Another exciting area is personalized medicine, which tailors medical care based on individual genetic differences. With the advent of deep learning and big data systems, we can analyze treatment data from hospitals worldwide, paving the way for customized, effective healthcare based on each person's genetic makeup.

Finally, most people are familiar with generative AI chatbots like ChatGPT or Bing Chat by now. Such bots are based on large language models with capabilities that border on basic reasoning, and they already show primitive forms of logic. They're learning concepts like "not cat", a simple form of negation but a step toward more complex logic. It's a glimpse into how these models might evolve to compress knowledge and concepts, much like how humans developed mathematics as a simplification of complex ideas. It's a fascinating direction, with potential developments we can only begin to imagine.

Q: What challenges can come up in these areas?

In weather and climate research, the primary challenge is managing the colossal amount of data generated. A single high-resolution, kilometer-scale ensemble climate simulation can produce over an exabyte of data. Handling this data deluge is a significant task and requires innovative strategies for data management and processing.

The shift toward cloud computing has broadened access to supercomputing resources, but this also means handling sensitive data like healthcare records on a much larger scale. Thus, in precision medicine, the main hurdles are security and privacy. There's a need for careful anonymization to ensure that people can contribute their health records without fear of misuse.

Previously, supercomputers processed highly secure data only in secure facilities that could only be accessed by a limited number of individuals. Now, with more people accessing these systems, ensuring data security is vital. My team recently proposed a new algorithm at the Supercomputing Conference 2023 for security in deep-learning systems using homomorphic encryption, which received both the best student paper and the best reproducibility advancement awards. This is a completely new direction that could contribute to solving security in healthcare computing.
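The interview does not spell out that algorithm, so purely as a generic illustration of what homomorphic encryption makes possible, here is a toy sketch built on the python-paillier package (phe), which is assumed to be installed and which supports additively homomorphic operations: a server evaluates a linear score on encrypted inputs without ever seeing them. This is the underlying idea only, not the awarded method.

```python
# Toy homomorphic-encryption illustration (not the algorithm from the paper):
# a client encrypts its features, a server evaluates a linear model directly on
# the ciphertexts, and only the client can decrypt the resulting score.
# Assumes the python-paillier package is available: pip install phe
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Client side: encrypt private inputs (say, health measurements).
features = [0.7, 1.3, -0.2]
encrypted_features = [public_key.encrypt(x) for x in features]

# Server side: the weights are plaintext, the inputs stay encrypted. Paillier
# supports ciphertext + ciphertext and ciphertext * plaintext scalar, which is
# all a linear layer needs.
weights, bias = [0.5, -1.0, 2.0], 0.1
encrypted_score = sum(w * x for w, x in zip(weights, encrypted_features)) + bias

# Client side: decrypt the result.
print("score:", private_key.decrypt(encrypted_score))
```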

For large language models, the challenge lies in computing efficiency, specifically in terms of communication within parallel computing systems. These models require connecting thousands of accelerators through a fast network, but current networks are too slow for these demanding workloads. To address this, we've helped to initiate the Ultra Ethernet Consortium to develop a new AI network optimized for large-scale workloads.
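To give a feel for why the interconnect matters, here is a back-of-the-envelope latency-bandwidth ("alpha-beta") cost model of a ring all-reduce, the collective that synchronizes gradients across accelerators during training. Every number in it (model size, accelerator count, latency, link speeds) is an illustrative assumption, not a measurement or a figure from the interview.

```python
# Back-of-the-envelope alpha-beta cost model for a ring all-reduce of the
# gradients of a large model. All parameter values below are assumptions.
def ring_allreduce_time(num_gpus, message_bytes, alpha, beta):
    """Classic ring all-reduce: 2*(p-1) steps, each moving message_bytes/p bytes."""
    p = num_gpus
    steps = 2 * (p - 1)
    return steps * (alpha + (message_bytes / p) * beta)

gradient_bytes = 70e9 * 2           # e.g. a 70B-parameter model in 16-bit precision
alpha = 5e-6                        # assumed per-message latency in seconds

for gbps in (100, 400, 800):        # assumed per-link bandwidths in Gbit/s
    beta = 1.0 / (gbps * 1e9 / 8)   # seconds per byte
    t = ring_allreduce_time(num_gpus=1024, message_bytes=gradient_bytes,
                            alpha=alpha, beta=beta)
    print(f"{gbps:4d} Gbit/s links: ~{t:.1f} s per full gradient all-reduce")
```

In this simple model the synchronization time scales almost inversely with link bandwidth, which is the lever efforts like Ultra Ethernet aim to pull.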

These are just some preliminary solutions in these areas; industry and computing centers need to explore them for implementation and refine them further to make them production-ready.

Q: How can HPC help address AI bias and privacy concerns?

Tackling AI bias and privacy involves two main challenges: ensuring data security and maintaining privacy. The move to digital data processing has changed how secure and private our data is. The challenge is twofold: protecting infrastructure from malicious attacks and ensuring that personal data doesn't inadvertently become part of training datasets for AI models.

With large language models, the concern is that data fed into systems like ChatGPT might be used for further model training. Companies offer secure, private options, but often at a cost. For example, Microsoft's retrieval-augmented generation technique ensures data is used only during the session and not embedded in the model permanently.
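Retrieval-augmented generation is only named in passing above, so as a generic toy sketch (not Microsoft's or anyone else's actual implementation), the snippet below shows the core idea: the user's documents are searched at query time with a crude bag-of-words similarity and pasted into the prompt, instead of being trained into the model's weights. The documents and query are invented for illustration.

```python
# Toy retrieval-augmented generation: the private documents are searched at
# query time and pasted into the prompt; they are never used to train weights.
import math
from collections import Counter

documents = [
    "The Alps system at CSCS targets large-scale machine learning workloads.",
    "Kilometer-scale climate simulations can produce exabytes of output data.",
    "Tensor cores accelerate low-precision matrix multiplication.",
]

def embed(text):
    """Very crude 'embedding': a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

query = "How much data do climate simulations generate?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)    # this prompt, not the document store itself, goes to the model
```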

Regarding AI biases, they often stem from the data itself, reflecting existing human biases. HPC can aid in "de-biasing" these models by providing the computational power needed. De-biasing is a data-intensive process that requires substantial computing resources to emphasize less represented aspects of the data. It's largely up to data scientists to identify and rectify biases, a task that requires both computational and ethical considerations.
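One simple and common ingredient of such de-biasing is reweighting, so that under-represented groups are not drowned out during training. The sketch below computes inverse-frequency sample weights for a skewed toy dataset; it illustrates the general idea only and is not a method described in the interview.

```python
# Minimal reweighting sketch: give each training example a weight inversely
# proportional to how common its group is, so under-represented groups are
# not drowned out in a weighted loss or sampler. Illustrative only.
import numpy as np

groups = np.array(["a"] * 900 + ["b"] * 90 + ["c"] * 10)   # skewed toy dataset

values, counts = np.unique(groups, return_counts=True)
inv_freq = {g: len(groups) / (len(values) * c) for g, c in zip(values, counts)}
sample_weights = np.array([inv_freq[g] for g in groups])

# Each group now carries the same total weight.
for g in values:
    print(g, round(sample_weights[groups == g].sum(), 2))
```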

Q: How crucial is international collaboration when it comes to regulating AI?

International collaboration is absolutely crucial. It's like weapons regulation: if not everyone agrees and abides by the rules, the regulations lose their effectiveness. AI, being a dual-use technology, can be used for beneficial purposes but also has the potential for harm. Technology designed for personalized healthcare, for instance, can be employed in creating biological weapons or harmful chemical compounds.

However, unlike weapons, which are predominantly harmful, AI is primarily used for good: enhancing productivity, advancing healthcare, improving climate science and much more. The variety of uses introduces a significant grey area.

Proposals to limit AI capabilities, like those suggested by Elon Musk and others, and the recent US Executive Order requiring registration of large AI models based on compute power, highlight the challenges in this area. This regulation, interestingly defined by computing power, underscores the role of supercomputing in both the potential and regulation of AI.

For regulation to be effective, it absolutely must be a global effort. If only one country or a few countries get on board, it just won't work. International collaboration is probably the most important thing when we talk about effective AI regulation.