In this episode of SwarmCast, our CEO Harel Boren is joined by Dr. Andriy Burkov, PhD in AI, author of The Hundred-Page Machine Learning Book and The Hundred-Page Language Models Book, and one of the most respected voices in the global AI community.
Join us as we explore:
1. What inspired Andriy to translate complex AI theory into practical, digestible knowledge for practitioners around the world?
2. How are platforms like the SwarmOne Autonomous AI Infrastructure Platform reshaping the way AI is developed and deployed at scale?
Read the full transcript
Harel Boren: Welcome to SwarmCast, a podcast dedicated to exploring key AI and data science topics with industry leaders. SwarmOne is the autonomous AI infrastructure platform for all workloads, from training through evaluation and all the way to deployment. It is a self-setting and self-optimizing platform, and it works across all compute environments, whether on-prem or in the cloud.
So I'd like to introduce today's guest, and a renowned one I should say: Dr. Andriy Burkov, a renowned professional figure in the AI field and author of the famous books The Hundred-Page Machine Learning Book, considered by many, including myself, a must-have for any professional in the field, and The Hundred-Page Language Models Book, which in my own personal view is actually the best book written on the subject at such length, and you will see for yourself. Almost needless to say, Andriy is a recognized voice in the international AI community. So welcome, Andriy.
Dr. Andriy Burkov: Thanks for having me.
Harel Boren: Real pleasure.
You’ve written several books on machine learning and recently published a second one, so let’s kick off with a quick introduction of your own background and expertise.
Guest introduction
Dr. Andriy Burkov: Well, yeah, I did my master's and PhD in artificial intelligence back in 2005-2010. My specialization was computational game theory and multi-agent systems, so we worked on agents before everyone started to talk about them, like a decade before. When I finished my PhD I joined Fujitsu as a scientific consultant, and then I worked for more than a decade for a company, later acquired by Gartner, that does talent-marketplace analytics. There I was responsible for leading a machine learning team where we solved problems like information extraction from documents, classification, and data normalization, all applied at a large scale of nearly 20 million documents every week. I also published several books on machine learning and, as you said, recently on language models. So I try, you know, to diversify my interests. Ask me any questions and we will see where it goes.
Harel Boren: Well, very impressive background, certainly. And I can attest from my own account that the books are nothing less than lovable, I should say, and you already know that, so I will try not to repeat it. As you know, I'm a great fan of your hundred-page book format. I think it's an innovative way to approach the whole issue, and I find it sitting exactly on the much-needed overlap between professional, useful, and concise, which you almost never get together: it's either non-professional and therefore not useful but concise, or the other way around, etc. And I'm kind of thinking to myself, what motivated you to distill complex machine learning concepts into concise, accessible formats like the hundred-page books?
Dr. Andriy Burkov: Well, I think it's an intersection of my two personal feelings. The first feeling is that I myself struggled to learn the topic and I was frustrated about it. As you said, some texts are too high-level, some are too academic. So I just needed something, you know, that you can start and finish in a weekend and have a good internal model of what it is. Not just what you can do with it without understanding how it works, and not just understanding how it works without knowing how to apply it. This was the first thing, and I think the second thing is that I really like challenges, and explaining something complex simply makes me feel proud of myself. Okay, I spent so much time and so much effort, and now I can write a very detailed but still understandable overview in a matter of several pages. For me it's pure knowledge distillation, if you want. So yeah, it was frustration plus challenge that motivated me.
Harel Boren: Yeah, it's kind of, if I may call it, a knowledge cookie. Many books, I find, are in some cases very, very good books, but they're just too large and they become a challenge in their own right. And you mentioned a weekend, and I'm just considering, you know, what you can do in a weekend: you certainly cannot cover a 600-page book in a weekend. So then it becomes a challenge, you know, going from weekend to weekend. And I really think that is a real achievement, and a very useful one. Yeah.
Dr. Andriy Burkov: And also, just to add to this: even six years ago it was already a problem for people to find time. Many professionals working in computer science, in IT, in the software industry, wanted to know more about machine learning. At the time it was not the overhyped topic it is today, but six years ago, when I published my first book, people just wanted to enter the field. They didn't even know whether it would, you know, play out as something big. They just heard more and more people, companies like Google, saying we apply machine learning here and there. And they wanted to kind of feel it too. And even then, six years ago, it was hard for people to find time.
So when you opened the existing books of the time, with 600 or 1,000 pages, you just felt like you would never make it. And if you will never make it, you lose all motivation to even start. Why would I, you know, spend weekends, not having time with my family, kids, friends, if in the end I will abandon it anyway? So then it was this challenge. Today it's a similar but different challenge. Today there is so much information, and this includes chatbots, where some people say, okay, why do we even need to read books if we can just ask questions? There is so much information.
So people lose motivation because they don't even know where to start. You can start with a chatbot, but what questions do you ask? Should you trust the answers? Should you receive some equation from the chatbot and then Google it and verify whether it's true or not? And what is the straight way from point A to point B in the learning sense? So having a short, compact book that claims to cover almost everything is motivating for people.
Harel Boren: I perfectly see that and find it really, really compelling. It also serves as a handbook to go back to when you really want to dig into something which you haven't been using, or haven't been thinking about, for a year or two.
What developments in AI infrastructure or tooling excite you the most?
So that actually brings me, rolls me forward, to my next question, which is: as someone who is so deeply involved in both the theoretical and practical aspects of AI, what developments in AI infrastructure or tooling excite you the most currently? There's a lot going on, so I think you've got to think about what the top ones are.
Dr. Andriy Burkov: Yeah, well, I think what excites me the most is to see to what extent we can scale the current language models. Of course AI isn't, you know, restricted to only language models, but today almost everyone works in one way or another with language models. So, seeing to what extent you can scale them, because as a scientist you don't know whether something that works at this size, when you increase the size 10 times, will provide any substantial benefit. And we have seen models pass from 1 to 3 billion parameters in 2023 to today's trillions of parameters, and we actually see a benefit from it. So this is cool. The second is the context size. Again, anyone who worked on machine learning before language models would say that, you know, the longer your input, the harder it is for the model to pay attention to all parts of this input. And I'm sure you remember that before ChatGPT, the model with the longest context was BERT, and it was 512 tokens.
And everyone said: no, because of this quadratic complexity of attention calculation, going to longer contexts just doesn't make much sense economically and practically. So 512 is already a good compromise; find workarounds for how to work with these models to solve your specific business problems. And as a practitioner, in my team we chunked the text into small pieces and fed those pieces one by one to BERT. Then we aggregated those predictions somehow: average, majority vote, whatever. And we said, okay, well, this is what we can have. It's quadratic complexity, what do you want? But now you see: okay, it's 1,000 tokens, then 2,000 tokens, then 16,000, then 32,000, then 128,000, now a million. And Google claims 10 million tokens. And you kind of start to visualize this long context. I just posted yesterday: you take a 1-million-token context, so you need to calculate this attention between each of the 1 million tokens.
So it's 1 million times 1 million dot products, a crazy number. And most people who are not in the domain will say, well, it's a large company, those are large GPUs, those are expensive, it should work. But if you actually try to implement it, it's crazily complicated. And not only do they manage to do this, they manage to do it fast. You just submit your million-token text and you start getting output a fraction of a second later. Yes, it's unbelievable. It's unbelievable even for someone who understands, you know, the mechanics of it. How they manage to squeeze the maximum from the equipment that they have is really, really fascinating.
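To put rough numbers on the quadratic cost described above, here is a back-of-envelope sketch. It only counts attention score entries (one dot product each); real systems add many optimizations, so this is an upper-bound intuition, not a measurement:

```python
# Back-of-envelope: full self-attention computes a score between every
# pair of tokens, so the score matrix has context_len^2 entries.
def attention_dot_products(context_len: int) -> int:
    """Number of pairwise dot products for one full attention pass."""
    return context_len * context_len

# BERT-era context vs. a modern million-token context:
bert_era = attention_dot_products(512)
million = attention_dot_products(1_000_000)

# Growing the context ~2,000x grows the attention cost ~4,000,000x,
# which is the "quadratic complexity" objection from the BERT days.
growth = million // bert_era
```

Run as-is, `bert_era` is 262,144 while `million` is a trillion, which is why serving million-token contexts in a fraction of a second is such an engineering feat.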
So yeah, what I'm really excited about is the scale of it. I know that for several years already there has been no substantial progress in the architecture of the thing. Different tricks to make things simpler, more effective: they tried many of them, and today everything has converged to more or less the same architecture. But the scaling is really something where we don't know where it will lead us, and we just sit and observe.
Harel Boren: Yes. So thank you very much for that.
I actually read your post this morning about the costs associated with million-token context models
I actually read your post this morning, about the costs associated with an effective 1-million-token context model, and I was rather stunned by it. And I love the phrase that you used: this is an unprecedented display of generosity, because some have billions to kill Google, while Google spends billions not to be killed. I was actually planning to ask you your deeper view about the monster lurking in the dark, especially insofar as it applies to enterprises dependent on LLMs. With this ever-growing monster and ever-growing context size, eventually someone has to pay for it. And it seems this might be putting us all in a Formula One car heading into the abyss. So yeah, I'd love to hear a little bit more of your thoughts about this.
Dr. Andriy Burkov: Yeah, well, I didn't see their accountants' sheets, you know, to see the exact numbers. But I'm someone who actually rented real GPUs in the cloud and used them to train language models; we did it in my previous company on a daily basis. So we rented what they call a node. A node is a piece of physical equipment with, let's say, eight modern GPUs connected together with fast communication ports. You have it, you put your model on it, and you train. Okay. Renting such a node cost us about one hundred fifty thousand dollars a year. It's a huge number: you can buy a house, or several houses, depending on where you want to live. And we are talking about a small piece of equipment like this, and Nvidia prints them; they have a printer. Nvidia today is, I think, the definition of a money printer. So to try to train something on this equipment, we rent it; we don't even buy it. This $150,000 is the salary, you know, in North America, of some very, very capable software engineer. So we pay for it and we train on it. This $150,000 should then be converted back into some profits that the company wants to achieve, and no one gives you any guarantee that when you finish your training you will have some product that you will beat the market with. And you have your team like this. And now, because of all the hype, almost any company that can afford at least one node like this has one. So it's billions and billions of dollars poured into renting this equipment. Not even buying, just renting. In the end these companies invested all this money and trained some models, and maybe they were good, maybe not so good. In the end someone must pay the bill. But what happens today? You see all these products based on LLMs being given to you for free.
So you have free ChatGPT, you have free Gemini, you have free Anthropic. Well, they also have this $20-a-month plan, but most of their capabilities they give for free, and most users don't even reach the maximum number of questions they can ask. So people just get used to getting this huge amount of money for free. Well, they cannot, you know, withdraw this money and go to a restaurant, but they use it; they burn this money through their daily activities, and those companies pay for it, because if you stop offering the same free service to the customers, they will just go somewhere else. And for Google, for example, today is, I think, one of the most uncertain times for Google as a company, because no one had challenged them in search. Well, some tried; most failed, probably except for Bing and Microsoft, who managed to build a quite capable search engine. But again, because it wasn't substantially better than Google, people decided not to move to Microsoft, because, well, what's the point? If I'm already used to the UI, and I trust that among the top 10 links I will most likely find my results, then if I go to Microsoft I'm not sure; maybe it's worse and I will miss some opportunities, right? But with those chat models, people say, oh, wow, okay, this one claims they beat everyone on math, and this one claims they beat everyone on, you know, instruction following, and so on. So people go from one to another. So you cannot just afford to sit and wait until, you know, people realize that in the end they are all the same and stick with Google; no one will really have time to move to something else. So everyone throws money into this fire, and I think everyone just waits to see who will blink first, who will say, okay, I think I'm done with it, I will no longer burn billions on training those models trying to make them better.
But in the end, people don't see any difference. So in the end, one will blink, the second will blink, the third will blink. And once everyone has blinked, we will be back to business as usual. And what does business as usual mean? It means that if you generate costs for the company, they will be billed back to you. As I started with my $150,000-a-year rental of a small piece of equipment: this $150,000 a year will be billed to you, the end user, and you will really not like to be billed even a thousand dollars, let alone $150,000. This is crazy money; people just don't realize how crazy it is. Some say, oh, they invested $3 billion. Just recently OpenAI bought this text editor, a text editor for software developers, for $3 billion. It's hard to imagine for a normal person. If you tell a normal person it cost 1 million, or it cost 100 million, or it cost 3 billion, for them it's all the same: oh, it's a large number, but I don't care. No: on 3 billion, I think you and your entire family could survive for a thousand generations into the future; that's how much money it is. So someone in the end has to pay for it, when this dust settles down.
Harel Boren: Yeah, it actually brings up many questions. How can a moat be created around such a service? And if a moat cannot be created, why is the money spent? And maybe that's one avenue that is interesting to also speak about.
Where does this put Nvidia and other chip manufacturers in the future
And that is perhaps your perception of the prospect of vertical integration among the hyperscalers. Because if we're speaking about Google, we see that they have had the TPU for a long time, and AWS is building Trainium, and I forget what Azure has, but we see the same players that are playing at the big-LLM level also seeking to vertically integrate. Where does this put Nvidia and the classic, so-called chip manufacturers? Where do you think it puts them in the future?
Dr. Andriy Burkov: Well, this is a good question. How significantly are they ahead of anyone else? Yes, Google has TPUs, and you see their specs; they are quite similar to the state of the art from Nvidia. But we don't know what Nvidia has in the baking. Maybe they have something, you know, unbelievable, a hundred times more capable. Usually I doubt that such jumps are possible. Sometimes they happen, but most of the time they happen by chance. It's like with ChatGPT: no one thought that a language model could talk, and then they just trained 10-times-larger models and voila, we see that it can talk. So maybe Nvidia will surprise everyone with something really big and the market will spend a decade catching up. I doubt it, but it's possible. An even more realistic scenario is this: today, chip production is no longer a restricted field for a few players who have been there for decades, where penetrating the market is extremely hard. Today specialists are bought and sold. You work for one chip maker, then they offer you $100 million a year and you go and work for another one. It's like with cars: when I was born, there were, you know, brands of cars and they didn't change. But today anyone can enter; even several years ago they talked about Apple wanting to build their own car. Well, they decided not to do it, but they would easily have been able to. Or, for example, Elon Musk with Tesla: not much history of car manufacturing, but in the end they managed to build quite a competitive car, not just in terms of price but quality and features and so on. So I think that, more realistically, this year or maybe next year there will be several companies offering chips, and today we already have AMD with quite capable chips. I think the only thing we can consider a moat for Nvidia is CUDA.
So CUDA is a kind of firmware, if you want, for the GPU. This is what makes GPUs run. Okay, a GPU is a rock; you need to put something on this rock so that something happens inside. CUDA is what Nvidia has built; it's an API to communicate with the rock. And CUDA is not open source, so you cannot just clone it, you know, make a copy and put it on your rock. Nvidia keeps it for themselves. But catching up on CUDA is already happening, and every chip manufacturer offers their alternatives. This is probably the only reason why companies training language models still prefer to work with Nvidia: because CUDA is mature and CUDA is predictable. When you buy Nvidia equipment and you scale it to thousands or hundreds of thousands of GPUs connected together, you don't want to fail because your driver fails, with some issue that only two people on Earth can fix. You prefer something predictable. For this reason, I think, people continue to buy Nvidia. But if you want to serve models, not to train them but, you know, offer them to the end user, you can serve on anything. You don't care if during serving there was some glitch and the user lost connection. Everyone understands it's a new technology, so you will just reconnect and continue. But training and failing, and everything stops: this is a significant risk.
Harel Boren: Yes. And let's keep in mind also that CUDA is, after all, a piece of software. And as you said, rightfully, the rocks manufactured by AMD are performing very well, and ROCm is also a piece of software that seems to be working well. So eventually it will probably boil down to a software moat, maybe of a higher degree, for companies which are actually enabling faster delivery of models, or creating some other moat higher and higher up the stack, closer to the customer experience, rather than the basics, because CUDA is a piece of software. But in any case, the battle will probably be decided on software, and not on the rocks.
Dr. Andriy Burkov: That's usually what's happening. And I think that, well, again, it's hard to predict today where it will all go, but I see that now the biggest bet, probably, to create this moat is to transform your model into a code-generation engine. It's not for nothing that Google tries to beat everyone on context size. People often ask: why do I need 1 million tokens? Why would I even need 10 million tokens? It's crazy; I have no data, you know, to analyze. But they just don't think about how fast your code base grows, especially if you use LLMs to generate code. And I became a witness to this. I spent two days building an application that helps me translate my book from English to French. And those first two days, it was pure pleasure. You just say what you want and you get it instantly. Oh, I want these two panels: this one will show my English text, this one will show my French text. And I want it, you know, to render LaTeX and to render Markdown. And I want it to align words to one another using this or that machine learning algorithm. So cool: you just say give me this, and it gives you that. And after two days I had a quite capable web application with backend, frontend, authentication, database, even billing. But by then, in two days, my code base had reached about 50,000 tokens. Not a lot, 50,000, by today's standards; most models claim to support at least 100,000, and some, like Google's, support even a million. So I was like, okay, well, I still have, you know, room to grow, so I will continue to develop my application. And then I started to realize that when my code base reached about 100,000 tokens, suddenly you don't advance anymore. You spend time, you say fix this, and it provides you fixes, you integrate them, you rebuild the application, and it doesn't work. You say, okay, now I have this problem, fix this.
And again it generates you new code, again you spend time integrating it, again you rebuild. It doesn't work either. So I started feeling like I was wasting my time, and when I reached 100k, it just stopped. You start hating doing this, because you just feel like you're doing dumb work: copy, paste, copy, paste, run, doesn't work. So everyone now wants to break through this issue. They say, okay, try our model with 1 million tokens, and you can put a really huge application in it and it will provide value. And this is not a hardware issue; it's a pure software issue, a pure scientific issue: how to make sure that your transformer pays attention to a million tokens, and when you ask a specific question about your application, or ask it to fix something specific, it doesn't get lost in all those million tokens and actually provides you the right result despite you submitting so much.
Most mobile applications today don’t have more than a million tokens
You know, it doesn't need this entire million tokens to pinpoint your specific problem, but because you don't want to know where the problem is, you just give it everything: okay, this is my entire code, my JavaScript, my Python, my CSS, my HTML, my database, everything, and you figure it out. So somewhere around 100,000 tokens today, it stops; it becomes, I call it, dumb or drunk, whatever. It just doesn't give you what you want. So if anyone today, and they all want to do it, manages to provide reliable code generation beyond a million tokens, it's a big deal. Most applications, I think, that you use every day don't have more than a million tokens. So if you can at least guarantee to a software company that they can put in their entire, you know, mobile application and say, modify the way I generate this form for the user, add this additional field to it, and it will do it reliably, and you will not spend a week trying to debug because it put this field here but removed something there. So yeah, there's a significant market there, if you manage to be reliable at the 1-million-token threshold.
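A quick way to sanity-check whether a code base still fits a model's context is the common rough heuristic of about 4 characters per token. The heuristic, the file extensions, and the helper name below are illustrative assumptions, not anything from the conversation; a real tokenizer would give exact counts:

```python
import os

# Rough rule of thumb: ~4 characters per token for typical source text.
CHARS_PER_TOKEN = 4
SOURCE_EXTS = {".py", ".js", ".css", ".html", ".sql"}

def estimate_repo_tokens(root: str) -> int:
    """Walk a directory tree and estimate total tokens across source files."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1] in SOURCE_EXTS:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        total_chars += len(f.read())
                except OSError:
                    continue  # skip unreadable files
    return total_chars // CHARS_PER_TOKEN
```

On this estimate, a project approaching 400,000 characters of source is already near the 100,000-token range where the quality degradation described above sets in.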
Harel Boren: So we're back to software, and we're back to the asymptotic, not absolute, nature of diminishing returns, not only on the hardware but also on the context side. Very, very interesting, thank you.
What were the primary infrastructure challenges when working with AI models at scale
Andriy, in your experience leading AI projects at companies like Gartner, as you mentioned, and TalentNeuron, what were the primary infrastructure challenges that you faced when working with AI models at scale? I imagine you should kind of do an extrapolation from then to now, in order to update your answer. I'd love to hear your thoughts.
Dr. Andriy Burkov: Well, I don't think that then and now are significantly different. Maybe the specific numbers are different, but the trend and, you know, the nature of the problem are the same. As we said at the start, today we talked about this node with eight GPUs. You cannot put much on it, despite the fact that it's $150,000 a year. You probably will manage to train a model of 20 to 50 billion parameters, and your context size would be somewhere between 15 and 30 thousand tokens. Today we are kind of spoiled by this unprecedented display of generosity, because we say, well, with commercial models I can put hundreds of thousands of tokens in my input, and the quality of the output is this. So I want to have the same on my own equipment, because, well, I want to train my own model, but I don't want it to be less capable. I want it to be good for my business problem; this is why I bought this expensive equipment. So I will train it on my data, but I expect from it a long context, being fast, you know, and having many parameters so that the quality is there. And here you realize that it's not a lot, those eight GPUs. Why? Because the models are huge, so they take a lot of memory on the GPU. If you want to train a model above, let's say, 7 billion parameters, you need to split it among the GPUs. There are libraries for this, like DeepSpeed and others, and they do it. Then you say, okay, now I've split my model between these GPUs; now I want to train. To train, you need those GPUs to still have memory to put your inputs in, and when you train, you also need the gradients to propagate. Then you realize that the maximum context size you can actually fit on your equipment is 15,000 tokens. Not 150,000, not 1 million: 15,000. And then you see, okay, what can I fit into these 15,000? For example, you work on some industry-specific chatbot for your specific industry.
I don't know what it is: healthcare, warehousing, legal, whatever. So you want the conversation between your user and your model to last for, I don't know, three to five minutes at least. The user should be able to ask additional questions, understand that the model didn't provide the right answer, correct it, and so on. And this conversation, normally, again based on my experience, if you want your chatbot to be industry-specific, it should be trained to call different APIs or data sources that are proper to your industry.
Harel Boren: Yes.
Dr. Andriy Burkov: So in the conversation log, it's not just that the user said this and the chatbot responded that. The user said something; you took their input, you normalized it, you extracted, for example, locations, order numbers, whatever. Then, when you had extracted all these pieces, you called this API, the API returned this answer, you presented it to the user, and the user continued. So you add not just text but structured information, for example in the form of a JSON object with different fields inside, and so on. This conversation grows quite fast from one turn to another. So you realize that those 15,000 tokens, for this structured conversation, is not a lot. Then you say, okay, I spent this amount of money on this node and I cannot train a competitive model, because 15,000 tokens, everyone will hit that. Yeah, it's not enough. People will say, well, too bad, you didn't solve my problem, and now you say that I need to start from scratch. Not a good chatbot. So this is where the problem is. The problem is that today, to be competitive in the enterprise, you need access to hardware similar to the hardware used by Google, OpenAI, and others, and here we are starting to talk not about hundreds of thousands but millions, maybe even hundreds of millions, of dollars. This leap between what's achievable and what the end user expects is, for many companies, really, really hard to cross. So this is where I think the biggest challenge in the enterprise is right now.
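The memory squeeze described above can be sketched with standard rules of thumb. The 16 bytes per parameter (mixed-precision weights and gradients plus fp32 Adam optimizer states), the 80 GB per GPU, and the 20-billion-parameter model are illustrative assumptions, not figures from the speaker's setup:

```python
# Back-of-envelope: why an 8-GPU node fills up fast when training.
BYTES_PER_PARAM = 16   # assumed: fp16 weights + grads, fp32 Adam states
GPU_MEMORY_GB = 80     # assumed: one modern data-center GPU
NUM_GPUS = 8

def training_state_gb(params_billions: float) -> float:
    """Memory for weights, gradients, and optimizer states alone,
    before a single token of activations is stored."""
    return params_billions * 1e9 * BYTES_PER_PARAM / 1e9

model_gb = training_state_gb(20)        # a 20B-parameter model
node_gb = GPU_MEMORY_GB * NUM_GPUS      # total memory across the node

# Whatever is left must hold activations for the whole batch, and
# activations are what actually limit the trainable context length.
left_for_activations = node_gb - model_gb
```

Under these assumptions, the model's training state alone consumes half the node, which is consistent with the point that the context you can actually fit ends up far smaller than what commercial APIs offer.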
Harel Boren: And do you believe that this challenge is potentially addressable by fine-tuning? Merely by fine-tuning?
Dr. Andriy Burkov: I'm talking about fine-tuning, yes. But I think the problem is that if you want to offer your enterprise customers the same level of user experience, like long conversations, very fast responses, and so on, then you need access to equipment that you probably cannot afford. There are companies that offer you pay-per-token, even for training. So you can say, yes, I will upload my training data and you will train a chatbot based on my data that can talk for hundreds of thousands of tokens without, you know, reaching the limits. This kind of infrastructure is starting to emerge, and maybe this will be the answer. But so far most companies try, you know, to do something on their own equipment, and they feel the pain because they cannot really work with state-of-the-art models anymore.
Harel Boren: Yeah, got you. Definitely.
How important do you believe it is for AI practitioners to understand network infrastructure
Okay, let's kind of turn to, well, touch on another one of the issues we already talked about, and I'm speaking about your books, which clearly aim, among other objectives, to bridge the gap between theory and practice. How important do you believe it is for AI practitioners to understand the underlying infrastructure and tooling, both at the level of the hardware and at the level of the models themselves: the optimizers, the activation functions, etc.? How important do you believe that is for AI people, in addition to the mere algorithms and the models in the high-level frameworks? That's a question that bugs me, by the way, in every conversation that we have on SwarmCast.
Dr. Andriy Burkov: Well, yeah, I always answer this question the following way: you don’t need to know how it works, as long as you get what you want. So you start training with complete ignorance of how it works behind the scenes; you just put in your data set. Remember scikit-learn’s model.fit? You call fit and it fits. If it fits and you get what you want, who cares if it was logistic regression or random forest or gradient boosting? It works, that’s enough. The problem is when it doesn’t work, and unfortunately in machine learning, most of the time it doesn’t work. So you can of course again say, I don’t need to know how individual algorithms work, I just need to know that I have a choice of them. So you will just blindly switch from logistic regression to support vector machine to gradient boosting to random forest, and if it works, cool, that’s it. You’ll do it, but with neural…
Harel Boren: Yeah, a grid search on the modeling.
Dr. Andriy Burkov: Yeah, grid search on hyperparameters. So a really high-level understanding, yes. But with neural networks it’s not like this. In neural networks there is no “okay, I will switch from this neural network to that neural network.” All neural networks are the same in the sense that it’s like a spectrum. There are different architectures, like convolutional, recurrent, transformer, but within each architecture there are so many decisions that you can make in terms of how deep, how wide, skip connections, batch normalization, dropout, regularization and so on. So you cannot just define a grid search with a hundred combinations of different parameters, go to sleep, and get the results next morning. Now, with these large models, you can spend a week fine-tuning one on a relatively small data set. So if after a week it just didn’t learn anything, what do you do? You cannot say, okay, I will just tweak this small thing and wait another week. Eventually your manager or your customers will tell you, come on, where is the result? And because you have total ignorance of what you work with, you will just say, I don’t know, maybe, maybe never. So to avoid this situation, and as I said, it will most likely happen to anyone, it’s better to at least have an understanding of how information flows inside, so that when you try something else next time, this something else is practically different from what you already tried. You don’t know whether the second try will work, but at least you know that you tried something substantially different from what you tried before. And this is where I think you need to understand the mathematics of how things work.
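The blind overnight sweep described here is easy to sketch for classical models with scikit-learn. The model families and parameter grids below are arbitrary illustrations, not a recommendation; the point is that this loop is feasible when each fit takes seconds, and hopeless when one configuration trains for a week:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a labeled data set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Blindly sweep both the model family and its hyperparameters.
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [50, 200]}),
    (GradientBoostingClassifier(random_state=0), {"learning_rate": [0.05, 0.1]}),
]

best_score, best_model = 0.0, None
for model, grid in candidates:
    # 3-fold cross-validation over every grid point for this family.
    search = GridSearchCV(model, grid, cv=3).fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(type(best_model).__name__, round(best_score, 3))
```

With neural networks the same loop would multiply a week of training by every grid point, which is exactly why understanding what is inside the model replaces brute-force search.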
Because every piece of a neural network architecture is responsible for solving some mathematical issue. They didn’t add batch normalization for nothing. They didn’t add dropout for nothing. They didn’t add skip connections for nothing. They didn’t change sigmoid to ReLU for nothing, and then they didn’t invent leaky ReLU for nothing. There were always reasons why they did it: they wanted to solve some mathematical inconsistency that existed in the past. So when your model doesn’t work, when it doesn’t learn, it’s one of those issues, and you need to know about them.
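One of those mathematical issues can be seen in a couple of lines of PyTorch: a sigmoid’s gradient collapses toward zero for inputs away from the origin, starving earlier layers of learning signal, which is part of what motivated the switch to ReLU:

```python
import torch

# The gradient of sigmoid at x=5 is sigmoid(5) * (1 - sigmoid(5)),
# which is nearly zero; the gradient of ReLU at x=5 is exactly 1.
x = torch.tensor([5.0], requires_grad=True)

y = torch.sigmoid(x)
y.backward()
sigmoid_grad = x.grad.item()   # ~0.0066: the signal has nearly vanished

x.grad = None                  # reset the accumulated gradient
y = torch.relu(x)
y.backward()
relu_grad = x.grad.item()      # 1.0: the signal passes through intact

print(sigmoid_grad, relu_grad)
```

Stack twenty sigmoid layers and those near-zero factors multiply, which is the vanishing-gradient problem in one sentence.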
Harel Boren: Yeah, I can’t agree more. I can’t agree more. I myself was very intrigued for a long time about activation functions and the role they play, which is treated, you know, without much regard: okay, let’s use ReLU, everyone uses ReLU. In my opinion, activation functions have far more implication for the results that models produce than they are normally given credit for.
Dr. Andriy Burkov: Just think about initialization, the starting values of your tensors when you train something from scratch. They’re hugely important; they decide what the end values will be. So if you aren’t aware of the issue itself, the initialization, you will just ignore it. You will just take a tensor given by PyTorch, you don’t know what’s inside, you start training, and in the end it doesn’t work. You don’t even think that initialization might be the issue, because you didn’t care to learn the principles.
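A small experiment (the depth and width here are arbitrary choices for illustration) shows why this matters. PyTorch’s built-in layers do pick a sensible default, but if you build weight tensors yourself, a naive choice makes activations explode within a few layers, while a variance-aware scheme like Kaiming initialization keeps them stable:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 256)  # a batch of 64 random 256-dim inputs

def depth_10_output_std(init_fn):
    """Push the batch through 10 ReLU layers whose weights use init_fn."""
    h = x
    for _ in range(10):
        w = torch.empty(256, 256)
        init_fn(w)
        h = torch.relu(h @ w)
    return h.std().item()

# Naive: unit-variance weights -> activations blow up exponentially.
naive = depth_10_output_std(lambda w: nn.init.normal_(w, std=1.0))
# Kaiming: variance scaled for ReLU fan-in -> activations stay O(1).
kaiming = depth_10_output_std(lambda w: nn.init.kaiming_normal_(w))

print(f"naive std: {naive:.3g}, kaiming std: {kaiming:.3g}")
```

The only difference between the two runs is the starting values of the tensors, which is exactly the point: ignore initialization and the end values are decided before training even begins.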
Harel Boren: Absolutely. And I think that for the community this is a point to take to heart: the understanding of the mathematical substance that stands behind what is so simply wrapped, in the past by TensorFlow and currently by PyTorch and Hugging Face. Not understanding the basics of it results in much blood, sweat and tears spent in vain. And in many cases that understanding can actually get you to good results with far less effort and far faster.
Your very strong presence on LinkedIn and the social AI community
But I think we could go on about that forever, because it’s one of my favorite topics. If we jump for a moment to your very strong presence on LinkedIn and in the social AI community: how do you perceive the role of social platforms in shaping the discourse around AI development and deployment practices? I think I can guess the answer, but nevertheless, speaking with you, it makes me very curious.
Dr. Andriy Burkov: Well, there are good things and there are bad things, like in everything. The good thing about social media is that you can learn, you can follow the topic quite effectively, because today so many things happen. I think there was a picture circulating last year showing that on arXiv there are about 5,000 new submissions in the AI domain every week. So it’s impossible today for any reasonable human being to read those 5,000 papers every week. Even reading one a day is too hard for the brain. I did my PhD, I know: seven papers a week is your maximum, and there are 5,000 of them. Impossible. So social media kind of amplifies what’s important. Of course something can be missed by the community, but usually if there is something really groundbreaking, it starts circulating. For me this is one of the reasons why I’m very often present on social media, especially Twitter or X. People share a lot, and there are interesting conversations and so on. This is the good thing. The bad thing is that since ChatGPT, millions of people who previously had nothing to do with AI and machine learning tested ChatGPT and felt that, oh, now I understand AI: I can ask questions, it gives me answers, I can even craft very special prompts. You remember, last year everyone talked about prompt engineer being the profession of the future; no one talks about it anymore. But millions of people suddenly converted themselves from, I don’t know, blockchain expert into AI expert, and everyone has an opinion. And now there are, I don’t know what to call them, mobs, people organized around some idea. There are people who say AI will kill us all.
And unfortunately there are even scientists who now go from one TV channel to another and say that we are all doomed and our future is dark. There is another mob that says the progress shouldn’t be stopped, that no matter what, the progress should continue, it’s always better for us; on the historical scale, progress over time was beneficial for humans. And there are people who expect that next year there will be this artificial general intelligence breakthrough. So there are so many people who cannot really write any equation correctly, because they don’t even know what an equation means, and all of them suddenly are experts. So it’s hard now to have a good-quality discussion around some topic when so many people come and just leave their unsubstantiated opinions. Sometimes I try to ignore such people, but sometimes I really feel like, okay, no, this is crazy. So, as I said, social media is important as a way of getting signal from noise, but the quantity of noise is also a hundred times what it was three to five years ago.
Harel Boren: Yes, I must agree. And frankly, I find some similarities between this and how universities and the general conveyance of information have evolved in history. After all, it started with the will to put all the information in one place, because people could not otherwise access it. Then everyone could come to one place and absorb the information, which was accumulated in an organized way, with degrees set up to assure that they actually received the expertise and experience. But AI, in my opinion, has taken a different path, because it all proceeds at such a pace that there is not enough time for universities to absorb what is out there; by the time it is absorbed, it is old history. So the studying is done differently: if you want to study, just hit the road, go online and don’t sleep for 300 days. That’s an option; I went through it myself. But then it all becomes a cowboy movie, a Western, because you can absorb a lot of crap information, while the good information is very, very difficult to reach and to verify. With 5,000 papers a week, it is very difficult, not just for one human being but for the whole community, to find what’s good and absorb it. So there are similarities there, and I agree very much. I think that’s a big question mark facing us.
Dr. Andriy Burkov: yeah.
Students cheat with chatbots on university exams
And talking about universities, right now it’s a huge challenge because students cheat with those chatbots. Why would anyone write an essay? And you know, in North America education is all around essays: okay, write an essay on this topic, then on that topic. We also did this when I was young; back in Ukraine we also cheated. They asked you to write an essay, and I remember there were compact discs circulating with already-written essays on various topics. If you were assigned an essay, you looked into the collection on the compact disc, and if you were lucky you would find, not exactly the same, but something similar. So you would take this essay, copy different pieces from it, and fill in the rest yourself. In the end, yes, the result was kind of half cheated, half made by you, but at least you read something. It’s not a good education, but still something. Today with chatbots, you don’t even read it. It’s like how you don’t even read the code that an LLM generated for you: you can code an entire application without knowing anything about any of the technologies you worked with, and you didn’t learn anything during the exercise. You just got some results, and that’s it. The same with students. Now they generate the essay, and there is zero way for the professor to identify whether it was generated by a machine or not. There are services that claim to detect it, and I remember a lot of funny stories when it started, two and a half years ago with ChatGPT, when some professor received those essays, pasted them into ChatGPT and asked, did you write this? And ChatGPT responded, yes, this one was written by me. And the professor gave the lowest grade to those students, and they said, no, we didn’t cheat, we actually wrote it ourselves.
So today the professors understand that it’s impossible to detect. Even if you read it and you feel like, yes, it was written by a chatbot, you cannot prove it. The student might say, well, yes, I have this style of writing, what do you want? And that’s it. So not only is it hard to study because there is so much stuff happening, it’s also hard to verify that someone actually studied, because they just submit texts they didn’t read, and you don’t have time to argue and validate every claim inside. So I think this generation of students just feels like, oh my God, it’s so easy now to fake your way through these three or four years of university. It will be interesting to see whether universities find some solution around it, because, okay, essays are no longer a way to see that the student worked. Then what is the way? Mathematics? They will still use an LLM: solve this puzzle. Well, LLMs are great at puzzles, so what can you do? Except, you know, for the final exam. If you cheated your way through the session, you will probably fail the exam. So yeah, probably exams are the only way that remains.
Harel Boren: Yeah, maybe the right thing is just to give you an exam the day that you start, and that’s it. Or a syllabus and then an exam, and nothing beyond that.
Dr. Andriy Burkov: Yeah, why not? If you come prepared and you pass the exam, okay, you’re good.
Harel Boren: Yes, you’re good. Nevertheless, you’ll still be passing exams dealing with information that is already old by the time you’re being examined.
For enterprises looking to integrate AI into their products, what key considerations should they keep in mind
I’ll go in a completely different direction. In your opinion, for organizations that are looking to plot a healthy path in integrating AI into or with their products, what key considerations do you believe they should keep in mind regarding infrastructure and scalability? This question actually addresses a part of, I would even say, civilization that is growing by the day, because enterprises are moving into utilization of AI by the thousands. So I believe our audience would be very interested in your thoughts about that.
Dr. Andriy Burkov: Well, I think the first thing to understand, for everyone deciding to use AI in what they do, is, well, it’s obvious, everyone says it now: it is not a human. You cannot rely on AI as you could rely on a human. And the second thing is that, again, only by understanding how machine learning in general works do you understand that it cannot really go beyond the training data that was used to train the model. Previously, for example, when we worked on machine learning, we trained the model from scratch. You always started by labeling your data set, or getting some data set labeled by someone or by some previous system. So you know what your data set is, and you can say, my machine learning model trained on this data set will be capable of executing this specific task with this level of success, which you can measure: you separate the data into training and test sets, you train on the training set, then you validate on the test set, and so on. But then they invented those pre-trained models. One of the first was BERT, for example. They train this neural network on a lot of web documents, and then you take it and you fine-tune it: you take your small industry- or company-specific data set and you kind of continue training BERT until it starts responding the way you want it to respond for your specific business case. And here it was already a gray scenario, in the sense that you know it performs with this level of quality for your training set and for the test set that you put aside, but you really don’t know what else it can do, because it was pre-trained on documents you don’t know; you don’t know what was inside this pre-training data set.
But it was less of an issue at the time, because normally you could fine-tune BERT only to be a classifier, or at most to classify each input token. For example, we trained it for named entity recognition, so it says: from this token to this token, this is a location, and from this token to this token, this is a person’s name. What danger would a misclassification represent? Maybe not much. But now we again work with pre-trained models, only those models are no longer classifiers where you control the number of classes, the number of different predictions they can make. Now it’s a generative model, so it will generate almost anything in the world depending on the input. And this is where it starts getting dangerous, because you don’t know what data set it was trained on. You fine-tune it only on a small subset of what the model can generate, and then you put it in front of a user, and the user might receive outputs that weren’t in your fine-tuning data set, and you have zero chance of knowing what they might even be. For example, it might be some radical literature that was used to train the model. It might be some Stephen King novel where a machine convinces a person to kill themselves. This model was trained on these potentially dangerous kinds of situations, and you put it in front of your users and say, okay, it’s fine-tuned on my professional data set, use it. But the user might see something entirely different, and you don’t know what it might be. So the first thing for businesses to think about is that there is a significant danger of exposing your customers to something you have zero way of knowing in advance. So guardrails are super important, and I know that now some companies offer their own guardrail models, for example Meta with Llama; they started, I think, with Llama 2 or Llama 3.
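The distinction between the two risk profiles can be made concrete with a toy contrast (pure NumPy, no real model; all numbers and labels are illustrative): a fine-tuned classifier can only ever emit one of the labels you chose, while a generative model samples token after token from its entire vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

# A classifier's output space is the label set YOU defined. Worst case:
# it picks the wrong label, but never something outside this list.
labels = ["LOCATION", "PERSON", "ORDER_NUMBER", "OTHER"]
logits = rng.normal(size=len(labels))        # stand-in for model output
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over YOUR classes
prediction = labels[int(np.argmax(probs))]

# A generative model's output space is every string its vocabulary can
# spell, including text from pre-training data you never inspected.
vocab_size = 50_000
generated = [int(rng.integers(vocab_size)) for _ in range(20)]

print(prediction, "vs", len(generated), "free-form tokens")
```

Bounding the first output space is a design decision you make once; bounding the second is what guardrail models attempt after the fact, which is why they matter so much more for generative deployments.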
When they release their generative model, they also release kind of guardrail models. Those models can validate whether the input respects certain standards, and whether the output does too. But again: who trained those guardrail models, on what data, based on what optimality criteria? You don’t…
Harel Boren: And this is recursion.
Dr. Andriy Burkov: You don’t know. So you kind of have to believe that Mark Zuckerberg thought about it well and the model can be trusted. I wouldn’t trust it; I would define my own guardrails. So my first and foremost recommendation for anyone thinking for the first time, okay, let’s make our business AI-first, is: think about whether you can actually protect your business and your customers from the sequences of words those models can generate, when you normally have zero knowledge and zero ways to predict what those sequences might in reality become.
Harel Boren: In addition to guardrails, would evaluation, or batch inference so to speak, while you are training on your data set, do you believe that this will have, or may have, a positive effect on the actual final result?
Dr. Andriy Burkov: It’s really hard to tell today. Today it’s a wild, wild west in all of this.
The difficulty with AI is that it’s a tool, okay
One other thing that I often tell people who ask me how to better use those language models…
Dr. Andriy Burkov: The thing is, the difficulty is that AI is a tool. It has always been a tool. It will not think on its own. They call them agents, and they think they will replace people, but in the end, with this agent, you still need to provide it some desires, and it might accomplish them or not, no one knows; at least you need to be the one who says, your goal is X. But AI, given that it’s a tool, can be used in different scenarios, and the problem with AI, especially generative AI, is that we don’t know where exactly this tool will be beneficial, where it will be a waste of time, and where it will actually be dangerous. For example, you have a hammer, okay, and you buy it in a hardware store. With the hammer, you cannot do much besides what it was invented to do: you can hammer nails, and maybe if you use it wrong, you can break a glass or a window with it. That’s it. Or you take a screwdriver: well, it turns screws. You can also, you know, kill someone with it, but we are so fragile, we can be killed with basically anything. So again, you can predict what to expect from a hammer and what to expect from a screwdriver. When they give you AI, you don’t know what it’s for. Anything you put into it results in some output, and only by inputting something do you see what kind of output you can get. So maybe, as a business owner, you think: my clients will ask this, and the output should be that; my clients can ask this, and the output should be that. But what if your clients ask something you didn’t think about, and the AI responds with something you didn’t predict? For example, there was a case last year with Air Canada, the Canadian airline, where a customer asked about a discount.
He had to travel to visit family because a family member had died, and he asked: because of this, can I get a discount? And the model said, yeah, sure, for these cases we have a discount, so buy at the full price and then we will refund you, reimburse you. And he was like, cool. Then he comes back and tries to actually request this reimbursement, and the real person says, no, we don’t have anything like this, so no. And he says, no, your chatbot told me. He hired a lawyer, and the lawyer won the case, because, well, what was said by the corporate chatbot is what was said by the corporation; what else could it be? This time it cost the company about $800, not a big deal. But what if the chatbot said, you and your entire family and your entire city can now fly for free, forever? What will you do if your official website made this official promise to the customer?
Harel Boren: Yes, yeah. And this also raises ethical questions, but if we enter that alley, we’re not going to get out of it for a long time. So, okay, thank you for this, and for our audience.
Google has a very good, I would say, information bubble algorithm
In addition to your fantastic books, which I mentioned, what would be the top resources that you’d consider a must for professionals wishing to maintain a healthy flow of useful information for their day-to-day challenges and objectives, and who cannot read 5,000 papers every week? What would be your recommendations?
Dr. Andriy Burkov: Well, I can only say how I do it, because probably everyone has their own recipe, and there is no one source of information where I always get what’s important. There is the Reddit machine learning community, and there is also a very active community about language models; I don’t remember the exact name. LocalLLaMA, I think. Yeah, LocalLLaMA. It’s one of the most active communities right now about everything LLMs; interesting papers are shared there, almost all important breakthroughs. So I regularly visit these two communities on Reddit. I also recommend X, because X, or Twitter in the past, has a very good, I would say, information bubble algorithm. Basically it sees what posts interest you, when you like or comment, and it also sees what you post about, and it creates a home page specifically designed for you, based on your interests. So when I come to X a couple of times a day, I scroll through my home feed and see the most important posts on this topic. Today, without this information bubble, there is just too much to consume. So I told myself I will keep my bubble healthy: I will not engage in political conversations, because if you start engaging, your whole home feed will be about politics. If you really want to stay focused on some topic, maybe you can create a separate account: have an account about AI, have an account about your personal interests. And also, I really like, and I find it unfortunate that it only exists on mobile, that when you open the mobile version of the Chrome browser with no tabs open, it shows you news and articles on the topics that it believes will interest you.
And again, because Google knows so much about you, what kind of articles you read, what kind of websites you visit, I find this suggested news in mobile Chrome super, super helpful. I don’t know why Google decided to keep it only for mobile Chrome; you cannot find this feed of news anywhere else. I tried; there is Google News, and Google News has a For You tab, but it’s not the same algorithm. I really don’t like For You on Google News, but I really like the news articles in Chrome on mobile. So this is how I follow the news. I don’t follow anyone specifically, because I think every person is limited to something, and I want to keep my eyes open to different things. So yeah, I really trust some algorithms that I find useful.
Harel Boren: Well, thank you very much. This was very, very helpful for me as well, so really, thank you.
Andriy Burkov: Where can our audience follow your work
As a closing remark, where can our audience follow your work: on social media, on blogs, recent publications? If you can share that with us, it would definitely be of interest for the audience.
Dr. Andriy Burkov: Yeah, well, most of the time I’m in two places. On LinkedIn, you can find me by typing my name, Andriy Burkov, into the search. And I’m also often on Twitter, or X. The difference is that on X I just share things that come to mind, and sometimes things come to mind while I’m working on something. So I work on something and I’m like, oh, wow, this is interesting, and I post it right away on X. So if you want to subscribe to a kind of stream of consciousness from me, it’s X. On LinkedIn, I try to share something of value most of the time: some new information or some idea that is actually substantial, in my opinion. I also have newsletters, both on LinkedIn and on Substack. Those newsletters are weekly, and I usually share 10 links to what I feel was the most important information I consumed during the week.
Harel Boren: Wonderful. Thank you very much for that.
Are there any upcoming projects, things we have to keep an eye on
Are there any upcoming projects, things we have to be on our toes for, collaborations, initiatives, things you’d like to share with us before we part?
Dr. Andriy Burkov: Well, I have one large project that is going on this entire year: I’m working on my new book on reinforcement learning. Reinforcement learning has been a big deal since DeepSeek released their R1 model, where they showed that they used reinforcement learning to radically improve the capability of their model on mathematics, logic and coding. I think that by the end of this year, most of the companies who offer training or fine-tuning of models will also support reinforcement learning as a training mode, because of the high potential of reinforcement learning for any domain where you have verifiable results, like mathematics, code and so on. So I’ve been working on the book. And also this week, maybe early next week, I don’t know when this episode of the podcast will be out, I’m preparing a blog post on how to train a language model to become a business-specific classifier, entirely from scratch. You work with your own taxonomy, you label the data using language models, and then you fine-tune a language model using this data.
Harel Boren: Well, this is something worth waiting for, something worthwhile. Do you have an expectation of timing? Is that two weeks from now?
Dr. Andriy Burkov: No, I think either over the weekend or maybe on Monday or Tuesday I will be ready.
Harel Boren: Well, I can say that, with the rest of the community, those who know and those who don’t, I’m certainly looking forward to that. Andriy, it has been a fascinating time to sit together with you and mull over some of these topics in AI. I think we could carry this discussion on for another three hours without sensing that the time has passed. It certainly has been very enjoyable for me, so thank you very much for your time, your insights and the information that you shared with us. Thank you very, very much.
Dr. Andriy Burkov: Thank you, Harel. For me it was also a pleasure to talk to you, and I’ll be happy to come back anytime.
Harel Boren: Well, you’ll certainly be invited; I’m taking a rain check on that. Thank you very much!