OpenAI announces new GPT-4o Large Language Model (LLM)
The interestingly named GPT-4o ('o' for 'omni') model can be considered in a couple of different ways, depending on perspective.
First, as an enhancement of the current, mostly text-based GPT-4 model. Here are the key points relevant to this perspective:
- This cements OpenAI's lead in the LLM race. The model's benchmarks show small improvements in general reasoning, good improvements in math understanding, and significant improvements in coding. While the leap in capabilities is not as stunning as the one from GPT-3.5 to GPT-4, this is still a solid across-the-board upgrade from the existing GPT-4 Turbo model and puts GPT-4o ahead of its closed and open-source competitors. And while I would have liked to see an increase to the existing 128K context window, which now seems small compared to Google Gemini's 2M, this remains one of the best (arguably, the best) overall LLM packages available right now.
- What makes this upgrade particularly noticeable, however, is its speed: the model is dramatically faster than its predecessor (OpenAI quotes roughly twice the speed of GPT-4 Turbo), and results stream back faster than most people can read them. Getting output of this quality at such speed is a difference maker for many use cases in the general consumer space, the development experience, and the enablement of fast, near real-time agentic workflows and assistant capabilities (see the streaming sketch after this list).
- The cost to use the model through the API is also reduced by 50% relative to GPT-4 Turbo, a noticeable improvement that is definitely appreciated by GenAI builders everywhere. While still roughly 10x more expensive than GPT-3.5, the reduction in cost, combined with the model's speed, should reduce the need to fall back on lesser models purely to save money, something particularly important in complex workflow use cases.
- A significant part of this announcement is the availability of the model and its complementary custom GPT capabilities on the free tier of ChatGPT. While clearly at least partially inspired by the free availability of the high-quality Claude Sonnet model, this is the first time that the most powerful foundation model, together with access to custom GPTs, is available to the general public without a subscription. This should facilitate wider adoption and create a significant barrier to entry for competitors. I believe this is the right approach and welcome the democratization of AI access for the masses.
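As a concrete illustration of the speed and cost points above, here is a minimal sketch of swapping an existing GPT-4 Turbo call to GPT-4o through the OpenAI Python SDK, with streaming enabled so tokens render as they arrive. The system message and prompt are illustrative placeholders, and an OPENAI_API_KEY environment variable is assumed.

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",  # drop-in replacement for "gpt-4-turbo"
    messages=[
        {"role": "system", "content": "You are a concise research assistant."},
        {"role": "user", "content": "Summarise the key risks in this earnings call transcript: ..."},
    ],
    stream=True,  # stream tokens back for low perceived latency
)

# Print tokens as they arrive instead of waiting for the full completion.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```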
Second, it can be considered the dawn of a new age of multi-modal, single-pipeline models from OpenAI, and a potential way station on the path to GPT-5. Key points around this include:
- A different multi-modal architecture for the model, allowing it to seamlessly deal with any combination of text, voice, image and video. While we do not have all the technical details, OpenAI has done an admirable job of describing in layperson's terms the previous three-stage pipeline (audio-to-text -> text-to-text -> text-to-audio) and contrasting it with the simpler, single-stage (audio -> audio) way GPT-4o now deals with the same inputs (see the pipeline sketch after this list).
- OpenAI seems to have done tremendous work on live latency for audio and video inputs, audio manipulation, and general multi-modal expressiveness, including emotional nuance. The significance of this should not be underestimated – we are quickly moving from text-based input / output to a generalized, multi-sensory AI experience that will potentially have profound effects on society as it gets incorporated into phones and other digital devices. The OpenAI demos around this are instructive, with significant emphasis on education, spatial recognition, conversational behavior and other characteristics that make the modest near-live translation capabilities feel like nothing more than a useful afterthought.
- These new capabilities, importantly, are not yet fully available for general use. Only text and image inputs and text outputs are supported at the moment. So, while we got cool demos and a fantastic glimpse into the future, these capabilities still clearly have some time to bake before they are considered safe and aligned enough to be released to everyone.
- It will be interesting to see if OpenAI can continue progress on reasoning and general AI capabilities or whether it is approaching a plateau, allowing its competitors to catch up. Even if progress slows down on this front, the new multi-modal architecture should enable a blend of new capabilities and experiences that go way beyond what we are getting now on the mostly text-based capabilities. Whether that alone is enough to get us to the ‘mythical’ GPT-5 leap in performance remains to be seen, however.
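To make the architectural contrast concrete, the sketch below reproduces the legacy three-stage voice pipeline using the OpenAI Python SDK. GPT-4o is designed to collapse these stages into a single audio-in / audio-out model, but that interface is not yet exposed through the public API, so only the multi-stage version can be shown here; file names and the prompt are illustrative.

```python
from openai import OpenAI

client = OpenAI()

# Stage 1: audio -> text (speech recognition with Whisper)
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# Stage 2: text -> text (reasoning with the chat model)
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)

# Stage 3: text -> audio (speech synthesis)
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
with open("answer.mp3", "wb") as out:
    out.write(speech.content)

# GPT-4o's single-pipeline design replaces all three stages with one model call that
# consumes and produces audio directly, preserving tone, emotion and timing that the
# transcription step above throws away.
```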
Overall, this is a very strong release with promises and hints of even greater capabilities later this year. We discuss ways companies can take advantage of these new enhancements later in this post.
Google announces enhancements to its Gemini and Gemma LLMs
A day after OpenAI, Google announced important enhancements to its current family of Gemini and Gemma LLMs as part of its wider Google I/O conference.
The key model enhancements, along with some interesting additions to its GenAI portfolio, merit mention:
- The Gemini 1.5 Pro model has seen enhancements in its reasoning, coding, audio and video capabilities that keep it competitive with GPT-4 (although perhaps not with GPT-4o). All are welcome additions, and the availability of the model for free on Google AI Studio and through the API is a significant differentiator for builders wishing to experiment.
- Improving on its already impressive 1M context window, Google has also made a 2M context window available in limited preview for Gemini 1.5 Pro, enabling the consumption of amounts of textual, audio and video information well beyond what its competitors can offer. This is arguably the biggest differentiator between the Google and OpenAI offerings, and combined with the currently available audio and video inputs (which, as stated above, are still missing from GPT-4o) should create significant opportunities for Google in use cases where large-context, multi-modal capabilities are desired.
- A new, smaller Gemini Flash model is also now available, targeted at use cases that require fast responses and low latency. The model can also be accessed for free through Google AI Studio and shares Gemini 1.5 Pro's large context window and audio and video capabilities. It is probably not going to be competitive with GPT-4o, which is at least as fast and considerably more capable, but its multi-modal capabilities and large context window, combined with its small size and reasonable cost, should see it serve niche use cases well (a usage sketch follows this list).
- Google also announced the future availability of a new version 2 of its Gemma family of smaller LLMs, claiming comparable performance and latency to models of about twice the size.
- Of particular interest to developers should be Firebase GenKit, a GenAI development framework in the style of LangChain and Microsoft’s Semantic Kernel, with good integrations into the Google ecosystem and with the admirable aim of simplifying the development experience for the expanding Google AI portfolio.
- Additional capabilities that Google announced include various other fine-tuned models for specialized vision tasks, Gemini Live as a low-latency multi-modal conversational model in the same general area of capabilities as the demos we have seen from GPT-4o, and GenAI co-pilot capabilities in Google Workspace apps to compete with Microsoft Office Copilot.
- It is also useful to note that the Gemini Ultra model is nowhere to be seen in these announcements, potentially signifying Google’s emphasis on mid-level, fast models over the absolute largest, ‘best’ variants.
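As a usage sketch for the Gemini Flash point above, the snippet below sends an uploaded audio file to Gemini 1.5 Flash through the google-generativeai Python SDK, using a free API key from Google AI Studio. The file name is illustrative and the exact model identifier may differ slightly by release.

```python
import time

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # free key from Google AI Studio

# Gemini 1.5 Flash: small and fast, but still multi-modal with a large context window.
model = genai.GenerativeModel("gemini-1.5-flash")

# Upload an audio recording through the File API and wait for it to finish processing.
audio = genai.upload_file("analyst_call.mp3")
while audio.state.name == "PROCESSING":
    time.sleep(2)
    audio = genai.get_file(audio.name)

# Mix the audio file and a text instruction in a single multi-modal prompt.
response = model.generate_content(
    [audio, "Summarise the key points discussed and flag any forward-looking statements."]
)
print(response.text)
```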
Overall, this is a good update from Google, solidifying its lead in several categories, catching up in several others, and taking some steps towards unifying and converging its vast AI offerings into a more coherent set.
Ilya Sutskever leaves OpenAI
While we are not in the habit of discussing corporate politics, we should at least mention the departure of OpenAI's chief scientist, Ilya Sutskever, a day after the announcement of GPT-4o. Widely considered one of the smartest AI researchers in the world, he could choose to go to a competitor (Meta, Google or Anthropic perhaps), launch his own startup, write a book, and / or pursue many other interests. OpenAI will definitely miss his contributions and his emphasis on safety and alignment – but how this affects both the day-to-day and the future direction of the company will take some time to crystallize.
What does this all mean?
As always with GenAI, go away for one week and your knowledge becomes stale. Continuing the rapid pace of advancement, this week’s developments point to increased opportunities and some concerns for the enterprise.
We highlight the most important ones below:
- There are opportunities to use the cheaper, faster GPT-4o model now for applications where it was previously cost-prohibitive. Enterprises should look especially at agentic workflow capabilities that may have been implemented with lesser models and consider whether to replace those with the faster, more intelligent, and potentially cheaper new model. Equally, this is a very good time to experiment further and implement agent-based tools with the new model, given its increased speed and its improvements in reasoning and instruction following.
- Similarly, GPT-3.5 should probably be phased out of enterprise and product use. When consumers can have a better experience for free using GPT-4o, developing experiences and products on a clearly inferior model is probably unwise, unless cost is a significant factor and GPT-3.5's reduced capabilities are acceptable.
- Claude Sonnet suddenly looks expensive for the capabilities it offers, when GPT-4o is significantly better and not that much more expensive. Even Claude Opus, which was already more expensive than GPT-4, now feels somewhat overpriced for its capabilities.
- The multi-modal capabilities of GPT-4o, while not yet generally available, point to a potential paradigm shift in how users expect to interact with GenAI. Simple chatbot interfaces may no longer be enough, and video, audio and image-based experiences will have to be thought through properly and implemented. Do not just think in terms of text inputs and pipelines anymore – start incorporating multi-sensory ways of presenting information to your users. The role of product and UX experts will also need to become more central in GenAI initiatives, as the need to move beyond the simple text box starts to dominate the conversation.
- While GPT-4o is clearly a fantastic product, do not underestimate the Google suite of models. The Gemini models have significantly better multi-modal capabilities available right now – and we do not yet know how well the OpenAI demos translate to reality. Google's models can prove a great stepping stone for experimenting with multi-modality until OpenAI fully releases these capabilities and we can make more equal comparisons.
- The large context window in Gemini 1.5 Pro should reduce and, in some cases, eliminate the need for Retrieval Augmented Generation (RAG) techniques. Why create vector embeddings and run similarity searches to ground the model when so much more information can be reliably and easily supplied directly to the LLM through the prompt? Enterprises should experiment with the larger context window, and use it not just with text inputs but also with multi-modal inputs where appropriate (see the first sketch after this list). This is also an area where Google is clearly ahead of OpenAI, and that lead should not be underestimated.
- The Gemma 2 family of models (when released) will be a great candidate for local inferencing and experimentation within the secure confines of the enterprise. The Meta Llama 3 family is probably its only serious competitor, but if Google's estimates hold, the Gemma models will be faster and cheaper to run on a typical enterprise developer workstation or low-powered server, democratizing GenAI within the enterprise.
- We have encountered subtle shifts in the behavior of the GPT-4o model when using existing prompts. We highly recommend rerunning your eval metrics and traces to quantify this as part of your transition due diligence, paying special attention to agent and structured-output workflows (see the second sketch after this list).
- Both OpenAI and Google are putting pressure on Amazon Bedrock (and, by association, the Anthropic suite of models) by releasing powerful models that are not available in the service. Expect Anthropic to have something in the works to counter these developments, but for now the Opus family of models seems somewhat less compelling than before.
- All of the above items are of particular importance to the FSI community. The promised future audio and video capabilities of GPT-4o, combined with its improved logical reasoning and lower cost, could enable new use cases for both the buy-side and the sell-side. Some examples that only scratch the surface of what is now possible include automating and optimizing text-based workflows, enhancing investment research, and assisting with trading strategy selection, market news analysis, portfolio optimization, marketing, research report automation, individualized product recommendations, performance attribution, stress testing and much more.
- Finally, the emotionally aware conversations with AI that are about to become reality should enable previously unimagined ways of interacting with clients and putting firms' branding and messaging across in a significantly more effective manner. And we have not even touched on how these capabilities will affect the day-to-day work of FSI employees and their connection to work and everyday life. A follow-up post will explore these aspects in more detail.
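First, a minimal sketch of the long-context approach mentioned above: rather than building a RAG pipeline, whole documents are placed directly in a Gemini 1.5 Pro prompt through the google-generativeai Python SDK. The file names, the question and the model identifier are illustrative, and the documents are assumed to fit within the context window.

```python
from pathlib import Path

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Concatenate the source documents verbatim: no chunking, embeddings or vector search.
corpus = "\n\n".join(
    Path(name).read_text() for name in ["annual_report_2023.txt", "q1_2024_filing.txt"]
)

response = model.generate_content(
    "Using only the documents below, list the main changes in risk disclosures "
    "between the 2023 annual report and the Q1 2024 filing.\n\n" + corpus
)
print(response.text)
```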
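Second, a lightweight sketch of the regression check recommended above: the same eval cases are run against GPT-4 Turbo and GPT-4o and the pass rates compared. The eval case and the naive substring pass criterion are placeholders; in practice you would plug in your existing eval suite, traces and scoring logic.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative eval cases; replace with your existing prompts and expected outputs.
EVAL_CASES = [
    {
        "prompt": "Extract the coupon rate from: 'The notes bear interest at 4.25% per annum.'",
        "expected": "4.25%",
    },
]

def run_case(model: str, case: dict) -> bool:
    """Run one eval case against a model and apply a naive substring pass criterion."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": case["prompt"]}],
        temperature=0,
    )
    answer = response.choices[0].message.content or ""
    return case["expected"] in answer

# Compare pass rates between the old and new models before switching over.
for model_name in ("gpt-4-turbo", "gpt-4o"):
    passed = sum(run_case(model_name, case) for case in EVAL_CASES)
    print(f"{model_name}: {passed}/{len(EVAL_CASES)} cases passed")
```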
How can Lab49 help?
We are a passionate group of GenAI and Data experts with combined decades of experience in the Data, ML and AI fields. We actively research the latest developments in the space and have a strong interest in progressing the GenAI story on behalf of our clients. We love GenAI – and want to help you take advantage of what we believe to be the greatest technology disruption since the widespread adoption of the Internet and the smartphone.
We work with all major cloud and LLM providers and bring our financial services expertise and understanding to bear, helping translate the rapid pace of GenAI technology development into actionable, productized experiences for our clients. Some of our major GenAI technology capabilities include deep expertise with the major LLM families, including OpenAI GPT-4, Anthropic Opus, Google Gemini and Meta Llama. We are partners with Azure and AWS and have deep expertise in Azure AI, Amazon Bedrock and Hugging Face, among many other GenAI service providers. We also have deep expertise in the GenAI software ecosystem, including software development frameworks like LangChain and LlamaIndex, vector databases like Pinecone and Chroma, and embedding models ranging from Cohere to Nomic, and have developed several specialized solutions with these and other tools in the financial industry.
Let's talk – we can help you:
- understand what needs to change in your existing GenAI implementations to take advantage of these new capabilities;
- navigate the GenAI landscape with our hands-on engineering experience;
- create compelling UX for all your GenAI stories;
- design and implement a Responsible and Ethical GenAI strategy for you and your clients;
- fine-tune foundation models on your own data;
- design and implement GenAI Proofs-of-Concept (POCs);
- move POCs to production with instrumentation, tracing, and monitoring;
- and create new products and capabilities for your internal and external stakeholders.