OpenAI’s o1 model, Polish Bielik 2 and other updates! | AI News

OpenAI surprises with the o1 model that “thinks”, while the Polish Bielik 2 analyses news in real time. Explore the latest developments from the world of AI!

Good morning! We have just published the latest episode of our podcast! You can watch the video version here:

You can also listen to it on Spotify.

You can read the text version below. Enjoy!

Ziemowit Buchalski: Hello, and welcome to the Beyond channel! As usual, we’ll talk about what interesting things have happened recently in the field of artificial intelligence.

Chat o1-Preview

Jan Twardowski: Hello! So what has happened? In the previous news roundup, you and Michał mentioned that a new model from OpenAI was supposedly on the way. And you were right! A model called o1 has been released. It comes in a few versions: the one currently accessible is o1-Preview, a slightly stripped-down variant of the full o1, which has not yet been published.

There is also o1-mini, a smaller version with fewer capabilities. But what can this model do? Many people have already tested it, and so have we: it performs significantly better on any task that requires some kind of thinking process.

Ziemek: Exactly, what are the assumptions? Why is this such a breakthrough? Why is this interesting and potentially genuinely new?

Chain of Thought in the o1 Model

Janek: Is it a breakthrough? I don't know. It is also difficult to say exactly what OpenAI changed in its new model. Much evidence suggests that what we are observing resembles something called "Chain of Thought"—a mode of operation in which the model tries to construct a chain of thoughts to execute a task.

When we ask a question in the standard chat, we get an answer immediately. In the new o1-Preview, the word "Thinking" appears first. The model really does seem to think, trying to devise an action plan for solving the problem. Thanks to this, it performs significantly better on tasks such as mathematics, physics, and programming. In creative tasks, such as writing, and in benchmarks of that kind, the results are comparable to GPT-4o, and sometimes even slightly weaker.

However, everyone noticed a difference in mathematical tasks—the model performs much better. Even Professor Dragan stated that we have finally seen artificial intelligence that can truly do something more. He compared it to the concept of so-called "System 1 Thinking," which is intuitive thinking that works on autopilot, in contrast to "System 2 Thinking," which requires planning and conscious information processing. This is exactly how this model tries to operate.

The model can now handle tasks that previously caused it problems, including ones that would challenge maths students. It is not as fast as its predecessor: GPT-4o generated answers within about two seconds, while o1-Preview needs around 30 seconds. The delay comes from the greater computational resources spent working on a task, which also makes using the model via the API more expensive.

Users of ChatGPT Plus also face limitations: they can ask only 30 questions per week, so it is easy to hit the limit and then wait seven days for the next opportunity to use the model. Even so, the Preview version shows enormous potential, although the full capabilities of o1 are not yet available.
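For readers who want to try the difference described above themselves, here is a minimal sketch: with older models the "chain of thought" has to be requested explicitly in the prompt, while o1-Preview runs its planning phase internally. It assumes the official OpenAI Python SDK and an API key with access to both models; the question itself is just an example.

```python
# Minimal contrast between manual Chain-of-Thought prompting and o1-preview's
# built-in reasoning. Requires `pip install openai` and OPENAI_API_KEY in the
# environment; model availability depends on your account.
from openai import OpenAI

client = OpenAI()

question = "A train travels 120 km in 1.5 hours, then 80 km in 1 hour. What is its average speed?"

# Older models: we ask for the reasoning steps explicitly in the prompt.
manual_cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Solve step by step, then give the final answer."},
        {"role": "user", "content": question},
    ],
)

# o1-preview: the planning ("thinking") phase happens internally before the
# answer, so a plain question is enough.
reasoned = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": question}],
)

print(manual_cot.choices[0].message.content)
print(reasoned.choices[0].message.content)
```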

Testing the o1 Model from OpenAI

Ziemek: I tested the model on two tasks. The first came from extended mathematics for the second year of high school—a quadratic function task. The model handled it very well, correctly carrying out the proof. Moreover, I asked it not to use Vieta's formulas, and it understood this and solved the problem using another, equally correct method. This was impressive because it performed better than most students who are not in extended math classes.

The second task was more unusual—a riddle about a suitcase full of money that several people passed around to each other. The model presented a dozen logical steps, analyzing how the sum of money changed. Surprisingly, at some point, it decided to check an alternative scenario: "What would happen if these people went by bicycle?" Of course, this had nothing to do with the original task, but it was a creative attempt to check a different path. This resembled a brainstorming session that the model conducted with itself, which we would have previously called a "hallucination." However, here it was part of creative problem-solving, and the final answer was correct.

A Task That o1 Cannot Handle

Despite these successes, I managed to find a task that even o1-Preview couldn't handle. It was about writing out all the numbers from 1 to 100 in alphabetical order in Polish. No model can handle this yet, so we are waiting for the full version of o1. It is worth collecting such tasks as tests for future models.
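As a side note, the reference answer for this test is easy to generate in code, which makes it convenient to check future models against. A small sketch, assuming the num2words library (which supports Polish) is installed; plain sorted() uses Unicode ordering, so strict Polish collation would additionally need locale-aware sorting.

```python
# Build the reference solution: spell out 1-100 in Polish and sort alphabetically.
# Requires `pip install num2words`. sorted() applies plain Unicode ordering;
# strict Polish collation would need locale.strxfrm or PyICU.
from num2words import num2words

words = [num2words(n, lang="pl") for n in range(1, 101)]
for word in sorted(words):
    print(word)
```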

Janek: An interesting feature of the new model is the ability to preview the planning phase—we see how the model outlines the steps it intends to take. This suggests that a new architecture or different training data is hidden under the hood. It looks like OpenAI has combined the Chain of Thought with an earlier planning stage. Similar effects could be achieved in older models using more complicated prompts, but now you just need to enter the question, wait a little longer, and the result is significantly better.

Ziemek: Well, yes, but I understand that OpenAI is not the only entity that has released something new. Maybe something interesting was happening in our Polish backyard too?

Polish LLM Bielik in Version Two

Janek: Yes, you're right—interesting initiatives related to AI models are also appearing on the Polish market. It is worth mentioning the Polish model Bielik, which recently received its second version.

The new version of Bielik is a significant step forward compared to the previous one. First of all, the model is much larger—it has more parameters, which means it has a greater ability to process information and understand language. Additionally, it has been trained on a much larger data set, which is also necessary for larger models. The context window size has also been changed, which affects how much data the model can analyze simultaneously.

One important novelty is that Bielik 2 has been made available in a more accessible form, so it no longer has to be run locally. Previously the model could only be tried out as a demo on the Hugging Face platform; now it has its own website where it can be tested freely. This is a great convenience for users who want to check the model's capabilities without having advanced technical infrastructure. You will find the link to this page below this article or the video.

The Newsroom Feature in Bielik

The model performs quite well, especially in text analysis tasks, such as classification or content evaluation for the presence of specific elements. The interface provided by Bielik includes a function called Newsroom, which is new in the context of this model's operation. This tool largely resembles systems such as RAG (Retrieval-Augmented Generation). It allows the model to search the internet—more specifically, news—for the latest information.

Newsroom enables asking questions about the latest events, e.g., from yesterday, and the model is not limited only to the data on which it was previously trained. Instead, it accesses a news database or searches the internet to provide an up-to-date answer based on the latest data. This is a huge step forward because many AI models have limitations related to the recency of knowledge—Bielik, thanks to this function, is able to provide more up-to-date answers, which is especially important in the context of changing information.

The whole thing works on a principle similar to RAG, where the model processes new data, understands the question, and generates an answer using the latest information. This is not static knowledge that the model has saved in its dataset but dynamic searching and processing of data in real-time.

It is more of an architectural approach in which the necessary data can simply be injected on top of the model's own knowledge. I don't know whether the articles are retrieved in advance or at the moment the query is submitted, but some mechanism of this kind is there. The interface does not look like a final production version; it is meant to showcase the model and let people test it without having to deploy it themselves. This is probably also a lesson learned: when the first version was exposed with limited resources, the model was criticised for its quality, even though it was only a demo rather than a final product, and people did not know how to test it properly. Now they have gone a step further.
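To make the RAG pattern described above more concrete, here is a schematic sketch. It is not Bielik's actual Newsroom implementation (that code is not public); the retriever is a stub, and the model name and prompt wording are only examples using the OpenAI Python SDK.

```python
# Schematic RAG flow: retrieve fresh snippets, inject them into the prompt,
# and let the model answer only from that context. The retriever below is a
# stand-in for a real news index or web-search call.
from openai import OpenAI

client = OpenAI()

def retrieve_news(query: str) -> list[str]:
    # Hypothetical retriever: in a real system this would query a news database
    # or a search API and return the most relevant recent articles.
    return [
        "Example snippet 1: short summary of a news article from yesterday.",
        "Example snippet 2: short summary of another recent article.",
    ]

def answer_with_news(question: str) -> str:
    context = "\n".join(f"- {s}" for s in retrieve_news(question))
    prompt = (
        "Answer the question using only the news snippets below. "
        "If they are not sufficient, say so.\n\n"
        f"News snippets:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer_with_news("What happened in AI yesterday?"))
```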

New Google Pro and Flash Experimental Models

Google is also introducing changes, because OpenAI is not the only one shipping updates. Google operates more discreetly: for some time it has been offering the Gemini models in two variants, Pro and Flash. Pro is more advanced, Flash is faster and simpler, and in most cases Flash is sufficient. Both are now also available in experimental versions, meaning versions that may change or disappear at any time. Google provides them without recommending production use and does not guarantee that a given model will remain available.

Compared to their non-experimental counterparts, these models behave differently. Tests show that even a smaller, less elaborate prompt can push the model into more complex behaviour, giving effects similar to the o1 model, with more accurate reasoning and analysis. When a detailed analysis of a report and drawing conclusions from it is needed, the experimental model can produce a much more elaborate answer from a simple prompt.

I don't know whether this is a response to o1-Preview. Probably not, or the connection is simply less visible because it is a new model released as an experimental version. For now, however, it is free. If the experimental line is maintained, it will probably later be offered at a price similar to the Flash models and become the next version of that model, because the field we are talking about changes very quickly. I hope the information we are presenting now will still be up to date on the day this episode is published. We are keeping our fingers crossed that the properties of the experimental model are preserved and that it can still be used for free.
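If you want to check which experimental Gemini variants are currently exposed and try one out, a minimal sketch with the google-generativeai package looks roughly like this; the model identifier is only an example and changes often, so list the available models first.

```python
# List available Gemini models and call an experimental one.
# Requires `pip install google-generativeai` and a Google AI Studio API key.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Experimental identifiers rotate frequently, so print what is available right now.
for m in genai.list_models():
    if "exp" in m.name:
        print(m.name)

# Example identifier only; substitute one of the names printed above.
model = genai.GenerativeModel("gemini-1.5-pro-exp-0827")
response = model.generate_content(
    "Analyse this report and list three conclusions: <report text here>"
)
print(response.text)
```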

Recently, during conference talks, I encountered the opinion that OpenAI and Google have chosen different paths. Google focused on a large context window, already offering a two-million-token window, while OpenAI focused more on reasoning and the thought process, i.e., on more advanced strategies. It is unclear whether they compete in everything or if everyone has gone in a different direction with their own strategy. We will see what the future brings.

Custom GPTs from Gemini – Gems

Similar to how OpenAI offered Custom GPTs, which were models packaged with our prompts that allowed us to create an assistant tailored to our needs, Google introduced something called "Gems" in the Gemini interface. This allows you to select a model, add a custom prompt to it, and talk to an assistant that has the provided prompt in its memory.

Ziemek: The difference is that Gems from Gemini cannot be enriched with other data sources—everything must be contained in the prompts. In Custom GPTs, you can add Word, PDF, and Excel files, and the knowledge from them will also be used.
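Gems themselves are configured in the Gemini web interface, but the underlying idea, a model plus a fixed custom prompt, can be reproduced through the API with a system instruction. A small sketch, again using the google-generativeai package; the tutor prompt is just an example.

```python
# A "Gem"-like assistant approximated via the API: the custom prompt is attached
# as a system instruction and persists across the whole chat.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

assistant = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction="You are a patient high-school maths tutor. Explain every step.",
)

chat = assistant.start_chat()
reply = chat.send_message("How do I solve x^2 - 5x + 6 = 0?")
print(reply.text)
```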

Artificial General Intelligence – AGI

It is interesting that OpenAI and Google have different strategies. OpenAI is explicitly striving to create AGI (Artificial General Intelligence), or strong intelligence, and its subsequent products are steps towards that goal. Facebook, on the other hand, chose a different path, betting on an open-source model. There are legal doubts around it, so we won't go into them, but it is worth knowing that Facebook has admitted that its LLaMA model was trained on publicly available Facebook data. This means that users who published public content on Facebook may be co-authors of one of the best models available for free.

Google Publishes the Statistically Gifted LLM DataGemma

Janek: Returning to Google: alongside the flagship Gemini model, the company also publishes an open model called Gemma. It is simpler and has lower requirements, so it can be run on a local computer, and it is sufficient for many applications. Google recently introduced DataGemma, a variant of this model that is tightly connected to Data Commons, a huge collection of statistical and research data. DataGemma uses this data to make sure the model's answers are grounded in real figures, which minimises the risk of hallucinations, i.e. of giving incorrect information. The model checks the correctness of its answer both before and after producing it, which makes it well suited to applications requiring precise numerical data.

In summary, Google, OpenAI, and Facebook are developing their AI models in different directions, adapting them to different user needs. Each of these technological giants has its own strategy that will impact the future of artificial intelligence.

Ziemek: Well, but how can you use it? Is it only through the programming interface, or is there something more accessible?

Janek: I don't think it's only a programming interface. It's an open model that you can simply run yourself. There may be some interfaces that expose it somewhere, but it's not a hosted service like Gemini, where you just go to the site and have access to the model.
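In practice, "just running it" usually means pulling the weights from Hugging Face. Below is a minimal local-inference sketch for the open Gemma family; the model id is only an example (the Gemma repositories are gated, so you first accept Google's licence on Hugging Face), and the DataGemma variants are distributed the same way.

```python
# Run an open Gemma model locally via Hugging Face transformers.
# Requires `pip install transformers accelerate` and prior acceptance of the
# Gemma licence on Hugging Face (the repositories are gated).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",  # example id; larger variants work the same way
    device_map="auto",
)

result = generator(
    "What share of the world's population lives in cities?",
    max_new_tokens=128,
)
print(result[0]["generated_text"])
```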

IMAGEN 3 – Image Generation in Gemini

However, you can go to Gemini and generate an image, and that is new. Google has been working on its IMAGEN model for a long time, but until now it was not available to us. The model is similar to Midjourney or DALL-E, meaning it generates images from text. Currently we can use it both from the Gemini interface, where we simply ask the model to draw something, and more programmatically, through the Google console.

If we are in the US, or outside Europe, or using a VPN, we can also use a service called IMAGEN FX, which is a prompt editor. The quality of a generated image depends strongly on how well the prompt is written, and IMAGEN FX improves prompts and suggests changes to them. So we now have access to a model that generates images, and Google has publicly joined the group of providers whose tools let us create graphics.
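The programmatic route goes through Vertex AI. A hedged sketch is shown below; it assumes a Google Cloud project with Vertex AI enabled, and both the preview module path and the model identifier may differ between SDK releases, so treat them as examples rather than fixed names.

```python
# Generate an image with Imagen through the Vertex AI SDK.
# Requires `pip install google-cloud-aiplatform` and an authenticated GCP project.
import vertexai
from vertexai.preview.vision_models import ImageGenerationModel

vertexai.init(project="your-gcp-project", location="us-central1")

# Model id is an example and may change between releases.
model = ImageGenerationModel.from_pretrained("imagen-3.0-generate-001")
response = model.generate_images(
    prompt="A white eagle flying over the Vistula river, watercolour style",
    number_of_images=1,
)
response.images[0].save(location="eagle.png")
```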

Tom Hanks Warns Against AI Scams

Ziemek: Speaking of image generation, there was a recent high-profile case involving Tom Hanks. His likeness was being used in material most likely generated by people who wanted to mislead others by exploiting the image of a famous person. Tom Hanks said on his profile: "Hey, listen, it's not true. I'm not advertising any products or services here. Don't believe it, because these are simply bad people who are stealing my likeness."

Runway – Video-to-video

Janek: All these tools are, on the one hand, really cool and advanced: they can generate images with great consistency, so each subsequent generation depicts the same person. On the other hand, you have to be careful about what you create and what you watch. We're talking about images, but the same applies to video. Runway ML itself can generate images very well, and for some time newer and newer models have been able to generate not only images from a text description but video as well. They recently published a "video to video" mode, meaning that a new film can be prepared on the basis of an existing one. For example, we can upload a recorded sequence and ask the model to improve or change it according to our expectations. Moreover, they recently made this available not only through the interface but also through an API, which makes it easy to use the tool at scale.

Generating Subsequent Video Shots

Ziemek: In one of the demos I watched, there was a girl looking at a flower in a meadow with a magnifying glass. This was a real film recorded by a person, and the model generated the next frames. The generated frames did not show the girl but the view from her eyes through the magnifying glass onto a flower that was not in the recording at all before. This allows, for example, not having to record everything—if we forget to record something, we can add the missing scenes or fragments. It was really impressive that there was no visible difference between the quality recorded by a human and that generated by a computer.

Janek: At the level of individual photos, humanity has reached the stage where we can create photorealistic things. Now the development involves creating longer and longer materials—from one photo, we move to a three-second clip, which is actually an animated photo, and then to increasingly longer productions. Currently, you can find two-minute films or trailers that are entirely generated by AI—in both the video and audio layers, as well as the script. In many cases, they are indistinguishable from those created traditionally through recording and content creation.

Reflection 70b – The Creators' Blunder

Ziemek: We are talking about the positive things that work well here, but not everything always goes according to plan, and there are plenty of cases that can be called blunders. One of the latest is the publication of the Reflection 70B model: a large model, yet still significantly smaller than many current ones. Its creator claimed that it was many times better in quality than existing solutions while being much smaller. After the model was published, however, many people tried to replicate these excellent results and it turned out they could not be reproduced. Now everyone is wondering whether this was some kind of accident or a deliberate attempt to mislead people.

There are voices suggesting that Anthropic's Claude 3.5 Sonnet model was being used underneath. Moreover, tests showed that when the model was given instructions containing the word "Sonnet", it would not carry them out and cut the word out, as if there were hidden instructions forbidding the use of that name so as not to accidentally reveal which solution it relies on. Is this true? We don't know. The creator responded, explaining it was due to a faulty configuration; he is supposedly going to fix something, but he hasn't done it yet, so the matter remains unclear. Until it is clarified, we advise against using Reflection.
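The probe people ran is easy to reproduce against any endpoint that serves the model. A sketch is below; the base URL and model id are placeholders (the model was served by various OpenAI-compatible providers), so adjust them to wherever you access it.

```python
# Probe whether the model silently strips the word "Sonnet" from its output.
# The endpoint and model id below are placeholders for whichever
# OpenAI-compatible provider is serving Reflection 70B.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.invalid/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="reflection-70b",  # placeholder identifier
    messages=[{"role": "user", "content": 'Repeat exactly: "I compared it with Claude 3.5 Sonnet."'}],
)
text = response.choices[0].message.content
print("'Sonnet' was filtered out" if "Sonnet" not in text else "'Sonnet' was echoed back")
```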

Ziemek: That's why it's worth listening to our podcast to know what to use and what to avoid.

Janek: Thanks!

Visit Beyond AI on YouTube

The Beyond AI channel is created by specialists from WEBSENSA, a company that has been providing AI solutions to leading representatives of various industries since 2011.
