Spring AI

Basic Concepts

What is AI?

AI , or  Artificial Intelligence, is , as the name suggests, the science and technology that enables machines to simulate human intelligence.

Let’s use an example to illustrate this:

A typical computer program is like a vending machine. You press a specific button (input), and it gives you a specific drink (output). All its behaviors are pre-defined rules by the programmer.

Artificial intelligence programs are more like children learning. You show them many pictures of cats and dogs (data) and tell them which is a cat and which is a dog. After learning, when you show them a picture of a cat they’ve never seen before, they can recognize it. They learn patterns from the data themselves , rather than relying on hard-coded rules.

Therefore, the core of AI is learning from experience and making decisions or predictions based on what it has learned.

The most mainstream and attention-grabbing branch of AI currently is  generative artificial intelligence , also known as  AIGC (Artificial Intelligence Generated Content) . Unlike traditional AI (primarily used for data analysis, such as facial recognition), its goal is to automatically generate or create various types of digital content using artificial intelligence technology , such as writing articles, reports, translations, and programming.

To better understand AI, let’s first understand some of its common terms.

Model

A model is  the core of an AI system ; it is the result of training an algorithm on data . Essentially, a model is a mathematical function that receives input data, performs calculations, and then produces an output.

When we say “calling an AI,” we are actually using that “model.” Model files vary in size, ranging from a few MB to tens of GB.

You can think of an AI model as a  “virtual brain .” This brain acquires skills and knowledge through “training” or “learning” on massive amounts of data. When asked a question, it needs to use the knowledge it has acquired to solve the problem.

Large Language Model (LLM)

LLM (  Large Language Model) : A deep learning- based model trained on massive amounts of text data . Its main task is to understand and generate human language. LLMs are representative of the current generative AI trend. Their defining characteristic is ” large ,” reflected in the large amount of training data and the enormous number of model parameters.

It can be viewed as an “expert brain” that has undergone massive training   . By learning almost all texts on the internet, it has mastered the grammar, syntax, factual knowledge, and contextual logic of languages, possessing hundreds of billions or even trillions of parameters. And because it has learned everything, it can handle a wide variety of topics and tasks.

Prompt

Prompt: Instructions, questions, or contextual information provided by the user to the AI ​​model . The model generates a corresponding response based on the prompt. The quality of the prompt directly determines the quality of the AI’s answer.

The process of designing and optimizing prompts is known as ” prompt engineering ,” and it is an emerging skill.

The prompts are like ” work orders” given to the AI, this “genius .” The clearer and more specific the order, the higher the quality of the completed work.

For example:

Simple prompt : “What is the capital of France?” -> Model answers: “Paris”.

Complex prompt (role-playing) : “Imagine you are a senior nutritionist. Please design a healthy lunch menu for me (an office worker who sits for long periods) for a week.” -> The model will provide a detailed menu in the voice of a nutritionist.

Token

Token : This is the basic unit of text that the model processes and understands . It is not entirely equivalent to an English word or a Chinese character. Before processing, the model breaks the text down into tokens. Tokens are also the basic unit for billing and measuring the length of text processed by the model.

In English, the word  unbelievable  can be broken down into three morphemes: “un”, “believe”, and “able”.

In Chinese, the phrase “I like programming” can likely be broken down into four word units: “I”, “like”, “enjoy”, and “programming”.

Different models use different word segmentation rules, and the same word may be split into different word units in different models.

Having grasped the basic concepts of AI, let’s now look at  Spring AI-  related content.

What is Spring AI?

Spring AI is an open-source artificial intelligence application framework built on the Spring ecosystem . Its core goal is to simplify the integration of AI functionality into Java applications , enabling Java developers to efficiently build generative AI applications.

Official Documentation: Introduction :: Spring AI Reference Documentation – Spring Framework

Spring AI provides abstractions that form the foundation for developing AI applications . These abstractions have multiple implementations, allowing for easy component switching with minimal code changes .

Spring AI offers a range of powerful and practical features, making it a fully functional AI application development framework:

1. Unified multi-model support : Supports interaction with numerous mainstream AI model providers, including  OpenAI , Microsoft , Amazon , Google  , and  Anthropic  . Whether the model is in the cloud or deployed locally (such as through Ollama), it can be invoked through a consistent interface.

2. Powerful data integration capabilities : This is a major highlight of Spring AI. It has built-in support for vector databases such as Chroma, Pinecone, and Redis.

3. Seamless integration with the Spring ecosystem : As a member of the Spring family, it can naturally   work with other well-known projects such as Spring Boot and Spring Data .

4.  Simplified development model: Allows AI models to request and execute client-defined functions as needed, thereby accessing real-time information or triggering specific actions.

After understanding the relevant concepts, let’s get hands-on and experience  Spring AI.

Quick Start

Environmental requirements

JDK version: JDK 17 or above  (JDK 21 recommended). This is mandatory because Spring Boot 3.x itself requires JDK 17+.

Spring Boot version: Spring Boot 3.2 or above  , specifically 3.3.3, 3.4.3 or 3.5.0. Just choose a stable latest version of 3.x.

AI Service Credentials: A valid  API Key, requiring an account and API Key from an  AI service provider (such as OpenAI, DeepSeek, Alibaba Cloud, etc.).

Project creation

Spring AI  has designed  the spring-ai-openai-spring-boot-starter specifically for OpenAI and compatible API services , for quickly integrating large model language capabilities into Spring Boot applications:

Create a Maven  project normally  (note the JDK and Spring Boot versions), and add the Spring AI dependency :

        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
            <version>1.0.0-M6</version>
        </dependency>

For dependency version information, please refer to: https://docs.spring.io/spring-ai/reference/getting-started.html

Configure the API key in application.yml  

spring:
ai:
openai:
#DeepSeek
api-key: API Key applied for
base-url: https://api.deepseek.com
chat:
options:
model: deepseek-chat
temperature: 0.7

in:

spring.ai.openai.base-url : The URL to connect to

spring.ai.openai.api-key : The DeepSeek API key you requested.

spring.ai.openai.chat.options.model : The DeepSeek LLM model to use. 

`spring.ai.openai.chat.options.temperature` : Controls the randomness and creativity of the text generated by the model: Low temperature (close to 0.0)  → conservative, deterministic, predictable; High temperature (close to 2.0)  → adventurous, diverse, imaginative. In other words, the lower the temperature  value, the more similar the results will be for the same question. Furthermore, it is not recommended to modify both `temperature`  and  `top_p` in the same completion request  , as the interaction between these two settings is difficult to predict.

Configuration options can be found at: https://docs.spring.io/spring-ai/reference/api/chat/deepseek-chat.html

At this point, the project creation  and  DeepSeek integration are complete  . Next, we will write an interface to call the model.

Interface writing

@RestController
@RequestMapping("/deepseek")
public class DeepSeekChatController {
@Autowired
private OpenAiChatModel deepSeekChatModel;

@GetMapping("/chat")
public String generate(String message) {
return deepSeekChatModel.call(message);
}
}

Run and access 127.0.0.1:8080/deepseek/chat?message=who are you to test.

As described above, we  completed the interaction with the model through ChatModel. 

In the  Spring AI  framework, ChatModel  and  ChatClient  are the two core interfaces for building conversational AI applications. We will now examine these two interfaces separately.

Core Interface

ChatModel 

ChatMode  communicates directly with the underlying AI models (such as GPT-4, Claude, etc.) to handle raw requests and responses.

Therefore, ChatModel  is more low-level and more flexible in use.

@Service 
public class ChatService {

@Autowired
private ChatModel chatModel; // For example, OpenAiChatModel

public String askQuestion(String question) {
// 1. Construct the message
UserMessage userMessage = new UserMessage(question);

// 2. Create the Prompt
Prompt prompt = new Prompt(List.of(userMessage));

// 3. Call and get the complete response
ChatResponse response = chatModel.call(prompt);

// 4. Extract the content from the response
return response.getResult().getOutput().getContent();
}
}

ChatClient

ChatClient provides a smooth API layer  on  top of ChatModel  , simplifying common usage patterns.

In other words,  ChatClient  is  a wrapper around ChatMode  .

Therefore, ChatClient  is more advanced and also more concise.

@Service
public class ChatService {

@Autowired
private ChatClient chatClient;

public String askQuestion(String question) {
// Get it done in one line
return chatClient.call(question);
}
}

As you can see, ChatClient  is simpler and more intuitive to use.

Comparison between ChatMode  and  ChatClient  :

DimensionChatModelChatClient
Abstraction levelThe bottom layer, close to the original modelHigh-level, business-oriented
Return valueChatResponse (a complete response object containing rich metadata)ChatResponse (plain text that directly retrieves content) or streaming response
How to useThe Prompt object needs to be constructed manually.Provides a streaming builder pattern
Control granularityFine controlquick and easy

Message Type

In Spring AI, all message types implement  the `org.springframework.ai.chat.messages.Message`  interface, and the messages in the system are designed to simulate different participants in a multi-turn dialogue.

Message TypeCorresponding rolecore role
SystemMessageSystem/DirectorDefine the AI’s background, role, behavior, and response style. This is typically provided at the beginning of the conversation, setting the tone for the entire session.
UserMessageUser/QuestionerRepresenting the human side in human-computer interaction, it is the driving force behind the dialogue.
AssistantMessageAssistant/AI itselfThis represents the AI’s responses in previous rounds. It ensures the continuity of multi-round dialogue.
FunctionMessageFunctions/ToolsThis represents additional information or operational results obtained by AI through function calls.
ToolMessagetoolIt has the same function as ToolMessage and is an alias for ToolMessage.
MediaMessagemultimediaThis represents message data of other types besides text, such as images.

Among them, the most commonly used are  SystemMessage, UserMessage, and AssistantMessage.

SystemMessage

SystemMessage  is typically used to set the AI ​​assistant’s identity, personality, behavioral guidelines, and dialogue rules. It usually appears at the beginning of the conversation, setting the tone for the entire dialogue.

For example, we can pre-set its roles :

@RequestMapping("/chat") 
@RestController
public class ChatClientController {
private ChatClient chatClient;
public ChatClientController(ChatClient.Builder chatClientBuilder) {
this.chatClient = chatClientBuilder
.defaultSystem("Your name is Xiaoxiaoyu, a professional intelligent Q&A AI assistant, proficient in Java and Python, answering questions with a friendly attitude")
.build();
}

@GetMapping("/call")
public String generation(String userInput) {
return this.chatClient.prompt()
.user(userInput) // User input.call
() // Call the API
.content(); // Return the response
}
}

In  ChatClient  ,  the default system message for the AI ​​model is set via defaultSystem  . The system messages set through chained calls to ChatClient.Builder serve as the “initial instructions” for each conversation ,  being  injected into the context of each conversation to guide the AI’s response style or identity setting.

At this point, we access the interface and ask for its identity again.

UserMessage

UserMessage  represents the specific question or instruction we are asking. The “Who are you?” input above is  a UserMessage.

AssistantMessage

AssistentMessage : This is the response provided by the AI ​​model.

AssistantMessage  is key to enabling  coherent multi-turn conversations  . Each time the AI ​​responds, this response can be saved as  an AssistantMessage  and sent to the AI ​​as part of the historical context on the next request.

Output format

Structured output

To receive structured output from an LLM , Spring AI supports changing the return type of  the ChatModel/ChatClient  methods from Spring to other types.

The model output is converted into a custom entity using  the entity()  method.

For example:

@RequestMapping("/chat")
@RestController
public class ChatClientController {
private ChatClient chatClient;
public ChatClientController(ChatClient.Builder chatClientBuilder) {
this.chatClient = chatClientBuilder
.build();
}

@GetMapping("/entity")
public String entity(String userInput) {
Recipe entity = this.chatClient.prompt()
.user(String.format("Please help me generate a recipe for %s", userInput))
.call()
.entity(Recipe.class);
return entity.toString();
}

record Recipe(String dis, List<String> ingredients) {}
}

Streaming output

First, let’s compare and contrast what streaming output is:

Traditional output (non-streaming ): Returns the entire output all at once only after all data has been generated , resulting in a long wait for the user → then suddenly the complete answer is displayed. It’s like sending a regular letter ; you write all the content before sending it, and the recipient receives the entire letter at once.

Streaming output : Generates and returns results simultaneously , pushing partial results immediately and starting to display almost instantly → growing word by word. Like making a phone call , you can hear the other person speaking at the same time.

Streaming output process:

User question: “Please write a short essay about spring.”

AI model generation process:

“Spring”… (Return Now)

Spring is here… (Continue back)

“Spring is here, and all things are coming back to life”… (Continue to return)

Until a complete answer is generated, the SSE protocol is followed.

Spring AI primarily   achieves streaming output  through reactive programming , using the `stream()`  method to generate a  `Flux<String>  ` stream.

@RequestMapping("/chat")
@RestController
public class ChatClientController {
private ChatClient chatClient;
public ChatClientController(ChatClient.Builder chatClientBuilder) {
this.chatClient = chatClientBuilder
.build();
}

@GetMapping(value = "/stream", produces = "text/html;charset=utf-8")
public Flux<String> stream(String userInput) {
return this.chatClient.prompt()
.user(userInput)
.stream()
.content();
}

record Recipe(String dis, List<String> ingredients) {}
}

However, let’s consider this question: Since the HTTP protocol itself is designed as a stateless request-response model, which means that strictly speaking, it is impossible for the server to actively push messages to the client, how can we achieve streaming responses from the server?

We can  achieve streaming using SSE (Server-Sent Events) , which allows the server to proactively push data streams to the browser.

SSE Protocol Introduction

SSE  is a lightweight real-time communication protocol based on HTTP  . Browsers receive and process these real-time events through the built-in EventSource API. 

The server declares to the client that the next message to be sent is  a streaming message.

At this point, the client will not close the connection and will wait indefinitely for the server to send a new data stream.

SSE core features:

1. One-way communication : Data can only be pushed from the server to the client . The client cannot send data to the server through this connection (except for the initial connection establishment request).

2. Based on HTTP/HTTPS : SSE uses the standard HTTP protocol, which means it can easily bypass most firewalls and proxy servers without special configuration.

3. Persistent connection : The client initiates a normal HTTP request, but the server keeps the connection open instead of closing it after sending a response.

4. Text data stream : The server continuously sends a stream of text data that follows a specific format to the client through this persistent connection.

5. Automatic reconnection : The SSE protocol has a built-in reconnection mechanism. If the connection is unexpectedly lost, the browser will automatically attempt to reconnect to the server.

SSE Data Format

When a server sends SSE data to a browser, it needs to set the required  HTTP header information.

Content-Type: text/event-stream;charset=utf-8
Connection: keep-alive

The entire data stream consists of a series of messages. Each message comprises one or more lines of text , each line beginning with a field name , followed by a colon and a space, then the field value. Each message ends with a blank line (i.e., two consecutive newline characters \n\n).

Each line format: [field]: value\n

Common values ​​for the field include: data, event, id, and retry.

data

`data` : The message body, the most important field, carries the actual content of the message . If a message contains multiple  data lines, the client will \nconnect them with newline characters (“) to form a complete data string. It can be used to transmit any text data such as JSON strings, plain text, and XML.

Example:

data: This is a simple message\n\n

data: !\n\n

data: Hello\n

data: World\n

event

`event` : Event type, used to specify the custom type of the message. If this field is provided, the client will trigger a listener for that specific event name; otherwise, a general  onmessage event will be triggered, which can be used to classify and process different types of messages.

Example:

event: userJoined
data: Alice

id

`id` : Event ID, used to assign a unique ID (string) to the message. If the connection is interrupted, the last received ID will be automatically sent in the HTTP request header when the client reconnects. This can be used to implement idempotency and resumable interrupted downloads . Last-Event-ID 

Example:

id: msg-123data: This is an important message

retry

`retry` : Reconnection time. This indicates the number of milliseconds the browser should wait before retrying the connection after it has been lost . Since this is not a mandatory command , the browser may ignore it. It is used to prevent the client from retrying too frequently when the server experiences a failure.

Example: Tell the browser to wait 10 seconds before trying to reconnect if the connection fails.

retry: 10000

Let’s look at a simple example of the use of the SSE protocol.

SSE Usage Examples

Backend API:

@Slf4j 
@RequestMapping("/sse")
@RestController
public class SseController {
@RequestMapping("/end")
public void end(HttpServletResponse response) throws IOException, InterruptedException {
log.info("Request initiated: event");
response.setContentType("text/event-stream;charset=utf-8");
PrintWriter writer = response.getWriter();
for (int i = 0; i < 10; i++) {
// Event foo Event
String s = "event: foo\n";
s += "data: " + new Date() + "\n\n";
writer.write(s);
writer.flush();
Thread.sleep(1000L);
}
// Define the end event, indicating the end of the current stream
writer.write("event: end\ndata: EOF\n\n");
writer.flush();
}
}

Front-end implementation:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>SSE</title>
</head>
<body>
<div id="sse"></div>
<script>
let eventSource = new EventSource("/sse/end");
eventSource.addEventListener("foo", function(event) {
console.log(event);
document.getElementById("sse").innerHTML = event.data;
});
eventSource.addEventListener("end", function(event) {
console.log("Connection closed")
eventSource.close();
});
</script>
</body>
</html>

Run the program and observe the logs printed in the console.

You can see that the message was successfully transmitted and the connection was closed after the message transmission was complete.

In Spring,  the SSE protocol can be elegantly implemented  using  WebFlux , which is the Flux we used earlier . It is  the core component of WebFlux  . Let’s look at  the usage and common operations of Flux.

Flux

Flux workflow: Create → Transform → Filter → Consume

Create Flux :

import reactor.core.publisher.Flux; 

// 1.1 Create Flux from a fixed value
<String> fixedFlux = Flux.just("Hello", "World", "!");

// 1.2 Create a List from a collection
<String> list = Arrays.asList("A", "B", "C");
Flux<String> fromCollection = Flux.fromIterable(list);

// 1.3 Create a range of values
​​Flux<Integer> rangeFlux = Flux.range(1, 5); // 1,2,3,4,5

// 1.4 Dynamically generate
Flux<Long> intervalFlux = Flux.interval(Duration.ofSeconds(1)).take(5);

// 1.5 Create
a Flux from an array<String> arrayFlux = Flux.fromArray(new String[]{"X", "Y", "Z"}));

// 1.6 Create an empty stream
Flux<String> emptyFlux = Flux.empty();

// 1.7 Create from a Future (adapting to traditional asynchronous APIs):
`CompletableFuture<String> future = CompletableFuture.supplyAsync(() -> "Result");`

`Flux<String> futureFlux = Flux.fromFuture(future);`

Conversion operation :

// 2.1 map - One-to-one conversion 
Flux<String> original = Flux.just("apple", "banana", "cherry");
Flux<String> uppercased = original.map(String::toUpperCase); // APPLE, BANANA, CHERRY

// 2.2 flatMap - One-to-many conversion (asynchronous flattening)
Flux<String> words = Flux.just("hello world", "spring ai");
Flux<String> splitWords = words.flatMap(word ->
Flux.fromArray(word.split(" "))
);

// 2.3 cast - Type conversion
Flux<Object> objects = Flux.just("text1", "text2");
Flux<String> strings = objects.cast(String.class);

// 2.4 scan - Cumulative calculation
Flux<Integer> numbers = Flux.range(1, 4);
Flux<Integer> cumulativeSum = numbers.scan((acc, current) -> acc + current); // 1,3,6,10

Filtering operation:

// 3.1 filter - Conditional filtering 
Flux<Integer> allNumbers = Flux.range(1, 10);
Flux<Integer> evenNumbers = allNumbers.filter(n -> n % 2 == 0); // 2,4,6,8,10

// 3.2 distinct - Removing duplicates
Flux<String> withDuplicates = Flux.just("A", "B", "A", "C");
Flux<String> uniqueItems = withDuplicates.distinct()); // A,B,C

// 3.3 take - Taking the first N
Fluxes Flux<String> limited = original.take(2); // apple, banana

// 3.4 skip - Skipping the first N
Fluxes Flux<String> skipped = original.skip(1); // banana, cherry

// 3.5 takeWhile / skipWhile - Conditional taking/skipping
Flux<Integer> sequence = Flux.range(1, 100);
Flux<Integer> firstPart = sequence.takeWhile(n -> n < 10); // 1,2,3,...,9

// 3.6 sample - sampling (periodically taking the latest element)
Flux<Long> sampled = Flux.interval(Duration.ofMillis(100)))
.sample(Duration.ofSeconds(1)))
.take(3); // Sample every 1 second, for a total of 3 times

Consumption Operation :

// 4.1 subscribe - The most basic consumption method 
Flux<String> data = Flux.just("one", "two", "three");
data.subscribe(
item -> System.out.println("Received: " + item),
error -> System.err.println("Error: " + error)),
,
() -> System.out.println("Completed!"))
);

// 4.2 collectList - Collect all elements into a List
Mono<List<String>> listMono = data.collectList());

// 4.3 blockFirst / blockLast - Blocking fetching (for testing purposes only)
// String first = data.blockFirst();

// 4.4 reduce - Reduction operation
Mono<Integer> sum = numbers.reduce(0, Integer::sum));

// 4.5 count - Counting
Mono<Long> count = data.count();

// 4.6 hasElement - Check if there is an element
Mono<Boolean> hasData = data.hasElements();

// 4.7 then - Ignore elements, only trigger
Mono<Void> completionSignal = data.then();

Advisors

Advisors are an interceptor mechanism in Spring AI that allows us to inject custom logic into specific nodes of the AI ​​call chain.

Advisors intervened at two key moments:

Before Call : Executed before the request is sent to the AI ​​model , primarily used to modify prompt words.

After Call: Executed after receiving the AI ​​response but before returning it to the client.

Its execution process is as follows:

User input → Advisor1.before() → Advisor2.before() → AI model call → Advisor2.after() → Advisor1.after() → Final response

Spring AI includes several built-in Advisors , such as  SimpleLoggerAdvisor , whose main function is logging. Simply adding it to the Advisor chain will automatically record the Advisor’s chat requests and responses.

@RequestMapping("/chat") 
@RestController
public class ChatClientController {
private ChatClient chatClient;
public ChatClientController(ChatClient.Builder chatClientBuilder) {
this.chatClient = chatClientBuilder
.defaultSystem("Your name is Xiaoxiaoyu, a professional intelligent Q&A AI assistant, proficient in Java and Python, answering questions with a friendly attitude")
.build();
}

@GetMapping("/advisor")
public String advisor(String userInput) {
return this.chatClient.prompt()
.advisors(new SimpleLoggerAdvisor())
.user(userInput)
.call()
.content();
}

record Recipe(String dis, List<String> ingredients) {}
}

We configured the log level to  debug  for observation:

logging:
  level:
    org.springframework.ai.chat.client.advisor: debug

Observe the printed log content