This article was last updated 197 days ago. The information in it may have evolved or changed; if anything is out of date, please leave a comment.

Article Summary

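This article explores the technical principles and practical applications of multimodal AI in generating images, video, speech, and music from text. It reviews the generation mechanisms and suitable use cases of representative models such as DALL-E, Stable Diffusion, and Midjourney, and, in response to the pain points of using text-driven generation tools, introduces the deployment and hands-on experience of chatgpt-web-midjourney-proxy. By analyzing the challenges of multimodal data fusion and their solutions, it shows the potential of AI in creative design, education, advertising, and similar scenarios, and confirms the practical value of open-source tools in lowering the technical barrier and improving generation efficiency, offering a practical reference for putting multimodal AI to work.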

Qwen3-14B · 2026-06-18

Preface

As I spend more time with AI, some of my previously vague understandings are becoming clearer, and my needs have gradually moved from simple chat-style usage to text-driven generation (text-to-image, text-to-speech, text-to-video, text-to-code, and so on).

If we further break down AI requirements based on this, we can more precisely categorize them into the following levels (these levels are based on the different requirements for accuracy, subjectivity, and execution in application scenarios, as well as the user's trust in AI):

Level 1: Information Query and Fuzzy Answers

This level of AI requires quick delivery of vague information or recommendations, similar to a search engine alternative. It's primarily used in scenarios like daily Q&A and encyclopedia searches, where users only need general, actionable answers rather than precise content, such as travel recommendations or recipe suggestions. These requests typically don't require rigorous verification unless they involve health or sensitive areas.

Level 2: Open content generation (high subjectivity, low accuracy)

At this level, AI is primarily used to generate content that suits the user's taste from user-supplied text, satisfying subjective needs rather than precise standards. Typical applications include image generation and creative writing, for example generating artistic images, videos, or written stories from a text prompt. As long as the output meets the user's aesthetic or emotional preferences, objective accuracy is not a concern.

Level 3: Runnable task generation (basic accuracy requirements)

At this level, AI is expected to generate executable content with basic functionality for direct use by machines or people, such as code or preliminary data-analysis scripts. The generated content must pass basic testing and meet certain functional requirements, but it does not need to be completely error-free.

Level 4: High-precision generation and decision support (high accuracy and strict logic)

This level requires AI to generate rigorously accurate and logically sound content, particularly in rigorous fields such as medicine and law. Applications include generating medical reports and drafting legal provisions. The generated content must achieve a high level of accuracy and logic, and often requires further human verification to ensure there are no logical errors or ambiguities.

Level 5: Complex, multi-step task execution (automated, multimodal tasks)

This level involves AI executing multi-step, multimodal tasks while maintaining logical consistency throughout the process. Applications such as intelligent customer service, virtual assistants, and autonomous driving require AI to integrate data inputs from various modalities, such as voice, images, and text, and respond precisely based on the context. This requires AI to maintain high accuracy and consistency across successive steps.

Level 6: Real-time, dynamic decision-making and adaptive tasks

At this level, AI needs to be able to adapt to environmental changes in real time and perform complex, unstructured tasks. Typical applications include real-time route adjustment in autonomous driving or intelligent security monitoring. AI not only needs to dynamically perceive the environment but also strike a balance between high precision and rapid response to ensure reliable and secure decision-making.

Level 7: Fully autonomous multimodal collaboration and creative tasks

The highest-level AI requirement is the ability to autonomously perform multimodal collaboration and creative tasks, such as generating complex design solutions and proposing scientific experiments. These tasks require AI to possess a high degree of creativity and self-evaluation, enabling it to act as a "collaborator." While ensuring logical rigor, AI must also be innovative, such as in automating scientific research experiment design or generating new drug molecules.

This stratification of AI usage, from simple to complex and from vague to precise requirements, matches what different tasks demand in practice: it helps us understand how the functions and output requirements of AI change across scenarios.

Note 1: Levels 4 to 7 are advanced applications of AI, which are generally not accessible to ordinary people. Therefore, levels 1 to 3 are the scenarios we encounter most often.

Note 2: Starting from level 2, an important concept is involved: the multimodality of AI.

Multimodality of AI

FBI warning: This part is relatively boring, but it is essential from the perspective of the article structure. Friends who are not interested in these technical details can jump directly to the practical part at the end.

What is multimodality in AI?

Multimodal AI refers to the ability of AI systems to understand, process, and generate multiple types of data, such as text, images, audio, and video, and to establish connections between different data modes. For example, a multimodal AI can generate an image based on a text description (text-to-image) or generate a text description of a scene in a video (video-to-text). This capability makes AI more flexible in handling complex and diverse tasks and can more naturally simulate human multi-sensory information processing.

The importance of multimodality and its difficulties in implementation

The importance of multimodal AI lies in its ability to more comprehensively understand the information provided by humans. Humans often rely on multiple modes of communication, such as language, facial expressions, body movements, and voice, to convey information through multiple channels. Multimodal AI can integrate these different data sources to form a deeper understanding of information, which is particularly important in the following aspects:

1. Enhance comprehension: By integrating information from different modalities, AI can more accurately understand context. For example, in autonomous driving, AI systems need to simultaneously process visual (road images), acoustic (alarms or sounds of surrounding vehicles), and radar sensor data to make safer decisions.

2. Improve generation capabilities: Multimodal AI can convert content from one modality to another, such as generating an image from text or a description from an image. This capability is widely used in creative content generation, writing assistance, design, and other fields, helping people realize their creative ideas more easily.

3. Achieve more natural human-computer interaction: Multimodal AI makes human-computer interaction closer to communication between humans. For example, in virtual assistants, by combining voice, image and text understanding, AI can understand user needs more comprehensively and provide more practical answers.

The difficulty in implementing multimodal AI lies in the fact that different data modes are inherently different and require different processing methods. Image data and text data have completely different characteristics and structures, making it a major challenge for AI to learn to relate them. Furthermore, effectively integrating information from different modalities to avoid information loss and bias is also a key topic in multimodal AI research.
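One simple way to picture the integration problem is late fusion: each modality is handled by its own model, and only the per-modality scores are combined at the end. The sketch below is a minimal illustration of that idea, not how any particular system works; the modality names, scores, and weights are made-up values.

```python
def late_fusion(scores, weights):
    """Combine per-modality confidence scores into one weighted average.

    Modalities missing from `scores` (for example a failed sensor) are
    skipped, and the remaining weights are renormalised.
    """
    available = [name for name in weights if name in scores]
    if not available:
        raise ValueError("no modality produced a score")
    total_weight = sum(weights[name] for name in available)
    return sum(scores[name] * weights[name] for name in available) / total_weight


# Hypothetical weights: how much each modality is trusted.
WEIGHTS = {"camera": 0.5, "radar": 0.3, "audio": 0.2}

# Hypothetical confidences that an obstacle is ahead.
obstacle_scores = {"camera": 0.9, "radar": 0.8, "audio": 0.4}
fused = late_fusion(obstacle_scores, WEIGHTS)

# The same call still works when one modality drops out.
fused_without_audio = late_fusion({"camera": 0.9, "radar": 0.8}, WEIGHTS)
```

Real systems usually fuse learned features rather than final scores, but the renormalisation step shows why a missing modality is a source of the bias mentioned above.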

Typical applications of multimodal AI

Multimodal AI applications are enriching human-computer interaction, making AI more flexible and intelligent in handling diverse tasks. They are also bringing new possibilities to industries such as artistic creativity, film and television production, and content creation. Common multimodal AI applications include the following scenarios.

1. Text-to-image: Multimodal AI generates images based on user-entered text descriptions. For example, if a user enters "beach scenery at sunset," the AI generates a matching image. Representative models include DALL-E, Stable Diffusion, and Midjourney, which are widely used in creative design, digital art, and advertising.

2. Text-to-video: AI generates video clips based on text descriptions. For example, if you input "sunrise in the forest," the AI can generate a video of the sun gradually rising and illuminating the forest. This application is still in its early stages, but it has great potential in areas such as entertainment, advertising, and education. Google's Imagen Video and Meta's Make-A-Video are cutting-edge research models in this field.

3. Text-to-speech: AI generates natural-sounding speech from input text. Typical applications include virtual assistants (such as Siri and Alexa) and assistive reading tools. Users input text, and AI generates the corresponding speech, for example an AI news anchor reading out a news report. This application is common in customer service, smart devices, and assistive tools.

4. Text-to-music: Multimodal AI generates contextually appropriate music based on text descriptions or emotional cues. For example, a user can input "warm and relaxing background music," and the AI will generate a corresponding melody. Google's MusicLM, for instance, learns the association between audio and text to generate music clips that match the user's description. This is suitable for short videos, podcast background music, and creative music production.

5. Image-to-text (image captioning): AI generates text descriptions or tags based on image content to help users understand the image. For example, social media platforms can automatically generate descriptions for images, and in image search, text descriptions make it easier for users to find relevant images. A common building block is OpenAI's CLIP, which learns to match images with text.

6. Image-to-audio: AI generates audio effects based on images, such as generating the sound of a thunderstorm from a thunderstorm scene. This technology is extremely useful in augmented reality (AR) and virtual reality (VR), making virtual scenes more immersive. It can also be applied to film special effects and multimedia content production.

7. Video-to-text (video captioning): AI analyzes video content and generates summary descriptions or frame-by-frame text annotations to facilitate content search and understanding. For example, this can generate automatic subtitles for news and documentaries, or provide detailed audio descriptions for accessible videos. Related models include Meta's VideoCLIP.

8. Audio-to-text: This application forms the foundation of speech recognition systems. AI converts audio content into text, such as real-time transcription of speeches and meetings. It is widely used in voice assistants, video subtitle generation, and telephone customer service. Typical systems include Google's Speech-to-Text and Apple's Siri.

9. Audio-to-image: AI generates images or visual effects based on audio content, such as generating dynamic images from music or generating scene images from audio cues. This technology is often used in music video production and sound visualization to enhance the expressiveness and artistry of audio.

10. Combined multimodal generation: AI combines information from different modalities (such as text, images, and audio) to generate new content or provide complex services. For example, autonomous driving systems process visual, radar, and audio data to make decisions, and virtual assistants combine user input with contextual images to provide customized responses. This type of application is particularly well-suited for complex scenarios, such as smart security and intelligent customer service.

Note: Since there are too many multimodal application areas, this article only summarizes the areas that I am most interested in: "text-to-image", "text-to-video", "text-to-speech", and "text-to-music".

Text to image

In the field of multimodal AI, text-to-image generation lets AI produce images from user-provided text descriptions. It is commonly used in creative, design, advertising, and entertainment scenarios. In recent years this field has developed rapidly, with many representative models emerging, such as OpenAI's DALL-E, Stability AI's Stable Diffusion, and Midjourney. A brief introduction follows:

1. Main Models and Technologies

• DALL-E: The DALL-E series developed by OpenAI is a pioneering project in this field. The first DALL-E used a Transformer trained on a large dataset of image-text pairs to generate semantically consistent images from input descriptions. DALL-E 2 moved to diffusion-based generation, and DALL-E 2 and DALL-E 3 achieved significant improvements in image quality, detail, and diversity. DALL-E 3 also follows the details of the user's text description more closely.

• Stable Diffusion: Stable Diffusion is an open-source system that generates images through a diffusion process. It works with a variety of inputs, such as images and text, and offers strong user customization. Its open-source nature has made it popular in the creative community, allowing users to customize the model and even fine-tune it to generate images in a specific style or theme.

• Midjourney: Midjourney is an AI platform focused on image generation, primarily offered through an online community and Discord server. Renowned for its artistry and distinctive style, Midjourney is well-suited to creative and artistic expression. The images it generates often have a recognizable aesthetic, earning them widespread recognition among designers and artists.

Both DALL-E and Stable Diffusion can be called through some mainstream local large language model UIs (such as LobeChat).

For Stable Diffusion, due to its open source nature, it also supports various local deployment methods.

By default, Midjourney can only be accessed through Discord (an instant messaging and VoIP social platform that supports voice calls, video calls, text messages, and media sharing). Users must join the Discord community to generate images, and there is no official API or local deployment option. Of course, the ingenuity of the community is endless: there are open-source projects that use proxies to make Midjourney callable from other user interfaces, which I will discuss later.

2. Technical Principle

• Transformer model: The original DALL-E relies on a Transformer to generate images by modeling the semantic relationship between text and images. The Transformer uses a self-attention mechanism to capture contextual information in the text description and then maps it to the image space to generate the corresponding image.

• Diffusion models: Stable Diffusion uses a diffusion model to gradually transform random noise into a clear image. By removing a little noise at each step, guided by the text description, the model gradually arrives at an image that matches the prompt. This approach excels at generating clear, detailed images, and the generation process is both stable and flexible.

Note: Midjourney's technical details have not been publicly disclosed, so its specific architecture is still unclear. The following is only my guess based on its results and known features: it may not be a plain Transformer model (like the original DALL-E) or a standard diffusion model (like Stable Diffusion), and may instead combine diffusion models, style transfer, and other techniques to achieve its distinctive artistic effects.
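The denoising idea behind diffusion models can be sketched with a toy example. The code below is an illustration under strong assumptions, not Stable Diffusion's implementation: the "image" is a short list of numbers, the noise schedule values are arbitrary, and `oracle_noise` is a hypothetical stand-in for the trained neural network (it cheats by looking at the clean data, which a real model cannot do). It shows the forward noising step and a deterministic, DDIM-style reverse loop.

```python
import math
import random


def make_alpha_bars(steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule."""
    alpha_bars = [1.0]  # index 0 means "no noise added yet"
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / max(steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Forward process: jump straight from clean data to a noisy version."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
          for x, n in zip(x0, noise)]
    return xt, noise


def oracle_noise(xt, x0, alpha_bar):
    """Hypothetical stand-in for the trained network: the exact noise in xt."""
    return [(x - math.sqrt(alpha_bar) * c) / math.sqrt(1.0 - alpha_bar)
            for x, c in zip(xt, x0)]


def denoise(xt, x0, alpha_bars):
    """Reverse process: remove noise one step at a time (DDIM-style, no randomness)."""
    x = list(xt)
    for t in range(len(alpha_bars) - 1, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = oracle_noise(x, x0, ab_t)
        # Estimate the clean data implied by the predicted noise...
        x0_hat = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                  for xi, e in zip(x, eps)]
        # ...then step to the slightly less noisy level t - 1.
        x = [math.sqrt(ab_prev) * c + math.sqrt(1.0 - ab_prev) * e
             for c, e in zip(x0_hat, eps)]
    return x


rng = random.Random(0)
alpha_bars = make_alpha_bars(200)
clean = [0.5, -1.0, 2.0]                  # toy "image"
noisy, _ = add_noise(clean, alpha_bars[-1], rng)
restored = denoise(noisy, clean, alpha_bars)
```

In a real text-to-image model the noise predictor is a neural network conditioned on the text prompt, so the loop converges to a new image that fits the description rather than to a known original.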

3. Application Scenarios

• Advertising and Marketing: Text-generated images are increasingly popular in advertising creatives. Marketers can use these models to quickly generate images that align with brand values, enhancing visual impact and reducing production costs.

• Games and Film: In industries such as games, film and television that require a large amount of visual materials, AI-generated images can help designers quickly generate scene concept maps or character designs, improve efficiency and explore various visual styles.

• Social media and self-media creation: Many self-media creators use generation tools such as DALL-E and Midjourney to create unique visual content, enhancing the appeal of social media content.

4. Related Technology: Image-to-Text

Image-to-text, also known as image captioning, uses AI to automatically analyze image content and generate text descriptions. It is widely used in image search, automated annotation, and assistive tools for blind users. Representative services include Microsoft's Azure Computer Vision; OpenAI's CLIP is often used alongside such systems to score how well a piece of text matches an image, although it does not generate captions by itself. Image-to-text technology typically uses deep neural networks to extract image features and then uses a text generation module to produce natural language descriptions that match the image content. This technology is particularly suitable for rapidly understanding and indexing images.

Generating text from images (image description) and generating images from text (image generation) are closely related in the field of multimodal AI, mainly reflected in the following aspects:

• Shared multimodal representation space: Many advanced models (such as CLIP and DALL-E) represent images and text in the same multimodal space, creating a semantic connection between images and text. This connection lets a model support both image descriptions (image-to-text) and semantically consistent image generation from text. This shared representation is the foundation for both tasks.

• Mutual support of training data: Models that generate text from images and images from text are typically trained on images paired with text labels. This data helps the model learn the correspondence between images and text, understanding both the image content and the text description. Because the data is shared, progress on one task (e.g., more accurate image description) can improve the other (e.g., generating images that better fit the description).

• Complementarity of application scenarios: In practical applications, image-to-text and text-to-image are often used together. For example, in e-commerce, image-to-text can automatically generate product descriptions, while text-to-image can generate renderings or design examples based on product descriptions. This complementary approach improves content production efficiency and lets the two models work together to enrich the user experience.

• Two-way enhancement of understanding and generation: Generating text from images lets AI better understand image content, while generating images from text lets AI visualize a description. Combining the two strengthens the model's vision-language understanding and generation in both directions, enabling more natural multimodal interaction.

Through these associations, generating text from images and generating images from text have formed a two-way complementary relationship in technology and application, jointly promoting the development of multimodal AI.
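The shared representation space described above can be illustrated with a small retrieval sketch. The 3-dimensional vectors and file names below are invented for illustration; a real model such as CLIP would produce much longer embeddings from actual images and text. The point is only that once both modalities live in one space, the same similarity function serves both directions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query_vec, candidates):
    """Return the candidate name whose vector is closest to query_vec."""
    return max(candidates,
               key=lambda name: cosine_similarity(query_vec, candidates[name]))


# Hypothetical embeddings in a shared image-text space.
IMAGE_VECS = {
    "beach_sunset.jpg": [0.9, 0.1, 0.2],
    "forest_sunrise.jpg": [0.1, 0.9, 0.3],
}
TEXT_VECS = {
    "beach scenery at sunset": [0.8, 0.2, 0.1],
    "sunrise in the forest": [0.2, 0.8, 0.4],
}

# Text -> image lookup and image -> text lookup use the same function.
image_for_text = best_match(TEXT_VECS["sunrise in the forest"], IMAGE_VECS)
text_for_image = best_match(IMAGE_VECS["beach_sunset.jpg"], TEXT_VECS)
```

Generative models go further than lookup, but they rely on the same kind of alignment between the two modalities.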

Text to video

In the multimodal AI field of "text-to-video," AI generates short videos tailored to user-provided text descriptions. While this technology is still in its developmental stages, it has already demonstrated significant potential and is gaining widespread attention in fields such as entertainment, advertising, and education. Key representative models include Meta's Make-A-Video and Google's Imagen Video. The following is a detailed introduction to this field.

1. Main Models and Technologies

• Make-A-Video: Launched by Meta, Make-A-Video was one of the most advanced text-to-video research models at its release. It generates video clips with smooth motion and rich detail from input text descriptions. It builds on a diffusion-based text-to-image model trained on image-text pairs and learns motion from unlabeled video, which lets it produce natural-looking video without a large paired text-video dataset.

• Imagen Video: Launched by Google, Imagen Video is a multimodal AI system that generates videos from text descriptions. It uses a cascade of diffusion models combined with a text encoder to generate higher-resolution, more layered videos. The model can generate multiple scene elements and dynamic changes, making it suitable for short-duration video content.

• Luma: Luma Labs specializes in high-quality 3D content generation, using text descriptions to generate realistic 3D videos. Its model is particularly well-suited for scenes requiring fine visual detail and is used in virtual reality, augmented reality, and film and television production.

• Runway: Runway's Gen-2 model supports text-based multimodal video generation, producing short videos from simple text descriptions. Runway excels at creative content and short videos, and provides special effects and creative tools that are widely used in content production, advertising, and short video platforms.

• Pika Labs: Pika Labs provides a fast video generation service, capable of generating rich video clips from short text descriptions. The Pika model balances speed and quality, making it suitable for scenarios requiring rapid video generation, such as social media creation and e-commerce advertising.

• Kling: Kling (also romanized as Keling), developed by Kuaishou, is a Chinese multimodal generative model focused on generating videos from text in the Chinese context. It can generate realistic video content from Chinese descriptions. It is suitable for short video creation, advertising, and educational content generation in China, and is particularly well-suited to the creative needs of Chinese-speaking users.

These models have different focuses on video generation performance, ranging from realistic 3D videos to quickly generated creative short videos, and can bring innovative solutions to multimodal content creation.

Currently, these text-to-video models differ significantly in how they can be used. Some provide open APIs (or have private interfaces that third parties have analyzed and successfully emulated), so they can be integrated into a unified interface through third-party tools or proxy services and used from local or custom front ends. Many more provide no public API and can only be used on the official platform or in a dedicated application. As a result, models with APIs are suitable for flexible integration in many scenarios, while models limited to dedicated interfaces depend on the official platform and are generally suited to standardized use.

Of course, the power of netizens is still infinite, which I will talk about later.
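The split between API-backed models and platform-only models can be pictured as a small gateway that forwards a prompt to whichever backend has been registered. This is a minimal sketch of the integration pattern, not the design of any specific tool; `VideoGateway`, the backend functions, and the returned URL-like strings are all hypothetical.

```python
from typing import Callable, Dict

# A backend is any function that turns a prompt into a video reference.
Backend = Callable[[str], str]


class VideoGateway:
    """Single entry point that dispatches to whichever backend is registered."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}

    def register(self, name: str, backend: Backend) -> None:
        self._backends[name] = backend

    def generate(self, name: str, prompt: str) -> str:
        if name not in self._backends:
            # Models without an API or proxy cannot be reached from here.
            raise KeyError(f"no API or proxy registered for {name!r}")
        return self._backends[name](prompt)


def fake_api_backend(prompt: str) -> str:
    """Hypothetical stand-in for a model that exposes an official API."""
    return f"api://video?prompt={prompt}"


def fake_proxy_backend(prompt: str) -> str:
    """Hypothetical stand-in for a third-party proxy in front of a closed model."""
    return f"proxy://video?prompt={prompt}"


gateway = VideoGateway()
gateway.register("model_with_api", fake_api_backend)
gateway.register("model_via_proxy", fake_proxy_backend)

result = gateway.generate("model_with_api", "sunrise in the forest")
```

A front end only needs to know the gateway; adding a new model means registering one more backend function.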

2. Technical Principle

1. Diffusion Models: Diffusion models have gradually become the mainstream choice for video generation, especially for high-quality, coherent content. During training, noise is gradually added to video data and the model learns to reverse that process; at generation time it starts from noise and denoises step by step until the frames match the description. For example, Runway's Gen-2, Imagen Video, and Pika Labs all use diffusion models to generate high-quality video clips by gradually removing noise. Thanks to this gradual generation process, diffusion models tend to outperform traditional generative adversarial networks in preserving detail and in stability.

2. Generative Adversarial Networks (GANs): GANs were once one of the main technologies for image and video generation, using adversarial training between a generator and a discriminator to produce realistic content. However, GANs are prone to instability when generating high-resolution video, especially over long runs of consecutive frames. As a result, they have been gradually surpassed by diffusion models in video generation. Note that Make-A-Video, although sometimes grouped here, is described by Meta as building on diffusion-based text-to-image generation.

3. Autoregressive Models: Some models generate video content frame by frame or segment by segment through autoregressive generation. Autoregressive models rely on information from the previous frames or segments to generate video content sequentially. They are particularly suitable for short videos or videos on specific topics. Luma's model may use an autoregressive method in 3D scene generation, keeping the video content consistent and clear in 3D rendering details.

4. Multimodal Embedding: Many video generation models, including Kling, use multimodal feature embedding to map sentiment, style, and scene characteristics from text descriptions to video features. This lets the model generate content that fits cultural and linguistic expectations in the Chinese context, improving the alignment between generated videos and text descriptions. Kling focuses on the needs of Chinese users and improves the emotional expression and thematic consistency of generated content by optimizing the mapping from Chinese text to video.
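The autoregressive approach from the list above can be sketched in a few lines: every new frame is produced from the frames that came before it. The example is a toy under stated assumptions; a "frame" is a short list of pixel values, and `drift_right` is a hypothetical stand-in for a learned next-frame predictor.

```python
def generate_frames(first_frame, predict_next, num_frames, context=2):
    """Generate frames one at a time, each conditioned on the last `context` frames."""
    frames = [list(first_frame)]
    while len(frames) < num_frames:
        frames.append(predict_next(frames[-context:]))
    return frames


def drift_right(history):
    """Hypothetical predictor: shift the latest frame one pixel right, wrapping around."""
    last = history[-1]
    return [last[-1]] + last[:-1]


# A 1-D "frame" with a single bright pixel that moves across the image.
clip = generate_frames([1, 0, 0, 0], drift_right, num_frames=4)
```

Because each frame depends on earlier output, small errors can accumulate over long sequences, which is one reason this approach is usually described as better suited to short clips.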

3. Application Scenarios

• Advertising and Marketing: Text-generated videos have significant applications in advertising. They can quickly produce short clips that match the brand tone, reduce production costs, and offer a rich source of visual ideas.

• Social Media and Content Creation: Self-media and short video creators can use these tools to generate short video materials, such as simple scene transitions or background effects, to make their content more attractive and varied.

• Education and training: In education, text-generated videos can help explain abstract concepts with dynamic, explanatory content. For example, animations of scientific experiments and physics processes can be generated from text descriptions, helping students better understand complex concepts.

4. Related Technology: Video-to-Text

Video-generated text, also known as video description generation, is a technology that uses AI to automatically analyze video content and generate text descriptions. It is widely used in scenarios such as video search, automated subtitle generation, and content review. Representative models include Google's VideoBERT, Microsoft's Azure Video Indexer, and Meta's VideoCLIP. Video-generated text technology typically uses deep neural networks to extract temporal and visual features of videos, then uses a text generation module to generate natural language descriptions that match the video content. It is particularly suitable for scenarios requiring rapid video understanding and indexing. Future development directions include multi-level plot descriptions, cross-language subtitle generation, and higher semantic accuracy.

Generating text from video (video description generation) and generating video from text (text-driven video generation) are closely related in multimodal AI, mainly reflected in the following aspects:

1. Shared multimodal understanding and generation capabilities: Both techniques require the model to understand the deep connections between video and text. Generating text from video requires converting visual and temporal information into linguistic descriptions, while generating video from text requires producing visual and temporal information consistent with the text. This bidirectional conversion relies on the model's shared understanding of video and text, enabling it to both parse content and generate semantically consistent output.

2. Bidirectional support of datasets: Generating text from video and generating video from text often use similar datasets, especially videos paired with text descriptions. These pairs let models learn to generate descriptions while also learning the mapping from text to video, so progress on one task supports the other.

3. Mutual enhancement of application scenarios: Video-to-text and text-to-video can be combined in many application scenarios. For example, in content creation, users can quickly generate a rough clip with text-to-video, and then use the automatic description from video-to-text to review and refine it. In fields such as education and news, text-to-video can generate visual content, while video-to-text can generate subtitles and descriptions to aid dissemination and understanding.

4. Complementarity of cross-domain applications: Generating text from video can help improve the accuracy of text-generated videos. For example, when generating videos with complex plots, the descriptions produced by a video-to-text module can be used to automatically check or refine the generated video, keeping the text and the video content consistent.

Through these associations, video-to-text and text-to-video generation promote each other, forming a complementary relationship in technology and application, enabling multimodal AI to more comprehensively support complex video-text interaction needs.
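The proofreading loop described above (generate a video, caption it, compare the caption with the prompt) can be sketched as follows. This is an illustration only: `fake_text_to_video` and `fake_video_to_text` are hypothetical stand-ins for real models, and simple word overlap (Jaccard similarity) stands in for whatever semantic comparison a real pipeline would use.

```python
import re


def token_set(text):
    """Lowercased words of a sentence as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap score between two sentences, from 0.0 to 1.0."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def generate_with_check(prompt, text_to_video, video_to_text,
                        threshold=0.5, max_tries=3):
    """Regenerate until the video's caption overlaps enough with the prompt."""
    best_video, best_score = None, -1.0
    for attempt in range(max_tries):
        video = text_to_video(prompt, attempt)
        score = jaccard(prompt, video_to_text(video))
        if score > best_score:
            best_video, best_score = video, score
        if score >= threshold:
            break
    return best_video, best_score


# Hypothetical stand-ins: a "video" is just an index into canned captions.
CAPTIONS = ["a cat on a sofa", "sunrise in the forest with mist"]


def fake_text_to_video(prompt, attempt):
    return attempt


def fake_video_to_text(video):
    return CAPTIONS[video]


video, score = generate_with_check(
    "sunrise in the forest", fake_text_to_video, fake_video_to_text)
```

Here the first attempt is rejected because its caption shares no words with the prompt, and the second is accepted.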

Text-to-speech (TTS)

In the multimodal AI field of "text-to-speech," AI generates natural and fluent speech from text entered by the user. This technology is widely used in voice assistants, customer service systems, education, and other fields, and can generate customized voice content with varying timbre, speed, and emotion. Representative models include Google's Tacotron and Microsoft's Azure TTS; OpenAI's Whisper is often mentioned alongside them but handles the reverse direction, speech-to-text. The following is a detailed introduction to this field, along with a brief introduction to "speech-to-text" technology.

1. Main Models and Technologies

• Tacotron: Launched by Google, Tacotron is an early end-to-end text-to-speech model. It converts text into natural-sounding speech and can even synthesize specific tones and intonations. Tacotron 2 further improves sound quality and stability, laying the foundation for subsequent TTS models.

• Azure Text-to-Speech (TTS): Microsoft's Azure TTS provides a flexible text-to-speech service that supports synthesized speech in multiple languages, accents, and emotions. Users can adjust the voice style to suit the needs of different industries, such as intelligent customer service and educational broadcasting.

Note: TTS is the most commonly used of these text-driven generation features, and it can usually be called directly from a local large-model UI.

2. Technical Principle

• Sequence-to-Sequence (Seq2Seq) model: Tacotron and other text-to-speech models are typically based on a Seq2Seq architecture that encodes the text and decodes it into acoustic features such as spectrogram frames, which are then turned into an audio waveform. These models use deep neural networks to learn the correspondence between text and speech.

• Waveform generation technology: Tacotron 2 pairs its Seq2Seq network with a neural vocoder (WaveNet) that converts the predicted spectrogram into audio, and other neural vocoders such as WaveGlow serve the same purpose. These vocoders predict the audio waveform with a neural network, which brings synthesized speech closer to a real human voice.

3. Application Scenarios

• Virtual assistants and smart devices: Text-to-speech is widely used in virtual assistants such as Alexa and Siri. AI can generate natural speech based on user commands, making smart devices interactive.

• Telephone customer service and service robots: Many companies are using AI voice for customer service, improving service efficiency by generating natural-sounding speech. The generated speech can mimic a real person's accent and intonation, improving the customer experience.

• Education and media: Text-to-speech is used in education to automate broadcasting and explanation, and is suitable for language learning, e-book reading, and course audio production. Speech generation technology is also widely used in the media industry to automatically generate news broadcasts or audio content.

4. Related Technology: Automatic Speech Recognition (ASR)

Speech-to-text technology transcribes spoken audio content into text. It is widely used in scenarios such as voice assistants, subtitle generation, and telephone customer service. Representative models include Google Speech-to-Text, Whisper, and Microsoft's Azure ASR. ASR technology typically uses deep neural networks to analyze and recognize audio features, accurately transcribing speech content into text. It is particularly suitable for voice input scenarios that require text recording. Future development directions include multilingual support and improved adaptability to noisy environments.

Note: Whisper is a model released by OpenAI that focuses on automated speech recognition (ASR) tasks. Trained on a large dataset of multilingual and multi-context data, it boasts high speech recognition capabilities and is particularly well-suited for scenarios requiring accurate semantic representation, such as language learning and broadcasting. It's important to note that Whisper doesn't generate speech, but rather serves as the core technology for speech recognition.

In many voice interaction scenarios, Whisper and TTS are often used together: Whisper is responsible for speech recognition, converting user speech into text. The AI model understands the user's needs based on the text content, and the TTS model then converts the response into speech output. This process creates a complete closed loop from speech to text and back to speech, and is suitable for applications such as voice assistants and customer service systems. Take Lobechat's "Voice Service" as an example:

ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) are closely related in both technology and application. Together, they form a "bidirectional speech-text conversion system":

• Build a complete voice interaction closed loop: ASR and TTS complement each other, providing a complete interactive experience for applications such as voice assistants, customer service systems, and smart devices. The user speaks, the system converts the speech into text using ASR, and once the intent is understood, TTS synthesizes a natural-sounding response, enabling two-way voice interaction. Smart assistants like Siri and Alexa rely on this closed loop, allowing users to complete actions without text input.

• Sharing voice and text features and training data: ASR and TTS models are typically trained on similar paired speech-text data. In multimodal AI models, the mapping between speech and text is crucial. Using the same or similar datasets, ASR can generate accurate text from speech, while TTS uses that text to generate appropriate speech. The use of bidirectional training data improves model accuracy and consistency.

• Improve the naturalness and intelligence of human-computer interaction: The combination of ASR and TTS makes the human-computer interaction experience more natural and coherent. For example, in customer service robots, user voice requests can be accurately transcribed and understood, and natural voice responses can be provided through TTS, making the interaction more similar to human conversation. In the future, with the development of multimodal AI models, the integration of ASR and TTS will enable devices to more intelligently perceive and understand user emotions and intentions.

• Facilitating cross-lingual and multimodal translation and transcription: In cross-language applications, ASR and TTS can be combined to create a speech translation system. For example, after speech is transcribed into text using ASR, the text can be converted into another language by a translation module, and TTS can then generate speech output in the target language. This process supports cross-language communication and can also help people with hearing impairments communicate more easily between different languages.

By integrating speech-to-text and text-to-speech into a multimodal AI system, not only is the application scenario of the two technologies expanded, but a smarter and more flexible two-way interactive experience can also be built, providing more humane services for voice assistants, translation, customer service and other fields.
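The closed loop described above (speech in, text in the middle, speech out) can be sketched in a few lines. This is only a minimal illustration, not how Lobechat or any particular product implements it: `VoiceLoop`, `run_turn`, and the three `fake_*` stages are hypothetical names, and the stages are toy functions standing in for a real ASR model such as Whisper, a chat model, and a TTS service.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stage signatures; a real system would wrap an ASR model
# (e.g. Whisper), a chat model, and a TTS service behind these callables.
Transcriber = Callable[[bytes], str]   # audio in   -> user text
Responder = Callable[[str], str]       # user text  -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio out
Translator = Callable[[str], str]      # text -> text in the target language


@dataclass
class VoiceLoop:
    transcribe: Transcriber
    respond: Responder
    synthesize: Synthesizer
    translate: Optional[Translator] = None  # only used for speech translation

    def run_turn(self, audio_in: bytes) -> bytes:
        """One round trip: speech -> text -> reply -> speech."""
        user_text = self.transcribe(audio_in)
        reply_text = self.respond(user_text)
        if self.translate is not None:
            reply_text = self.translate(reply_text)
        return self.synthesize(reply_text)


# Toy stages so the sketch runs without any model or network access.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")       # pretend the bytes are already words


def fake_chat(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")        # pretend the bytes are audio


loop = VoiceLoop(transcribe=fake_asr, respond=fake_chat, synthesize=fake_tts)
print(loop.run_turn(b"hello"))         # prints b'You said: hello'
```

For the speech translation case, the same structure applies with `respond` set to a function that returns its input unchanged and `translate` set to the translation step, so the turn becomes ASR, then translation, then TTS.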

Text-generated music

In the "text-to-music" area of multimodal AI, the model generates music that matches a specific mood and style based on text or a description supplied by the user. This technology has applications in music creation, game sound effects, film soundtracks, and other scenarios, providing personalized music generation for various occasions. Representative models include OpenAI's Jukebox, Google's MusicLM, and AIVA. The following is a detailed introduction to this field.

1. Main Models and Technologies

• Jukebox: The Jukebox model, launched by OpenAI, is a neural network-based music generation tool that can generate music in a variety of styles, including pop, rock, and jazz, conditioned on genre, artist, and lyrics. Jukebox can generate songs with lyrics and even imitate the style of a specific singer.

• MusicLM: Google's MusicLM model generates highly complex, emotionally engaging music from text descriptions. It can generate music clips of varying lengths and offers a high degree of control over musical style and emotional expression, making it suitable for diverse creative needs.

• Bark: The Bark model, developed by Suno, is a generative audio tool that generates a variety of audio types, including music, vocals, and environmental sound effects, based on text descriptions. Bark can generate musical clips with emotion and style, and supports multilingual speech generation, making it suitable for creating sound effects and rich musical content for different scenarios.

• AIVA: AIVA is an AI model primarily used to generate background music and soundtracks. It can generate classical music and light music based on input mood and style descriptions. AIVA is widely used to generate film soundtracks and game music.

Note: In the field of text-to-music generation, OpenAI's Jukebox and Google's MusicLM are still experimental and not commercially available. Bark and AIVA are among the few music-generating AI tools that have been commercialized. AIVA is a professional model focused on commercial music generation, suitable for commercial music projects such as advertising and film and television soundtracks. Bark leans toward diversified audio generation. Although it can also be used in commercial scenarios, its use in professional music creation is relatively limited, and it mainly serves as a sound-effect and voice generation tool.

2. Technical Principle

• Sequence-to-Sequence (Seq2Seq) Model: Most text-to-music models use a Seq2Seq architecture, encoding textual descriptions into feature vectors and then decoding them into corresponding audio sequences. These models leverage large amounts of text and music data to learn the relationships between features like emotion and style, thereby transforming textual content into music.

• Autoregressive generative models: Jukebox and MusicLM generate music autoregressively, predicting the audio one step at a time, with each step conditioned on everything generated so far. In practice the model predicts a sequence of compressed audio tokens that is then decoded into a waveform. This method gives finer control over the continuity and details of the music, making the generated music more consistent with human auditory habits.

• Emotional characteristics and music mapping: To achieve consistency between music and emotion, the model identifies the emotion of the input text and generates music that expresses the corresponding emotion. The model uses emotion classifiers or style classifiers to help match text with appropriate musical elements, generating music that aligns with the theme.
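The autoregressive idea in the list above can be shown with a toy loop: sample one token, append it, and condition the next step on the extended history. This is a sketch under loud assumptions, not the Jukebox or MusicLM implementation: `toy_weights` is a hypothetical stand-in for a trained network, the eight-entry vocabulary is arbitrary, and the text conditioning that a real text-to-music model adds is left out.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical stand-in for a trained network: given the tokens so far,
# return one sampling weight per vocabulary entry for the next token.
NextTokenWeights = Callable[[Sequence[int]], List[float]]

VOCAB_SIZE = 8  # toy vocabulary, e.g. 8 pitches or 8 audio codebook entries


def toy_weights(history: Sequence[int]) -> List[float]:
    """Favor tokens close to the previous one, so the output moves stepwise."""
    last = history[-1]
    return [1.0 / (1 + abs(token - last)) for token in range(VOCAB_SIZE)]


def generate(prompt: Sequence[int], steps: int,
             weights_fn: NextTokenWeights, seed: int = 0) -> List[int]:
    """Autoregressive loop: sample a token, append it, then condition on it."""
    rng = random.Random(seed)  # fixed seed makes the toy run repeatable
    tokens = list(prompt)
    for _ in range(steps):
        weights = weights_fn(tokens)
        next_token = rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]
        tokens.append(next_token)
    return tokens


sequence = generate(prompt=[3], steps=12, weights_fn=toy_weights, seed=42)
print(sequence)  # 13 tokens: the prompt token followed by 12 sampled ones
```

Because every step looks at the full history, an early choice influences everything after it, which is where the continuity described above comes from.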

3. Application Scenarios

• Film, TV and game soundtracks: Text-generated music is widely used to generate background music for film, television, and games. Directors and developers can generate contextually appropriate music from simple text descriptions, significantly reducing the time it takes to create a soundtrack.

• Personalized music creation: Personalized music generated by AI can be used on social media and short video platforms. Creators can generate exclusive background music based on text descriptions to enhance the appeal of the content.

• Healing and meditation music: In the fields of health and psychological therapy, text-generated music can generate soothing music tailored to the needs of meditation and healing scenarios, helping users relax and meditate. Users input a specific emotion or theme description, and the system generates matching music, helping to enhance therapeutic effectiveness.

• Generate sound effects in real time: In interactive applications, text-generated music can be used to generate music that evokes specific emotions based on real-time input. For example, in virtual reality (VR) or augmented reality (AR) experiences, AI can generate dynamic music based on the user's current mood or context, enhancing immersion.

Future development directions for text-generated music include a high degree of control over specific styles and details, making AI-generated music more in line with the needs of creators, and supporting multilingual descriptions to further expand its application scenarios.

AI multimodal UI tool: chatgpt-web-midjourney-proxy

The troubles of using text generation tools

For the multimodal generation needs mentioned earlier in this article (text-to-image, text-to-video, text-to-speech, and text-to-music; text-to-code also exists, but I did not cover it because it is not a multimodal need), the biggest problem is actually how to use them conveniently.

Some of these needs can be met through a commonly used local LLM UI such as Lobechat, for example the text-to-image (OpenAI's dall-e-3 model) and text-to-speech (Azure TTS) features mentioned earlier in this article. Others require dedicated tools. For text-to-image, if dall-e-3 and Stable Diffusion are not enough, you need Midjourney. Similarly, text-to-video requires models like Luma, Runway, and Kling (可灵), each of which has to be used the way its vendor officially prescribes. Using a single model this way is fine, but juggling several quickly becomes confusing and tedious.

So, can all of these generative models be used from a single UI? Yes: that is what "chatgpt-web-midjourney-proxy" is for.

Introduction to chatgpt-web-midjourney-proxy

The purpose of this project is to integrate ChatGPT with Midjourney to generate pictures from text (project address: github.com/Dooy/chatgpt-web-midjourney-proxy). It mainly serves as a proxy interface, allowing users to access multiple API services through a unified UI, including OpenAI's ChatGPT, Midjourney's image generation, and other models (such as Suno's audio generation and Runway's video generation). In this way, users get a customized ChatGPT front end that supports multiple back ends, such as Midjourney for image generation and other APIs for multimodal tasks (such as audio to audio, image to video, and text to video):

It supports multiple text-to-image models such as MidJourney, dall-e, and IdeoGram. In addition to drawing, MidJourney also supports face swapping and image blending:

The project provides flexible deployment methods, including Docker and serverless options, suitable for personal or server environments. In addition, it also supports Vercel deployment to facilitate cloud configuration.

chatgpt-web-midjourney-proxy deployment

Deployment in Docker mode

The docker run command format is as follows:

docker run --name chatgpt-web-midjourney-proxy -d --restart=always \
-p 6015:3002 \
ydlhero/chatgpt-web-midjourney-proxy

In the above command, you can also pass -e parameters to specify the API address and key of a multimodal service provider (or of a proxy built on top of one). For example, add the following lines before the image name to point to an MJ proxy address:

-e MJ_SERVER=https://your-mj-server:6013 \
-e MJ_API_SECRET=your-mj-api-secret \

If you want to build your own MJ_SERVER proxy, you can create it with a docker run command in the following format:

docker run --name mj6013 -d --restart=always \
-p 6013:8080 \
-e mj.discord.guild-id=your-discord-server-id \
-e mj.discord.channel-id=your-discord-channel-id \
-e mj.queue.timeout-minutes=6 \
-e mj.api-secret=abc123456 \
-e mj.discord.user-token=************ \
novicezk/midjourney-proxy:2.5.5

Similar variables exist for the other services:

-e LUMA_SERVER=https://your-luma-server:8000 \
-e LUMA_KEY=your-luma-key \
-e SUNO_SERVER=https://your-suno-server:8000 \
-e SUNO_KEY=your-suno-key \

Note: The LUMA_SERVER and SUNO_SERVER proxies each have their own setup procedures, which I will not go into here. It is not necessary to configure these environment variables when creating the container; you can also set them from the UI after deployment.

Once the Docker deployment is complete, open http://host IP:6015 in a browser to access it.
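Since the -e variables are optional, it can be handy to assemble the docker run command from whichever values you actually have. The helper below is a small sketch of that idea: `build_docker_run` is a hypothetical name, while the image name, container name, ports, and variable names are the ones from the commands above. It only builds and prints the command; it does not start a container.

```python
import shlex
from typing import Dict, List, Optional

IMAGE = "ydlhero/chatgpt-web-midjourney-proxy"  # image name from the command above


def build_docker_run(host_port: int = 6015,
                     env: Optional[Dict[str, str]] = None) -> List[str]:
    """Return the argv list for docker run, adding -e only for non-empty values."""
    argv = [
        "docker", "run",
        "--name", "chatgpt-web-midjourney-proxy",
        "-d", "--restart=always",
        "-p", f"{host_port}:3002",
    ]
    for name, value in (env or {}).items():
        if value:  # skip unset back ends; they can be configured in the UI later
            argv += ["-e", f"{name}={value}"]
    argv.append(IMAGE)  # the image name must come after all options
    return argv


argv = build_docker_run(env={
    "MJ_SERVER": "https://your-mj-server:6013",
    "MJ_API_SECRET": "your-mj-api-secret",
    "SUNO_SERVER": "",  # left empty on purpose, so no -e flag is emitted
})
print(shlex.join(argv))  # shell-quoted command line, ready to copy and paste
```

To actually launch the container from Python, the same list can be passed to `subprocess.run(argv, check=True)`; keeping the command as a list avoids shell quoting problems with keys and URLs.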

Vercel one-click deployment

If you use Vercel, you can deploy with one click through the project's "Vercel one-click deployment" link. The relevant environment variables are described as follows:

Serverless Personal Desktop Installation

This method installs the app locally instead of deploying through Docker or Vercel (download link: Serverless Personal Desktop Download). The latest version is "v2.21.9":

Use of chatgpt-web-midjourney-proxy

Prerequisite knowledge: Third-party API provider: OpenAi-HK

In the previous article on using Lobechat (see: "Home Data Center Series Unlock the full potential of Lobechat: A complete guide from setup to actual use"), I mentioned a cost-effective third-party API provider: OhMyGPT. If you only need chat and text-to-code, OhMyGPT is completely sufficient.

However, when it comes to text-to-image generation, OhMyGPT only supports OpenAI's dall-e-3, which is a bit weak, not to mention text-to-video and text-to-music generation. Therefore, if you need more from text-to-image and video generation, OhMyGPT is not a good fit. In that case OpenAi-HK becomes a good choice: it is a third-party API provider that supports GPT4.0, dall-e-3, GPTs Multimodal, Claude, Midjourney, Flex, and Ideogram for drawing, Suno for music, Luma and Runway for video, and Viggle for dance (openai-hk official website address):

As can be seen from the two pictures above, OpenAi-HK does have great advantages. Not only does it support many types of multimodal models, but the key point is that it does not require a VPN and has a low top-up threshold. In addition, the balance does not expire; you can use it until it runs out, which is very friendly.

After topping up, note down your API key from the API KEY section of the console and choose the API address with the fastest network speed for you:

Note: In order to take screenshots for this article, I splurged a whopping 10 RMB without a second thought.

chatgpt-web-midjourney-proxy actual experience

Access and set up chatgpt-web-midjourney-proxy

I deployed it using Docker, so I access it at http://host IP:6015:

Enter the settings interface in the lower left corner:

In the "Overview" section, you can set the link and name of the avatar image below, as well as the language and background image of the work interface:

In the "Model" section, you can set the chat model and parameters. Generally, gpt-4o-mini is sufficient. The number of contexts, number of replies, and role settings can be set as needed (for normal chat needs, I use Lobechat, so these settings are not very useful to me):

The "server-side" setting is the most important and is the key to the normal completion of multimodal text generation requirements:

After completing these settings, you can officially use chatgpt-web-midjourney-proxy to start multimodal operations such as text generation.

Text to image

In "chatgpt-web-midjourney-proxy", the text-to-image models include Midjourney, dall-e-3 and IdeoGram.

DALL-E and Ideogram are AI models that generate images, but their design focus and application scenarios are different.

Developed by OpenAI, DALL-E can generate images in a variety of styles based on text descriptions, suitable for a wide range of needs, from real-world scenes to abstract art. However, DALL-E has certain limitations when generating specific, concrete characters (such as anime characters), primarily due to copyright and intellectual property considerations. To avoid overly similar characters, DALL-E tends to be conservative when generating such content, which may result in slightly blurred details or images that do not fully meet specific visual requirements.

Ideogram, on the other hand, is an AI tool that focuses on the fusion of text and images. It's ideal for generating image content containing text, such as text art, advertising design, or brand logos. Its strength lies in its ability to blend images and text, making it particularly effective for generating images containing clear text.

In contrast, Midjourney offers greater freedom in generating character images of specific styles, particularly those for anime or fictional characters with fantasy or personalized art styles. Midjourney can typically produce more stylized and detailed images, better matching the user's visual expectations when creating character images.

As you can see from the image below, there are a variety of options when selecting Midjourney:

The options of dall-e and IdeoGram are very simple in comparison:

IdeoGram:

Below, we use MidJourney as an example to generate a picture of "Android 18 wearing sexy clothes".

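First, find a reference image of Android 18, such as the one below:

Then generate the image with the following parameters:

By default, four images are generated. Click the U1, U2, U3, or U4 button below them to pick your favorite and generate it on its own.

Next, generate an image of Bulma with the same parameters (using her look from the first episode of Dragon Ball as the template):

This time the result looks like this:

Overall, generating from a template image like this is simple, and the results are quite satisfying. But what if I want a picture of Android 18 transformed into Super Saiyan 3? How do I tell the AI what Super Saiyan 3 is? It makes my head spin; I'll look into it when I have time.

Note 1: The interface has other parameters and options, such as "upload your own reference image" and "sref", that I haven't had time to explore yet. Feel free to try them if you're interested.

Note 2: MidJourney can also swap faces and blend images (mixing two images together). Taking face swapping as an example, suppose I swap faces between my goddess Vivian Chow (周慧敏) and Hong Jin-young (洪真英), the goddess of the slightly-chubby crowd, as follows:

Then:

The final result is decent; you could call her "洪慧敏" (Hong Huimin):

Also, all previously generated images are kept in the "Gallery" tab, as shown in the red box below:

Note: Requests to generate an image sometimes seem to fail. Submitting a couple more times usually works. I'm not sure whether this is caused by some rate limit between OpenAi-HK and MidJourney.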

Text-generated music

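This feature is implemented mainly through Suno's bark model (it is fixed and cannot be customized). There are two modes. One is description mode, which relies purely on text:

The other is custom mode, where you can upload a music sample:

The final generated music:

Have a listen; it's pretty decent: 快乐的星期天.mp3 ("Happy Sunday").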

Text-generated video

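There are four models by default: Luma, Runway, Pika, and Kling (可灵). I'll generate a video using Kling as an example:

The generated video isn't great: it looks somewhat stiff and fake. I don't know whether that's because I haven't got the hang of prompting. I won't upload the video; try it yourself if you're interested.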

dance

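This is similar to face swapping, except that here a character photo is used to replace the entire person in the video template:

I couldn't be bothered to upload the video, so here is a screenshot instead. It generated a 16-second video:

It looks very fake.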

Summarize

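Finally, there is also a "real-time voice conversation" module, which is really just Whisper and TTS combined. Lobechat can do this too, so I won't bother with it.

Overall, the most practical part of chatgpt-web-midjourney-proxy is text-to-image through Midjourney (otherwise, using MJ normally means wrestling with Discord). As for text-to-music, text-to-video, and dance, I feel the features chatgpt-web-midjourney-proxy offers are too basic and only good for playing around. If you really need them, you're better off using them the way the professional service providers officially recommend.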
