JP7240505B2

JP7240505B2 - Voice packet recommendation method, device, electronic device and program

Info

Publication number: JP7240505B2
Application number: JP2021538331A
Authority: JP
Inventors: 世強 丁; 迪 呉; 際洲 黄
Original assignee: Baidu Online Network Technology (Beijing) Co., Ltd.
Priority date: 2020-05-27
Filing date: 2020-11-10
Publication date: 2023-03-15
Anticipated expiration: 2040-11-10
Also published as: KR20210090273A; CN113746874B; SG11202107217VA; US20230075403A1; WO2021238084A1; EP3944592B1; EP3944592A4; EP3944592A1; CN113746874A; JP2022538702A

Description

The present invention claims priority to Chinese Patent Application No. 202010463398.8, filed with the China Patent Office on May 27, 2020, the entire contents of which are incorporated herein by reference.

The present invention relates to the technical field of data processing, for example, to intelligent search technology.

Currently, an electronic map can provide multiple voice packets, from which a user can select the one they need. Users typically choose a voice packet by auditioning the packets one by one, which is cumbersome and inefficient.

The following is a summary of the subject matter described in detail in this document. This summary is not intended to limit the scope of the claims.

The present invention provides a voice packet recommendation method, apparatus, device, and storage medium that are easier to operate and more efficient.

According to one aspect of the present invention, a voice packet recommendation method is provided, including:
selecting, for a user, at least one target display video from candidate display videos associated with voice packets, and taking the voice packet to which the target display video belongs as a candidate voice packet;
selecting, for the user, a target voice packet from the candidate voice packets based on attribute information of the candidate voice packets and attribute information of the target display video; and
recommending the target voice packet to the user.

According to another aspect of the present invention, a voice packet recommendation apparatus is provided, including:
a target display video selection module configured to select, for a user, at least one target display video from candidate display videos associated with voice packets, and to take the voice packet to which the target display video belongs as a candidate voice packet;
a target voice packet selection module configured to select, for the user, a target voice packet from the candidate voice packets based on attribute information of the candidate voice packets and attribute information of the target display video; and
a target voice packet recommendation module configured to recommend the target voice packet to the user.

According to yet another aspect of the present invention, an electronic device is provided, including:
at least one processor; and
a memory communicatively connected to the at least one processor,
wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the voice packet recommendation method according to the embodiments of the present invention.

According to a further aspect of the present invention, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are configured to cause a computer to perform the voice packet recommendation method according to the embodiments of the present invention.

In the embodiments of the present invention, at least one target display video is selected for a user from candidate display videos associated with voice packets, the voice packet to which the target display video belongs is taken as a candidate voice packet, a target voice packet is selected for the user from the candidate voice packets based on attribute information of the candidate voice packets and attribute information of the target display video, and the target voice packet is recommended to the user. The embodiments of the present invention make it more convenient for the user to obtain a voice packet and improve the efficiency of obtaining it.

The content described in this section is not intended to identify key or critical features of the embodiments of the present invention, nor to limit the scope of the present invention. Other features of the present invention will be readily understood from the following description.

Other aspects can be understood upon reading and understanding the drawings and the detailed description.

The drawings are provided for a better understanding of the present solution and do not limit the present invention.

A flowchart of a voice packet recommendation method according to an embodiment of the present invention.
A flowchart of another voice packet recommendation method according to an embodiment of the present invention.
A flowchart of another voice packet recommendation method according to an embodiment of the present invention.
A flowchart of another voice packet recommendation method according to an embodiment of the present invention.
A flowchart of another voice packet recommendation method according to an embodiment of the present invention.
A structural schematic diagram of a first neural network model according to an embodiment of the present invention.
A structural schematic diagram of a second neural network model according to an embodiment of the present invention.
A schematic diagram of the process of determining a user's persona tags according to an embodiment of the present invention.
A structural diagram of a voice packet recommendation apparatus according to an embodiment of the present invention.
A block diagram of an electronic device for implementing the voice packet recommendation method according to an embodiment of the present invention.

Exemplary embodiments of the present invention are described below with reference to the drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those skilled in the art should recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the present invention. Likewise, for clarity and brevity, descriptions of well-known functions and structures are omitted below.

The voice packet recommendation method and voice packet recommendation apparatus according to the embodiments of the present invention apply to the case where a voice packet is obtained while using an application that includes a voice announcement function. The voice packet recommendation method is executed by the voice packet recommendation apparatus, which is implemented in software, hardware, or a combination of the two, and is specifically provided in an electronic device.

FIG. 1 is a flowchart of a voice packet recommendation method according to an embodiment of the present invention. The method includes the following steps.

In S101, at least one target display video is selected for the user from candidate display videos associated with voice packets, and the voice packet to which the target display video belongs is taken as a candidate voice packet.

Here, a candidate display video associated with a voice packet includes at least one of the image, voice, and subtitles of the voice provider, and is configured to represent the image features and voice features of the voice provider in the voice packet. The image features include at least one of loli, mature older sister, uncle, IP (intellectual property) image, and the like. The voice features include sound quality characteristics, a voice style, or both. The sound quality characteristics include at least one of male, female, sweet, hoarse, and the like, and the voice style includes at least one of an announcer tone, humor, and the like.

Each voice packet is associated with at least one candidate display video. In one embodiment, the association between voice packets and candidate display videos can be stored in advance locally on the electronic device, on another storage device associated with the electronic device, or in the cloud. Accordingly, when needed, the target display video is looked up among the candidate display videos associated with the voice packets based on this association. In one embodiment, the target display video itself can be stored in advance locally on the electronic device, on another storage device associated with the electronic device, or in the cloud, and is retrieved once it has been found. For example, the video ID of the target display video can be looked up and the target display video retrieved based on that video ID.

In one preferred implementation of the embodiments of the present invention, the target display video can be selected for the user from the candidate display videos associated with voice packets based on the display videos that users similar to this user obtained when they acquired voice packets.

To reduce the amount of computation when selecting the target display video and to improve selection efficiency, in another preferred implementation of the embodiments of the present invention, the target display video may be selected for the user from the candidate display videos associated with voice packets based on the similarity between each candidate display video and the historical display videos the user obtained when acquiring voice packets.
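
The similarity-based selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say how similarity is computed, so the sketch assumes each video is already represented by a numeric feature vector and uses cosine similarity. The function names `cosine` and `select_by_history` and the `top_k` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_by_history(history_vecs, candidate_vecs, top_k=1):
    """Rank candidates by their best similarity to any historical video.

    history_vecs: list of feature vectors of videos the user obtained before.
    candidate_vecs: dict mapping candidate video ID -> feature vector.
    """
    scored = []
    for video_id, vec in candidate_vecs.items():
        best = max((cosine(vec, h) for h in history_vecs), default=0.0)
        scored.append((best, video_id))
    # Highest similarity first; ties broken by video ID for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [video_id for _, video_id in scored[:top_k]]


history = [[1.0, 0.0]]
candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(select_by_history(history, candidates, top_k=2))  # ['a', 'c']
```

Taking the maximum over the history is one possible aggregation; averaging would favor candidates close to the user's overall taste rather than to a single past video.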

So that large volumes of data can inform real-time results, in a further preferred implementation of the embodiments of the present invention, a machine learning model may be trained on sample users and their historical behavior data, and the trained model may then be used to select the target display video for the user from the candidate display videos associated with voice packets.

Accordingly, after at least one target display video has been selected for the user from the candidate display videos associated with voice packets, the voice packet to which the target display video belongs can be taken as a candidate voice packet. Because there is at least one voice packet, there is also at least one candidate display video associated with the voice packets, and therefore at least one candidate voice packet is finally determined. A target voice packet can then be selected from the at least one candidate voice packet.
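
A minimal sketch of how the stored association could turn the selected target display videos into candidate voice packets. The dictionary `VIDEO_TO_PACKET`, its IDs, and the function `candidate_packets` are hypothetical; the text only states that an association between voice packets and display videos is stored in advance.

```python
# Hypothetical pre-stored association: display video ID -> voice packet ID.
VIDEO_TO_PACKET = {
    "v1": "packet_a",
    "v2": "packet_a",
    "v3": "packet_b",
}


def candidate_packets(target_video_ids, video_to_packet):
    """Return the voice packets that the target videos belong to.

    Order follows the order of the target videos; duplicates and
    videos with no known packet are skipped.
    """
    packets = []
    for video_id in target_video_ids:
        packet = video_to_packet.get(video_id)
        if packet is not None and packet not in packets:
            packets.append(packet)
    return packets


print(candidate_packets(["v3", "v1", "v2"], VIDEO_TO_PACKET))
# ['packet_b', 'packet_a']
```

Because several display videos can belong to the same voice packet, the number of candidate voice packets may be smaller than the number of target display videos.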

In S102, a target voice packet is selected for the user from the candidate voice packets based on attribute information of the candidate voice packets and attribute information of the target display video.

Here, the attribute information of a candidate voice packet is at least one of user interaction data, voice packet description data, and the like. The user interaction data represents how the current user or other users have interacted with the candidate voice packet, where an interaction includes at least one of clicking, downloading, browsing, commenting, sharing, and the like. The voice packet description data represents basic attributes of the voice packet, for example at least one of voice characteristics, announcement characteristics, and image characteristics of the voice packet provider.

Here, the attribute information of the target display video includes video description data and voice packet association data. The video description data represents attributes of the video itself, for example at least one of the video type, the video source, and the like. The voice packet association data represents the relationship between the video and the voice packet, for example the similarity between the video and the voice packet.

In one preferred implementation of the embodiments of the present invention, a target voice packet can be selected for the user from the candidate voice packets using a ranking model, according to the attribute information of the candidate voice packets and the attribute information of the target display video. The ranking model may be an attribute model or a neural network model, and can be implemented with at least one of the pointwise, pairwise, or listwise approaches.
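
To make the pointwise case concrete, the sketch below scores each candidate voice packet independently and keeps the highest-scoring ones. It assumes a hand-written linear scorer in place of a trained ranking model; the feature names (`download_rate`, `video_similarity`), the weights, and the function names are all hypothetical.

```python
def pointwise_score(features, weights):
    """Linear score over named features; missing weights count as zero."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def select_target_packets(candidates, weights, top_k=1):
    """candidates: dict mapping packet ID -> feature dict combining packet
    attribute information and target display video attribute information."""
    ranked = sorted(
        candidates,
        key=lambda packet_id: pointwise_score(candidates[packet_id], weights),
        reverse=True,
    )
    return ranked[:top_k]


weights = {"download_rate": 0.5, "video_similarity": 0.5}
candidates = {
    "packet_a": {"download_rate": 0.2, "video_similarity": 0.9},
    "packet_b": {"download_rate": 0.6, "video_similarity": 0.1},
}
print(select_target_packets(candidates, weights))  # ['packet_a']
```

A pairwise or listwise model would instead be trained on relative orderings of packets, but it could be plugged in at the same place as `pointwise_score`.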

For example, when training the ranking model, training data can be constructed automatically from the user's operation behavior. Taking listwise as an example, the same user browses a large number of videos, and the ranking relationship among these videos can be determined from the user's interaction behavior with them and the degree of interaction. For example, the videos are ranked from high to low in the order "videos that converted into a download, videos that were clicked, videos that were commented on, videos that were browsed to the end, videos that were not browsed to the end, and videos that were barely browsed". A technician may also add to or modify the priority of videos in this ranking relationship according to need or experience, and the embodiments of the present invention do not limit this.
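
The automatic construction of a listwise training sample can be sketched as below. The priority order follows the text; the interaction labels (`"download"`, `"click"`, and so on), the event format, and the function name `build_listwise_sample` are assumptions made for illustration.

```python
# Lower rank = stronger interaction, following the order given in the text.
INTERACTION_RANK = {
    "download": 0,
    "click": 1,
    "comment": 2,
    "browse_complete": 3,
    "browse_partial": 4,
    "browse_minimal": 5,
}


def build_listwise_sample(user_events):
    """user_events: list of (video_id, interaction) pairs for one user.

    Returns video IDs ordered from strongest to weakest interaction,
    using each video's strongest recorded interaction.
    """
    best = {}
    for video_id, interaction in user_events:
        rank = INTERACTION_RANK[interaction]
        if video_id not in best or rank < best[video_id]:
            best[video_id] = rank
    return sorted(best, key=lambda video_id: (best[video_id], video_id))


events = [
    ("v1", "browse_partial"),
    ("v2", "download"),
    ("v1", "click"),
    ("v3", "comment"),
]
print(build_listwise_sample(events))  # ['v2', 'v1', 'v3']
```

Changing the priority of an interaction type only requires editing `INTERACTION_RANK`, which mirrors the note that a technician may adjust the ordering.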

Note that at least one target voice packet is selected for the user from the candidate voice packets. When at least two target voice packets are selected, they can additionally be ordered, for example by using the ranking model described above or by determining the order of the target voice packets randomly.

In S103, the target voice packet is recommended to the user.

By recommending the target voice packet to the user, a voice announcement service is provided to the user based on the target voice packet. When there are at least two target voice packets, they can be recommended to the user one after another, and the target voice packet that finally provides the voice announcement service is determined by the user's selection.

In the embodiments of the present invention, at least one target display video is selected for the user from candidate display videos associated with voice packets, the voice packet to which the target display video belongs is taken as a candidate voice packet, a target voice packet is selected for the user from the candidate voice packets based on attribute information of the candidate voice packets and attribute information of the target display video, and the target voice packet is recommended to the user. With this technical solution, the videos associated with voice packets serve as an intermediate medium for determining voice packets and the target voice packet is recommended automatically, so that instead of the user searching for voice packets, voice packets actively find the user. At the same time, because voice packets are determined through videos, the user does not need to audition voice packets repeatedly, which makes obtaining a voice packet more convenient and more efficient.

FIG. 2 is a flowchart of another voice packet recommendation method according to an embodiment of the present invention. The technical solution corresponding to this method is optimized and improved on the basis of the technical solutions described above.

In one embodiment, to complete the mechanism for determining the target display video, the operation "selecting, for the user, at least one target display video from candidate display videos associated with voice packets" is refined into "determining at least one target display video based on the degree of correlation between the user's persona tags and the classification tags of the candidate display videos associated with the voice packets".

The voice packet recommendation method shown in FIG. 2 includes the following steps.

In S201, at least one target display video is determined based on the degree of correlation between the user's persona tags and the classification tags of the candidate display videos associated with the voice packets.

Here, the user's persona tags represent attributes of the user, and may include, for example, at least one of sweet, kind, funny, mature older sister, and the like.

In one embodiment, the classification tags of a candidate display video may include an image tag that represents the image features of the voice provider (that is, the figure appearing in the video), for example at least one of loli, mature older sister, uncle, IP image, and the like. Alternatively, in one embodiment, the classification tags may include a sound quality tag that represents the characteristics of the voice provider's voice in the video, for example at least one of male, female, sweet, hoarse, and the like. Alternatively, in one embodiment, the classification tags may include a voice style tag that represents the voice announcement style in the video, for example at least one of an announcer tone, funny, and the like.

For example, the user's persona tags can be determined from the user's historical behavior data. The historical behavior data includes data on the user's interaction behavior with historical videos, where interaction behavior includes at least one of clicking, downloading, browsing, commenting, sharing, and the like.

In one embodiment, determining the user's persona tags from the user's historical behavior data may consist of using collaborative filtering together with the historical videos in the user's historical behavior data to determine the classification tags of those videos, and then performing a weighted ranking based on the interaction behavior types and their numbers of occurrences in the historical behavior data to obtain the user's persona tags.
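
The weighted ranking step can be sketched as follows. This covers only the weighting by interaction type and occurrence count; the collaborative filtering step is not shown, and the classification tags of the historical videos are assumed to be given. The weight values in `ACTION_WEIGHTS` and the function name `persona_tags` are hypothetical.

```python
from collections import defaultdict

# Assumed weights per interaction type; the text does not specify values.
ACTION_WEIGHTS = {
    "download": 3.0,
    "share": 2.5,
    "comment": 2.0,
    "click": 1.0,
    "browse": 0.5,
}


def persona_tags(history, video_tags, top_k=2):
    """history: list of (video_id, action) pairs, one per occurrence.
    video_tags: dict mapping video ID -> iterable of classification tags.

    Each occurrence adds its action weight to every tag of the video,
    so both the action type and the occurrence count affect the score.
    """
    scores = defaultdict(float)
    for video_id, action in history:
        for tag in video_tags.get(video_id, ()):
            scores[tag] += ACTION_WEIGHTS.get(action, 0.0)
    ranked = sorted(scores, key=lambda tag: (-scores[tag], tag))
    return ranked[:top_k]


history = [("v1", "download"), ("v2", "click"), ("v2", "browse"), ("v3", "click")]
video_tags = {"v1": ["sweet"], "v2": ["funny"], "v3": ["sweet", "announcer"]}
print(persona_tags(history, video_tags))  # ['sweet', 'funny']
```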

In one preferred implementation of the embodiments of the present invention, the classification tags of the candidate display videos can be added by manual annotation.

To determine the classification tags of the candidate display videos more efficiently and to reduce labor costs, in another preferred implementation of the embodiments of the present invention, the classification tags of a candidate display video can be determined by extracting images from the candidate display video, inputting the extracted images into a pre-trained multi-class classification model, and determining at least one classification tag of the candidate display video from the model's output. The multi-class classification model may be a neural network model.

A video has classification tags in different dimensions, such as image tags, sound quality tags, and voice style tags. Each dimension usually corresponds to multiple tag values, and different videos may also correspond to multiple tag values, so determining the classification tags of the candidate display videos amounts to performing a multi-class classification task.

To process the multi-class classification tasks in batches, the present invention uses at least one image extracted from the candidate display video as the basis for determining classification tags: each extracted image is input into the pre-trained multi-class classification model to obtain the probability of each tag value in the different dimensions, and at least one classification tag of the candidate display video is determined from these probabilities. In one embodiment, the tag values selected as classification tags of the candidate display video can be a set number of tag values, the tag values whose probability exceeds a set probability threshold, or a set number of tag values whose probability exceeds the set probability threshold. The set number and the set probability threshold are chosen by a technician according to need or experience, or determined through repeated testing.
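
The three selection rules (a set number of tags, a probability threshold, or both) can be sketched as one function. The sketch assumes the per-image model outputs have already been combined into a single probability per tag value; the function name `select_tags` and the parameter names `max_count` and `min_probability` are hypothetical.

```python
def select_tags(probabilities, max_count=None, min_probability=None):
    """probabilities: dict mapping tag value -> probability.

    max_count: keep at most this many of the most probable tags.
    min_probability: keep only tags whose probability exceeds this value.
    Either limit may be omitted; when both are given, both apply.
    """
    ranked = sorted(probabilities.items(), key=lambda item: (-item[1], item[0]))
    if min_probability is not None:
        ranked = [(tag, p) for tag, p in ranked if p > min_probability]
    if max_count is not None:
        ranked = ranked[:max_count]
    return [tag for tag, _ in ranked]


probs = {"sweet": 0.8, "female": 0.7, "hoarse": 0.1, "funny": 0.55}
print(select_tags(probs, max_count=2))                         # ['sweet', 'female']
print(select_tags(probs, min_probability=0.5))                 # ['sweet', 'female', 'funny']
print(select_tags(probs, max_count=2, min_probability=0.75))   # ['sweet']
```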

For example, the multi-class classification model includes a feature extraction layer and an output layer. The feature extraction layer is configured to extract features from the input images, and the output layer is configured to determine classification tags based on the extracted features.

In one preferred implementation of the embodiments of the present invention, to determine classification tags more efficiently, the model parameters of the multi-class classification model can be shared across the determination of the different classification tags. For example, when the classification tags include at least two types, the multi-class classification model can have one classifier per tag type to determine the tag values of that type, while the network parameters of the feature extraction layer are shared. In this way, the features extracted while determining different classification tags reinforce one another, common features are extracted, and the relevance and accuracy of the determined classification tags improve to some extent.
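
A minimal sketch of the shared-parameter idea: one feature extraction layer whose output feeds a separate classifier for each tag type. It is written in plain Python for readability and is not a trained model; the weight matrices, the ReLU activation, the softmax output, and all names (`matvec`, `softmax`, `forward`, the head names) are assumptions for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(image_features, trunk, heads):
    """trunk: weights of the shared feature extraction layer.
    heads: dict mapping tag type -> weights of that type's classifier.

    The shared features are computed once and reused by every head.
    """
    shared = [max(0.0, h) for h in matvec(trunk, image_features)]  # ReLU
    return {name: softmax(matvec(weights, shared)) for name, weights in heads.items()}


# Made-up weights: a 2-unit trunk, a 2-class image head, a 3-class voice head.
trunk = [[1.0, 0.0], [0.0, 1.0]]
heads = {
    "image": [[1.0, 0.0], [0.0, 1.0]],
    "voice_quality": [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]],
}
outputs = forward([2.0, 0.0], trunk, heads)
print(sorted(outputs))  # ['image', 'voice_quality']
```

In training, gradients from every head would flow into the same trunk weights, which is the mechanism by which the tag types can reinforce one another.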

In the training stage of the multi-class classification model, a pre-built neural network model can be trained on sample images extracted from sample videos and on sample classification tags to obtain the multi-class classification model. The sample classification tags can be produced by manual annotation.

In the training sample preparation stage of the multi-class classification model, determining the sample classification tags of the sample videos by manual annotation is time-consuming and laborious. To reduce the labor and time costs of preparing training samples, to improve preparation efficiency, to solve the cold-start problem, and to expand the volume of training data, in another preferred implementation of the embodiments of the present invention, the sample classification tags of the sample videos can be generated during the training stage by transferring data associated with the sample videos instead of by manual annotation. For example, the text description of a sample video, the personas of the users who watch the sample video, or both, can be taken as the sample classification tags of the sample video, and the pre-built neural network model can be trained on the sample images extracted from the sample video and these sample classification tags to obtain the multi-class classification model.
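
One way the transfer of associated data into sample classification tags might look is sketched below. The text does not describe the mapping, so this sketch assumes a fixed tag vocabulary, simple word matching on the text description, and a minimum share of viewers carrying a persona tag; `TAG_VOCABULARY`, `weak_labels`, and `min_viewer_share` are hypothetical.

```python
from collections import Counter

# Assumed closed vocabulary of tag values the classifier should learn.
TAG_VOCABULARY = {"sweet", "funny", "announcer", "hoarse"}


def weak_labels(description, viewer_personas, min_viewer_share=0.5):
    """Derive sample classification tags without manual annotation.

    description: text description of the sample video.
    viewer_personas: list of persona tag lists, one per viewing user.
    A persona tag is transferred when at least min_viewer_share of the
    viewers carry it.
    """
    words = {word.strip(".,!?").lower() for word in description.split()}
    labels = words & TAG_VOCABULARY
    counts = Counter(tag for persona in viewer_personas for tag in set(persona))
    for tag, count in counts.items():
        if tag in TAG_VOCABULARY and count / len(viewer_personas) >= min_viewer_share:
            labels.add(tag)
    return sorted(labels)


print(weak_labels(
    "A sweet voice for navigation!",
    [["funny"], ["funny", "hoarse"], ["sweet"]],
))  # ['funny', 'sweet']
```

Labels produced this way are noisier than manual annotation, which is the usual trade-off for avoiding the annotation cost.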

For example, the degree of correlation between the user's persona tags and the classification tags of the candidate display videos associated with the voice packets is determined, the candidate display videos are ranked by their correlation values, and at least one candidate display video is determined as the target display video according to the ranking.
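
The ranking by correlation can be sketched as follows. The text does not define the correlation measure, so the sketch assumes Jaccard overlap between the two tag sets; the function names `jaccard` and `select_target_videos` and the `top_k` parameter are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two tag collections, in [0, 1] (assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_target_videos(persona, candidate_tags, top_k=1):
    """persona: the user's persona tags.
    candidate_tags: dict mapping candidate video ID -> classification tags.

    Returns the top_k candidate video IDs, most correlated first.
    """
    ranked = sorted(
        candidate_tags,
        key=lambda video_id: (-jaccard(persona, candidate_tags[video_id]), video_id),
    )
    return ranked[:top_k]


persona = ["sweet", "funny"]
candidates = {
    "v1": ["sweet", "female"],
    "v2": ["sweet", "funny"],
    "v3": ["hoarse"],
}
print(select_target_videos(persona, candidates, top_k=2))  # ['v2', 'v1']
```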

In one embodiment, the user's persona tags, the classification tags of the candidate display videos, or both may be stored in advance locally on the electronic device or in a storage device associated with the electronic device, and retrieved when needed. Alternatively, in one embodiment, the user's persona tags, the classification tags of the candidate display videos, or both may be determined in real time, in the course of determining the target display video, using at least one of the approaches described above. Accordingly, the degree of correlation is determined from the retrieved or determined persona tags of the user and classification tags of the candidate display videos associated with the voice packets, and the target display video is then selected based on the degree of correlation.

In S202, the voice packet to which the target display video belongs is taken as a candidate voice packet.

In S203, a target voice packet is selected for the user from the candidate voice packets based on the attribute information of the candidate voice packets and the attribute information of the target display video.

In S204, the target voice packet is recommended to the user.

This embodiment of the present invention refines the operation of selecting at least one target display video for the user from the candidate display videos associated with the voice packets into determining at least one target display video based on the degree of correlation between the user's persona tags and the classification tags of the candidate display videos associated with the voice packets. By using the user's persona tags and the classification tags of the candidate display videos as reference factors when selecting the target display video, this technical solution selects target display videos that better match the user's interests and lays the foundation for the match between the subsequently selected target voice packet and the user.

FIG. 3 is a flowchart of another voice packet recommendation method according to an embodiment of the present invention. The technical solution corresponding to this method is optimized and improved on the basis of the technical solutions described above.

In one embodiment, to complete the mechanism for building the association between voice packets and candidate display videos, the following is added when the voice packet recommendation method is performed: "determining initial display videos of the voice packet, and determining the candidate display videos associated with the voice packet based on the video source priority of each initial display video."

In one embodiment, to complete the mechanism for building the association between voice packets and candidate display videos, the following is added when the voice packet recommendation method is performed: "determining initial display videos of the voice packet, and determining the candidate display videos associated with the voice packet based on the similarity between each initial display video and the voice packet."

The voice packet recommendation method shown in FIG. 3 includes the following steps.

In S301, initial display videos of the voice packet are determined.

In one preferred implementation of the embodiments of the present invention, the initial display video of a voice packet may be generated by having the voice packet provider record a video directly. Because voice packet providers know the stylistic characteristics of their own voice packets best, they can record videos that better highlight those characteristics, so that the initial display video matches the voice packet more closely.

To improve the efficiency of generating initial display videos and reduce the labor and material costs involved, in another preferred implementation of the embodiments of the present invention, the promotional text of the voice packet may be determined based on a promotional image of the voice packet provider; promotional audio and promotional subtitles may be generated from the promotional text based on the acoustic synthesis model of the voice packet provider; and the initial display video may be generated from the promotional image, the promotional audio, and the promotional subtitles.

For example, the promotional text of the voice packet may be determined based on the voice packet provider shown in the promotional image; for instance, the introduction of the voice packet provider is used as the promotional text. Based on the acoustic synthesis model of the voice packet provider, promotional audio is generated from the promotional text, and promotional subtitles corresponding to the promotional audio are generated. To give the promotional audio and subtitles a stronger advertising effect for the voice packet, the promotional subtitles may instead be generated from a pre-constructed tagline template, and the promotional audio corresponding to those subtitles may be synthesized based on the acoustic synthesis model of the voice packet provider. This simulates the voice of the voice packet provider and yields promotional subtitles read aloud in the provider's voice.

Here, the tagline template may be constructed by technical staff according to need or promotional experience. For example, for a voice packet used with an electronic map, the following tagline template may be adopted: "(profile) Welcome to use my voice packet; (person name) will accompany you on a safe journey."
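
Filling such a template can be sketched as follows. The English wording of the template, the slot names `profile` and `name`, and the function `build_promotional_subtitle` are assumptions for illustration; the synthesis of the matching promotional audio is not shown.

```python
# Hypothetical tagline template with two slots, mirroring the example above.
TAGLINE_TEMPLATE = (
    "{profile} Welcome to use my voice packet. "
    "{name} will accompany you on a safe journey."
)


def build_promotional_subtitle(profile: str, name: str) -> str:
    """Fill the template slots with the provider's profile and name."""
    return TAGLINE_TEMPLATE.format(profile=profile.strip(), name=name.strip())


subtitle = build_promotional_subtitle(
    "I am A-chan, your navigation voice.", "A-chan"
)
```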

It can be understood that generating videos by the templated production approach described above requires no video recording, which improves video generation efficiency and reduces the labor and material costs of video generation.

To improve the efficiency of generating initial display videos and reduce the labor and material costs involved, in yet another preferred implementation of the embodiments of the present invention, video search words may be constructed from voice packet provider information, and videos of the voice packet provider may be retrieved with those search words to serve as the initial display videos.

Here, the voice packet provider information includes descriptive information about the characteristics of the voice packet provider, such as voice characteristics (for example sweet, hoarse, or kind), and may further include the announcement style (for example humorous or funny).
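
A minimal sketch of constructing search words from provider information is shown below. The template string `"{name} {trait} video"`, the function name `build_search_words`, and the sample traits are hypothetical; submitting the search words to a search engine is outside the sketch.

```python
from typing import Iterable, List


def build_search_words(provider_name: str,
                       traits: Iterable[str],
                       template: str = "{name} {trait} video") -> List[str]:
    """Combine the provider name with each trait into one search word per trait."""
    return [template.format(name=provider_name, trait=trait) for trait in traits]


# Voice characteristics and announcement style serve as traits.
search_words = build_search_words("A-chan", ["sweet", "kind", "humorous"])
```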

It can be understood that retrieving videos related to the voice packet provider information by mining the entire network requires no video recording, which improves video generation efficiency and reduces the labor and material costs of video generation.

In S302, the candidate display videos associated with the voice packet are determined based on the video source priority of each initial display video, the similarity between each initial display video and the voice packet, or both.

In one embodiment, video source priorities are preset for the different video sources, so for initial display videos from different sources, the candidate display videos associated with the voice packet can be determined based on video source priority. The video source priority represents the relevance between the voice packet and a candidate display video: the higher the priority, the greater the relevance. It can be understood that introducing the video source priority ensures the correlation between the voice packet and the candidate display videos, lays the foundation for the subsequent selection of voice packets, and safeguards the accuracy of the match between the target voice packet recommendation and the user.

For example, the video sources may include at least one of the following: recorded by the voice packet provider, produced from a template, and mined from the entire network. The video source priorities may be set by technical staff according to need or experience. When the video sources of the initial display videos change, technical staff may correspondingly edit the video sources in the video source priorities and adjust the priority order of the video sources according to need or experience. A change of video sources may include adding or removing a video source; correspondingly, the edit may be the addition or removal of a video source.

For example, when the video sources include videos recorded by the voice packet provider, videos produced from a template, and videos mined from the entire network, the configured video source priority may be, from highest to lowest: recorded by the voice packet provider, produced from a template, mined from the entire network.
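
Priority-based selection can be sketched as follows. The priority order follows the example above; the source identifiers, the function `select_by_source_priority`, and the `keep` parameter are assumptions for illustration.

```python
from typing import List, Tuple

# Earlier position means higher priority. The order follows the example in
# the text; the identifiers themselves are hypothetical.
SOURCE_PRIORITY = ["provider_recorded", "templated", "web_mined"]


def select_by_source_priority(initial_videos: List[Tuple[str, str]],
                              keep: int) -> List[str]:
    """Return the IDs of the `keep` videos whose sources rank highest.

    `initial_videos` holds (video_id, source) pairs. Python's sort is stable,
    so videos from the same source keep their input order.
    """
    rank = {source: i for i, source in enumerate(SOURCE_PRIORITY)}
    ordered = sorted(initial_videos, key=lambda item: rank[item[1]])
    return [video_id for video_id, _ in ordered[:keep]]


videos = [("v1", "web_mined"), ("v2", "provider_recorded"),
          ("v3", "templated"), ("v4", "web_mined")]
candidates = select_by_source_priority(videos, keep=2)
```

Under this sketch, adding or removing a video source amounts to editing the `SOURCE_PRIORITY` list.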

In one embodiment, for initial display videos from the same source or from different sources, the similarity between each initial display video and the voice packet may also be determined, and the candidate display videos associated with the voice packet may be determined based on that similarity. It can be understood that introducing the similarity helps build the association between the voice packet and the candidate display videos, ensures the correlation between them, lays the foundation for the subsequent selection of voice packets, and safeguards the accuracy of the match between the target voice packet recommendation and the user.

For example, the cosine similarity between the voice of the voice packet and each initial display video may be calculated by means of a neural network. The initial display videos are sorted by cosine similarity, and those that fall within a set quantity threshold, satisfy a set quantity condition, or both are selected as the candidate display videos associated with the voice packet. The set quantity threshold or set quantity condition may be set by technical staff according to need or empirical values.
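
The sorting and selection step can be sketched as follows, assuming the neural network has already produced one feature vector for the voice packet and one per initial display video. The function names, the `top_n` and `min_similarity` parameters, and the toy vectors are assumptions; the text does not define the set quantity threshold or condition more precisely.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_candidates(voice_vec: Sequence[float],
                      video_vecs: Dict[str, Sequence[float]],
                      top_n: Optional[int] = None,
                      min_similarity: Optional[float] = None) -> List[str]:
    """Sort videos by similarity to the voice packet, then apply the limits."""
    scored = [(video_id, cosine_similarity(voice_vec, vec))
              for video_id, vec in video_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    if min_similarity is not None:
        scored = [item for item in scored if item[1] >= min_similarity]
    if top_n is not None:
        scored = scored[:top_n]
    return [video_id for video_id, _ in scored]


# Toy feature vectors standing in for neural network outputs.
voice = [1.0, 0.0]
initial_videos = {"v1": [1.0, 0.0], "v2": [0.0, 1.0], "v3": [1.0, 1.0]}
top_two = select_candidates(voice, initial_videos, top_n=2)
```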

In the training stage of the neural network, a training corpus may be constructed by manual labeling to obtain sample voice packets and the positive and negative sample videos corresponding to each sample voice packet. The neural network is then trained on the training corpus to optimize and adjust its network parameters.

In one preferred implementation of the embodiments of the present invention, the voice packets and their associated candidate display videos may be stored in association with each other, locally on the electronic device or in another storage device associated with the electronic device. To improve storage efficiency, the association between voice packets and candidate display videos may be stored in key-value form. In one embodiment, a forward link may be used, with the voice packet ID as the key and the related information of the candidate display videos as the value. Alternatively, in one embodiment, an inverted link may be used, with the tag information of the videos as the key and the voice packet ID as the value.

To make it easier to obtain the attribute information of the target display video later, when the forward link is used for storage, the attribute information of the candidate display videos may be stored in the value as their related information.

To further ensure the relevance between voice packets and candidate display videos while reducing the amount of computation needed to build the association, in one embodiment the initial display videos may first be pre-screened based on the video source priority of each initial display video, and the pre-screened initial display videos may then be screened again based on their similarity to the voice packet to obtain the candidate display videos associated with the voice packet.

In S303, at least one target display video is selected for the user from the candidate display videos associated with the voice packets, and the voice packet to which the target display video belongs is taken as a candidate voice packet.

In S304, a target voice packet is selected for the user from the candidate voice packets based on the attribute information of the candidate voice packets and the attribute information of the target display video.

In S305, the target voice packet is recommended to the user.

In the course of voice packet recommendation, this embodiment of the present invention adds the determination of initial display videos of the voice packet, and determines the candidate display videos associated with the voice packet based on the video source priority of each initial display video, the similarity between each initial display video and the voice packet, or both. This technical solution completes the mechanism for building the association between voice packets and candidate display videos, and lays the foundation for the subsequent selection of target display videos and the stepwise selection of candidate voice packets and target voice packets. At the same time, screening the initial videos by video source priority, by the similarity between video and voice packet, or by both yields the candidate display videos associated with the voice packet, ensures the correlation between the voice packet and the candidate display videos, and safeguards the accuracy of the match between the target voice packet recommendation and the user.

FIG. 4 is a flowchart of another voice packet recommendation method according to an embodiment of the present invention. The technical solution corresponding to this method is optimized and improved on the basis of the technical solutions described above.

In one embodiment, to complete the target voice packet recommendation mechanism, "recommending the target voice packet to the user" is refined into "recommending the target voice packet to the user by means of the target display video associated with the target voice packet."

The voice packet recommendation method shown in FIG. 4 includes the following steps.

In S401, at least one target display video is selected for the user from the candidate display videos associated with the voice packets, and the voice packet to which the target display video belongs is taken as a candidate voice packet.

In S402, a target voice packet is selected for the user from the candidate voice packets based on the attribute information of the candidate voice packets and the attribute information of the target display video.

In S403, the target voice packet is recommended to the user by means of the target display video associated with the target voice packet.

It can be understood that presenting the voice packet by video lets the user grasp the characteristics of the target voice packet more intuitively and comprehensively, strengthens the user's impression of the target voice packet, and improves the user's selection efficiency. Providing information to the user by video also makes the characteristics of the voice packet easier to take in, which improves the user's browsing and usage experience.

To make it easier for the user to download the target voice packet and to shorten the download process, a download link for the target voice packet may be added to the target display video. The download link may be presented as a website address or as a two-dimensional code carrying the website information.

In one embodiment, when there are at least two target display videos, the videos may be played in sequence by swipe switching, which makes operation easier for the user.

In one embodiment, to further enhance the interactivity of the videos, share, like, and comment functions may be exposed in the target display video. This shortens the steps needed for the user to interact with the video or with other users, increases user engagement, and improves the efficiency with which the video spreads among users.

FIG. 5A is a flowchart of another voice packet recommendation method according to an embodiment of the present invention. The technical solution corresponding to this method provides one preferred implementation on the basis of the technical solutions described above.

The voice packet recommendation method shown in FIG. 5A includes two stages: generation of voice packet videos, and storage of voice packet videos together with personalized recommendation of voice packets.

1. Generation of voice packet videos

a. Generation of initial videos
Voice packet videos come mainly from three sources: professionally produced, mined from the entire network, and produced from a template. The details are as follows.

Professionally produced: The initial video is produced mainly by having the voice packet provider record a video. Voice packet providers know the characteristics of their own voice packets (timbre, style, and so on) best, so they record videos that highlight those characteristics. Take the production of a voice packet video for A-chan as an example. If the voice packet features the sweet, kind voice of a pretty young woman, the video can faithfully convey the characteristics of the voice packet through a sweet outfit and a few affectionate lines (for example, "Brother, go all the way into my heart and we will be even closer").

Mined from the entire network: Videos are mined mainly by constructing keywords. Again taking A-chan's voice packet video as an example, search words such as "A-chan's kind video" and "A-chan's sweet video" are constructed automatically from a template, and a search engine is queried with these search words to obtain a large number of initial videos.

Produced from a template: Videos are produced mainly by fusing related pictures with lines announced in the voice of the voice packet. Still taking A-chan's voice packet video as an example, promotional subtitles are generated by filling A-chan's profile into the tagline template, for example "(profile) Welcome to use my voice packet; (person name) will accompany you on a safe journey." Promotional audio corresponding to the promotional subtitles is synthesized based on A-chan's acoustic synthesis model, and the initial video is produced from the promotional subtitles, the promotional audio, and A-chan's personal photo.

b. Association between voice packets and videos
The approaches above produce a large number of initial videos. These must be sorted by their relevance to the voice packet, and at least one initial video must be selected as a candidate video according to the sorting result. The specific approach is as follows.

a) Selection among videos from different video sources
Priority rules may be defined in advance to specify the priority of videos from different sources. For example, the priority may be, from highest to lowest: professionally produced, produced from a template, and mined from the entire network. At least one initial video is then selected as a candidate video based on video source priority.

b) Selection among videos from the same source
The cosine similarity between the audio of the voice packet and each video is calculated, mainly by means of a first neural network. The cosine similarities are sorted, and at least one initial video is selected as a candidate video according to the sorting result.

Referring to the structural diagram of the first neural network model shown in FIG. 5B, candidate video selection is illustrated using two initial videos as an example.

The first neural network includes a feature extraction layer, a similarity determination layer, and an output layer.

The feature extraction layer includes a video feature extraction layer configured to extract features from the initial videos to obtain video feature vectors, and further includes a voice packet feature extraction layer configured to extract features from the promotional audio of the voice packet to obtain a promotional audio feature vector. The feature extraction networks are implemented based on neural networks.

The similarity determination layer is configured to calculate the cosine similarity between each video feature vector and the promotional audio feature vector.

The output layer is configured to select at least one candidate video from the initial videos based on the cosine similarities.

In the training stage of the first neural network, the training corpus may be constructed by manual labeling.

c. Generation of video tags
Each candidate video has classification tags in different dimensions, for example an image tag reflecting the personal image of the voice provider, a timbre tag reflecting the characteristics of the provider's voice, and a style tag reflecting the announcement style. Each dimension corresponds to at least one tag value. For example, the timbre tag may take values such as sweet or hoarse; the image tag may take values such as elder-sister type, loli, or middle-aged man; and the style tag may take values such as broadcast tone or humorous.

Determining the specific tag value in a given dimension can be treated as a multi-class classification task, and there are as many tasks as there are dimensions. On this basis, a second neural network classifies the candidate videos using multi-task learning and determines the classification tags of each candidate video.

図５Ｃに示す第２ニューラルネットワークモデルの構造模式図を参照する。ここで、モデルの入力は、候補動画からサンプリングした複数のサンプリング画面であり、モデルの出力結果は、各次元の確率が最大のタグ値および各タグ値に対応する確率値である。 Please refer to the structural schematic diagram of the second neural network model shown in FIG. 5C. Here, the input of the model is a plurality of sampling screens sampled from the candidate moving images, and the output result of the model is the tag value with the maximum probability in each dimension and the probability value corresponding to each tag value.

ここで、モデルは特徴抽出層と出力層とを備える。 Here, the model comprises a feature extraction layer and an output layer.

ここで、特徴抽出層は、ニューラルネットワークに基づいて実現され、候補動画のサンプリング画面に対して特徴を抽出するように構成され、出力層は、複数の分類器を備え、異なる次元の分類タグのタグ値を確定するように構成される。 Here, the feature extraction layer is realized based on a neural network and configured to extract features for sampling screens of candidate videos, and the output layer comprises a plurality of classifiers for classifying tags of different dimensions. Configured to determine tag values.

Note that when tag values of classification tags in different dimensions are determined for the same video, the classification tasks are related, so common features can be extracted by sharing the network parameters of the feature extraction layer.
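
The shared-feature, multi-head arrangement described above can be sketched as follows. This is a minimal illustration, not the model of FIG. 5C: the averaging "feature extractor", the linear heads, the weights, and every name in the block are assumptions made for the example.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extract_shared_features(frames):
    """Stand-in for the shared feature extraction layer: average the frame vectors."""
    count = len(frames)
    width = len(frames[0])
    return [sum(frame[i] for frame in frames) / count for i in range(width)]


def classify_video(frames, heads):
    """Run one classifier head per tag dimension on the same shared features.

    `heads` maps a dimension name to (tag_values, weight_rows); each weight row
    scores one tag value. Returns {dimension: (best_tag, {tag: probability})}.
    """
    features = extract_shared_features(frames)  # computed once, reused by every head
    result = {}
    for dimension, (tag_values, weight_rows) in heads.items():
        scores = [sum(w * x for w, x in zip(row, features)) for row in weight_rows]
        by_tag = dict(zip(tag_values, softmax(scores)))
        best = max(by_tag, key=by_tag.get)
        result[dimension] = (best, by_tag)
    return result


# Hypothetical 2-dimensional frame vectors and hand-picked weights, for illustration only.
FRAMES = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
HEADS = {
    "voice_quality": (["sweet", "husky"], [[2.0, 0.0], [0.0, 2.0]]),
    "style": (["announcer", "humorous"], [[0.0, 3.0], [1.0, 0.0]]),
}
TAGS = classify_video(FRAMES, HEADS)
```

Each head returns both the most probable tag value and the full probability table for its dimension, matching the output described for the model. In a trained network the shared extractor and the heads would be learned jointly rather than fixed by hand.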

In the training stage of the second neural network model, the training corpus can be built by manually annotating each sample video with its classification tags. In addition, to address the cold-start problem, the text description of a sample video, or the persona of the users who watch the sample video, can also be used as classification tags. This expands the amount of training data and in turn improves the accuracy of the trained model.

Note that the feature extraction layer used when generating video tags and the feature extraction layer used when associating voice packets with videos may have the same underlying neural network structure or different ones.

2. Storage of voice packet video information

The information is stored in a back-end storage system in key-value form, and two kinds of index can be used: a forward index and an inverted index. In the forward index, the key is the voice packet ID, and the value holds the video content and video source of the candidate videos, the cosine similarity between the voice packet's promotional audio and each candidate video, and the classification tags of the voice packet's videos. In the inverted index, the key is a video's tag information and the value is the voice packet ID. This storage scheme supports the online query needs of personalized recommendation well.
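
The two indexes can be pictured with plain dictionaries. The sketch below assumes an in-memory store and hypothetical record fields (`packet_id`, `video`, `source`, `cosine_similarity`, `tags`); the source does not specify a storage engine or a schema.

```python
from collections import defaultdict


def build_indexes(records):
    """Build a forward index (packet ID -> video entries) and an inverted index (tag -> packet IDs)."""
    forward = {}
    inverted = defaultdict(set)
    for record in records:
        packet_id = record["packet_id"]
        forward.setdefault(packet_id, []).append(
            {
                "video": record["video"],
                "source": record["source"],
                "cosine_similarity": record["cosine_similarity"],
                "tags": list(record["tags"]),
            }
        )
        for tag in record["tags"]:
            inverted[tag].add(packet_id)
    return forward, dict(inverted)


# Hypothetical records, for illustration only.
RECORDS = [
    {"packet_id": "vp-1", "video": "v1.mp4", "source": "generated",
     "cosine_similarity": 0.92, "tags": ["sweet", "announcer"]},
    {"packet_id": "vp-2", "video": "v2.mp4", "source": "searched",
     "cosine_similarity": 0.71, "tags": ["sweet", "humorous"]},
    {"packet_id": "vp-3", "video": "v3.mp4", "source": "searched",
     "cosine_similarity": 0.65, "tags": ["husky"]},
]
FORWARD, INVERTED = build_indexes(RECORDS)
```

With this layout, a lookup such as `INVERTED["sweet"]` answers "which voice packets have a video with this tag", while `FORWARD["vp-1"]` returns everything needed to rank and display that packet.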

3. Personalized recommendation of voice packets

a. Recall of voice packet candidates
Recall is performed mainly by using the user's persona tags as keys and querying the inverted index.

Refer to the schematic of the process for determining a user's persona tags shown in FIG. 5D. Based on collaborative filtering, the user's initial persona tags are determined in combination with the classification tags of the historical videos associated with the user's historical behavior. The initial persona tags are then weighted and sorted according to the interaction behavior and the number of interactions, yielding the user's persona tags, which are presented as a list. Target videos are recalled based on the degree of correlation between the user's persona tags and the classification tags of the voice packets' candidate videos, and the voice packets to which the recalled target videos belong are taken as candidate voice packets.
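
The weighting and recall steps can be sketched as below. The collaborative filtering step is omitted, the per-behavior weights are invented for the example (the source gives no numbers), and exact tag match stands in for the "degree of correlation"; all names are hypothetical.

```python
from collections import defaultdict

# Hypothetical weights per interaction behavior; the source does not give values.
BEHAVIOR_WEIGHTS = {
    "browse_partial": 0.5,
    "browse_full": 1.0,
    "comment": 2.0,
    "like": 3.0,
    "download": 4.0,
}


def persona_tags(history, top_n=3):
    """Weight tags of historical videos by behavior and interaction count, then sort.

    `history` is a list of (behavior, count, video_tags) tuples.
    """
    scores = defaultdict(float)
    for behavior, count, video_tags in history:
        for tag in video_tags:
            scores[tag] += BEHAVIOR_WEIGHTS[behavior] * count
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:top_n]]


def recall_candidates(tags, inverted_index):
    """Query the inverted index with each persona tag and merge the packet IDs in order."""
    recalled = []
    for tag in tags:
        for packet_id in sorted(inverted_index.get(tag, ())):
            if packet_id not in recalled:
                recalled.append(packet_id)
    return recalled


HISTORY = [
    ("like", 2, ["sweet", "humorous"]),
    ("browse_full", 3, ["husky"]),
    ("download", 1, ["sweet"]),
]
INVERTED_INDEX = {
    "sweet": {"vp-1", "vp-2"},
    "humorous": {"vp-2", "vp-3"},
    "husky": {"vp-4"},
}
PERSONA = persona_tags(HISTORY, top_n=2)
CANDIDATES = recall_candidates(PERSONA, INVERTED_INDEX)
```

Here "sweet" scores 10, "humorous" 6, and "husky" 3, so the top two persona tags recall packets vp-1, vp-2, and vp-3.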

The interaction behavior includes at least one of browsing, commenting, liking, downloading, and sharing. The interaction behavior may further include the degree of interaction, such as browsing part of a video or browsing all of it.

b. Sorting of voice packet candidates
The recall method above recalls many candidate voice packets. A ranking model sorts the candidate voice packets, and target voice packets are selected from them accordingly. A sorted list of target voice packets is returned and displayed for each user.

The ranking model can be a tree model or a neural network model, and the framework can be chosen from the mature pointwise, pairwise, and listwise frameworks.

For example, a ranking model is used to sort the candidate voice packets based on the CTR (click-through rate) features of the voice packet itself, the voice packet's description information, the source information of the candidate voice packet, the cosine similarity between the voice packet's promotional audio and the corresponding target video, and the classification tags of the target video. At least one candidate voice packet is then selected as a target voice packet according to the sorting result.
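
As a rough illustration of pointwise ranking over some of these features, the sketch below scores each candidate with a hand-weighted linear function. This is a deliberate simplification: the source names tree and neural network models, and the weights, the tag-overlap feature, and all identifiers here are assumptions. Description and source information are left out for brevity.

```python
def score_candidate(candidate, persona, weights):
    """Linear pointwise score from CTR, cosine similarity, and persona/tag overlap."""
    tags = set(candidate["tags"])
    overlap = len(tags & set(persona)) / len(tags) if tags else 0.0
    return (
        weights["ctr"] * candidate["ctr"]
        + weights["similarity"] * candidate["cosine_similarity"]
        + weights["tag_overlap"] * overlap
    )


def select_targets(candidates, persona, weights, top_k=1):
    """Sort candidates by descending score and keep the top_k packet IDs."""
    ranked = sorted(
        candidates,
        key=lambda c: score_candidate(c, persona, weights),
        reverse=True,
    )
    return [c["packet_id"] for c in ranked[:top_k]]


# Hypothetical candidates and weights, for illustration only.
WEIGHTS = {"ctr": 1.0, "similarity": 1.0, "tag_overlap": 1.0}
CANDIDATE_PACKETS = [
    {"packet_id": "vp-3", "ctr": 0.2, "cosine_similarity": 0.4, "tags": ["uncle"]},
    {"packet_id": "vp-1", "ctr": 0.1, "cosine_similarity": 0.9, "tags": ["sweet"]},
    {"packet_id": "vp-2", "ctr": 0.3, "cosine_similarity": 0.5, "tags": ["husky", "humorous"]},
]
TARGETS = select_targets(CANDIDATE_PACKETS, ["sweet", "humorous"], WEIGHTS, top_k=2)
```

A learned model would replace `score_candidate`; the surrounding sort-and-truncate step stays the same.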

In the training stage of the ranking model, the training corpus can be built automatically from the interaction behavior of sample users. Taking listwise as an example, a given sample user browses a large number of sample videos containing sample voice packets, and the ordering among these sample videos can be set from high to low as follows: videos that converted into a download, liked videos, commented videos, videos browsed to the end, videos not browsed to the end, and videos barely browsed.
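
One way to turn that ordering into listwise training labels is sketched below. The behavior names and the integer grades are assumptions for the example; only the relative order comes from the text.

```python
# Behaviors in the order given in the text, highest relevance first.
BEHAVIOR_ORDER = [
    "download",
    "like",
    "comment",
    "browse_full",
    "browse_partial",
    "browse_barely",
]
# Hypothetical integer grades: download = 6 down to browse_barely = 1.
RELEVANCE = {b: len(BEHAVIOR_ORDER) - rank for rank, b in enumerate(BEHAVIOR_ORDER)}


def build_listwise_sample(user_id, interactions):
    """Grade each video by its strongest behavior and order the videos by grade.

    `interactions` maps a video ID to the list of behaviors the user showed on it.
    """
    graded = []
    for video_id, behaviors in interactions.items():
        grade = max(RELEVANCE[b] for b in behaviors)
        graded.append((video_id, grade))
    graded.sort(key=lambda item: (-item[1], item[0]))
    return {
        "user_id": user_id,
        "ranking": [video_id for video_id, _ in graded],
        "grades": dict(graded),
    }


SAMPLE = build_listwise_sample(
    "user-1",
    {
        "v1": ["browse_full"],
        "v2": ["like", "browse_full"],
        "v3": ["download"],
        "v4": ["browse_barely"],
    },
)
```

Each sample user yields one ranked list, which is the unit a listwise objective trains on.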

c. Presentation in the form of video interaction
The target voice packet is recommended to the user through the target video associated with it. The user can grasp the characteristics of the voice packet more intuitively and fully, and the impression is stronger, which greatly improves the user's selection efficiency. The video-based browsing experience is also better, and the user can obtain information more easily.

The target voice packet is presented in the form of video interaction, which involves three aspects: first, exposing the share, like, and comment functions to make interaction simpler; second, dynamically generating a QR code image for downloading the voice packet and displaying it at the upper right of the target video, which shortens the steps for a user to share and download and greatly improves propagation efficiency; and third, supporting convenient interaction operations such as swipe-to-switch.

FIG. 6 is a structural diagram of a voice packet recommendation apparatus according to an embodiment of the present invention. The voice packet recommendation apparatus 600 comprises a target display video selection module 601, a target voice packet selection module 602, and a target voice packet recommendation module 603.

The target display video selection module 601 is configured to select at least one target display video for a user from candidate display videos associated with voice packets, and to take the voice packet to which the target display video belongs as a candidate voice packet.

The target voice packet selection module 602 is configured to select a target voice packet for the user from the candidate voice packets based on attribute information of the candidate voice packets and attribute information of the target display video.

The target voice packet recommendation module 603 is configured to recommend the target voice packet to the user.

In this embodiment of the present invention, the target display video selection module selects at least one target display video for the user from candidate display videos associated with voice packets and takes the voice packet to which the target display video belongs as a candidate voice packet; the target voice packet selection module selects a target voice packet for the user from the candidate voice packets based on attribute information of the candidate voice packets and attribute information of the target display video; and the target voice packet recommendation module recommends the target voice packet to the user. With this technical solution, the video associated with a voice packet serves as an intermediary for determining the voice packet, and the target voice packet is recommended automatically, which shifts the process from the user searching for voice packets to voice packets actively finding the user. At the same time, because voice packets are determined through videos, the user does not need to audition voice packets repeatedly, which makes obtaining a voice packet both more convenient and more efficient.
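
The cooperation of the three modules can be sketched as a small pipeline. This is only an illustration of the data flow: the class, its fields, and the callables passed in are hypothetical, and the real selection logic of modules 601 and 602 is reduced to stand-in functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VoicePacketRecommender:
    # Stand-in for module 601: returns target display video IDs for a user.
    select_target_videos: Callable[[str], List[str]]
    # Maps each display video ID to the voice packet it belongs to.
    video_to_packet: Dict[str, str]
    # Stand-in for module 602: picks target packets from the candidates.
    select_target_packets: Callable[[str, List[str], List[str]], List[str]]

    def recommend(self, user_id: str) -> List[dict]:
        """Stand-in for module 603: return target packets with their display videos."""
        target_videos = self.select_target_videos(user_id)
        candidates: List[str] = []
        for video_id in target_videos:
            packet_id = self.video_to_packet[video_id]
            if packet_id not in candidates:
                candidates.append(packet_id)
        targets = self.select_target_packets(user_id, candidates, target_videos)
        return [
            {
                "packet_id": packet_id,
                "videos": [v for v in target_videos if self.video_to_packet[v] == packet_id],
            }
            for packet_id in targets
        ]


recommender = VoicePacketRecommender(
    select_target_videos=lambda user_id: ["v1", "v2", "v3"],
    video_to_packet={"v1": "vp-1", "v2": "vp-2", "v3": "vp-1"},
    select_target_packets=lambda user_id, packets, videos: packets[:1],
)
RESULT = recommender.recommend("user-1")
```

Returning the associated videos alongside each packet reflects the idea that the recommendation is delivered through the display video rather than as a bare packet ID.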

In one embodiment, the target display video selection module 601 comprises:
a target display video determination unit configured to determine at least one target display video based on the degree of correlation between the user's persona tags and the classification tags of the candidate display videos associated with the voice packets.

In one embodiment, the apparatus further comprises:
an image extraction module configured to extract images from the candidate display video; and
a classification tag determination module configured to input the extracted images into a pre-trained multi-classification model and to determine at least one classification tag of the candidate display video based on the model's output.

In one embodiment, the apparatus further comprises:
a sample classification tag determination module configured to take the text description of a sample video, or the persona of users who watch the sample video, or both, as the sample classification tags of the sample video; and
a multi-classification model training module configured to train a pre-built neural network model based on sample images extracted from the sample video and the sample classification tags, to obtain the multi-classification model.

In one embodiment, the multi-classification model shares model parameters in the process of determining each of the classification tags.

In one embodiment, the classification tags include at least one of an image tag, a voice quality tag, and a voice style tag.

In one embodiment, the apparatus further comprises:
an initial display video determination module configured to determine initial display videos of the voice packet; and
a candidate display video determination module configured to determine the candidate display videos associated with the voice packet based on the priority of the video source of each of the candidate display videos.

In one embodiment, the apparatus further comprises:
an initial display video determination module configured to determine initial display videos of the voice packet; and
a candidate display video determination module configured to determine the candidate display videos associated with the voice packet based on the similarity between each of the initial display videos and the voice packet.
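
The similarity-based selection of candidate display videos can be sketched with cosine similarity, the measure the description mentions for comparing a voice packet's promotional audio with a video. The vectors, the threshold, and the function names below are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def choose_candidate_videos(packet_vector, initial_videos, threshold=0.5):
    """Keep initial display videos whose similarity to the packet meets the threshold.

    `initial_videos` maps a video ID to its feature vector. The threshold is a
    hypothetical cut-off; the source does not state one.
    """
    scored = [
        (video_id, cosine_similarity(packet_vector, vector))
        for video_id, vector in initial_videos.items()
    ]
    kept = [(video_id, sim) for video_id, sim in scored if sim >= threshold]
    return sorted(kept, key=lambda item: -item[1])


CHOSEN = choose_candidate_videos(
    [1.0, 0.0],
    {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]},
)
```

Video "b" is orthogonal to the packet vector and is dropped, while "a" and "c" are kept in descending order of similarity.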

In one embodiment, the initial display video determination module comprises:
a promotional text determination unit configured to determine the promotional text of the voice packet based on a promotional image of the voice packet provider;
a promotional audio and subtitle generation unit configured to generate promotional audio and promotional subtitles from the promotional text based on an acoustic synthesis model of the voice packet provider; and
an initial display video generation unit configured to generate the initial display video based on the promotional image, the promotional audio, and the promotional subtitles.

In one embodiment, the initial display video determination module comprises:
a video search word construction unit configured to construct a video search word based on voice packet provider information; and
an initial display video generation unit configured to search, based on the video search word, for videos of the voice packet provider to serve as the initial display videos.

In one embodiment, the target voice packet recommendation module 603 comprises:
a target voice packet recommendation unit configured to recommend the target voice packet to the user through the target display video associated with the target voice packet.

The voice packet recommendation apparatus described above can execute each voice packet recommendation method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to executing the method.

According to embodiments of the present invention, the present invention further provides an electronic device and a readable storage medium.

FIG. 7 is a block diagram of an electronic device that implements the voice packet recommendation method of an embodiment of the present invention. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as mobile terminals, mobile phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present invention described or claimed herein.

As shown in FIG. 7, the electronic device comprises one or more processors 701, a memory 702, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and can be mounted on a common motherboard or in other ways as needed. The processor can process instructions executed within the electronic device, including instructions stored in or on the memory for displaying graphical information of a GUI on an external input/output device (for example, a display device coupled to an interface). In other embodiments, multiple processors and multiple buses can be used together with multiple memories where necessary. Likewise, multiple electronic devices can be connected, with each device providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system). FIG. 7 takes one processor 701 as an example.

The memory 702 is the non-transitory computer-readable storage medium provided by the present invention. The memory stores instructions executable by at least one processor, so that the at least one processor executes the voice packet recommendation method provided by the present invention. The non-transitory computer-readable storage medium of the present invention stores computer instructions that are configured to cause a computer to execute the voice packet recommendation method provided by the present invention.

As a non-transitory computer-readable storage medium, the memory 702 may be configured to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the voice packet recommendation method in the embodiments of the present invention (for example, the target display video selection module 601, the target voice packet selection module 602, and the target voice packet recommendation module 603 shown in FIG. 6). The processor 701 performs the various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 702, thereby implementing the voice packet recommendation method of the method embodiments above.

The memory 702 may comprise a program storage area and a data storage area. The program storage area can store an operating system and the application programs required by at least one function; the data storage area can store data created through use of the electronic device that implements the voice packet recommendation method. The memory 702 may include high-speed random access memory and may further include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 702 optionally includes memory located remotely from the processor 701, and this remote memory can be connected over a network to the electronic device that implements the voice packet recommendation method. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device that implements the voice packet recommendation method may further comprise an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703, and the output device 704 can be connected by a bus or in other ways; FIG. 7 takes connection by a bus as an example.

The input device 703 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device that implements the voice packet recommendation method. Examples include a touch panel, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, and a joystick. The output device 704 may include a display device, an auxiliary lighting device (for example, an LED), a haptic feedback device (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch panel.

Various embodiments of the systems and techniques described here can be implemented in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, or combinations thereof. These embodiments may include implementation in one or more computer programs that can be executed, interpreted, or both on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that receives data and instructions from a storage system, at least one input device, and at least one output device, and transmits data and instructions to the storage system, the at least one input device, and the at least one output device.

These computer programs (also called programs, software, software applications, or code) include machine instructions for a programmable processor and can be implemented using high-level procedural or object-oriented programming languages, or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, or device (for example, a magnetic disk, an optical disk, memory, or a programmable logic device (PLD)) configured to provide machine instructions or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions or data to a programmable processor.

To provide interaction with a user, the systems and techniques described here can be implemented on a computer that has a display device for showing information to the user (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor), and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user. For example, the feedback provided to the user can be any form of sensory feedback (for example, visual, auditory, or haptic feedback), and input from the user can be received in any form (including acoustic, voice, or tactile input).

The systems and techniques described here can be implemented in a computing system that includes back-end components (for example, a data server), or a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user computer with a graphical user interface or a web browser through which a user can interact with embodiments of the systems and techniques described here), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other.

According to the technical solution of the embodiments of the present invention, at least one target display video is selected for the user from candidate display videos associated with voice packets, and the voice packet to which the target display video belongs is taken as a candidate voice packet; a target voice packet is selected for the user from the candidate voice packets based on attribute information of the candidate voice packets and attribute information of the target display video; and the target voice packet is recommended to the user. With this technical solution, the video associated with a voice packet serves as an intermediary for determining the voice packet, and the target voice packet is recommended automatically, which shifts the process from the user searching for voice packets to voice packets actively finding the user. At the same time, because voice packets are determined through videos, the user does not need to audition voice packets repeatedly, which makes obtaining a voice packet both more convenient and more efficient.

It should be understood that the various forms of flow shown above can be used with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in a different order; as long as the desired result of the technical solution disclosed in the present invention can be achieved, no limitation is imposed here.

The specific embodiments above do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions are possible depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

A voice packet recommendation method executed by a voice packet recommendation apparatus, comprising:
selecting at least one target display video for a user from candidate display videos associated with voice packets, and taking the voice packet to which the target display video belongs as a candidate voice packet;
selecting a target voice packet for the user from the candidate voice packets based on attribute information of the candidate voice packets and attribute information of the target display video; and
recommending the target voice packet to the user.

The method of claim 1, wherein selecting, for the user, at least one target display video from the candidate display videos associated with the voice packets comprises:
determining the at least one target display video based on a degree of correlation between a persona tag of the user and classification tags of the candidate display videos associated with the voice packets.

The method of claim 2, further comprising:
extracting images from the candidate display video; and
inputting the extracted images into a pre-trained multi-classification model, and determining at least one classification tag of the candidate display video based on an output result of the model.

The method of claim 3, further comprising:
taking a text description of a sample video, or a persona of users who watch the sample video, or both the text description of the sample video and the persona of users who watch the sample video, as sample classification tags of the sample video; and
training a pre-built neural network model based on sample images extracted from the sample video and the sample classification tags to obtain the multi-classification model.

The method of claim 3, wherein the multi-classification model shares model parameters in the process of determining each of the classification tags.

The method of claim 2, wherein the classification tags comprise at least one of an image tag, a sound quality tag, and a voice style tag.

The method of claim 1, further comprising:
determining initial display videos of the voice packet; and
determining the candidate display videos associated with the voice packet based on a priority of a video source of each of the initial display videos.

The method of claim 1, further comprising:
determining initial display videos of the voice packet; and
determining the candidate display videos associated with the voice packet based on a similarity between each of the initial display videos and the voice packet.

The method of claim 7 or 8, wherein determining the initial display videos of the voice packet comprises:
determining promotional text of the voice packet based on a promotional image of a voice packet provider;
generating promotional audio and promotional subtitles from the promotional text based on an acoustic synthesis model of the voice packet provider; and
generating the initial display video based on the promotional image, the promotional audio, and the promotional subtitles.

The method of claim 7 or 8, wherein determining the initial display videos of the voice packet comprises:
constructing a video search word based on information of a voice packet provider; and
searching, based on the video search word, for a video of the voice packet provider as the initial display video.

The method of any one of claims 1 to 8, wherein recommending the target voice packet to the user comprises:
recommending the target voice packet to the user by means of the target display video associated with the target voice packet.

A voice packet recommendation device, comprising:
a target display video selection module configured to select, for a user, at least one target display video from candidate display videos associated with voice packets, and to take the voice packet to which the target display video belongs as a candidate voice packet;
a target voice packet selection module configured to select, for the user, a target voice packet from the candidate voice packets based on attribute information of the candidate voice packets and attribute information of the target display video; and
a target voice packet recommendation module configured to recommend the target voice packet to the user.

An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor,
wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the voice packet recommendation method of any one of claims 1 to 11.

A program for causing a computer to execute the voice packet recommendation method of any one of claims 1 to 11.