JP2023140974A

JP2023140974A - Information processing apparatus, information processing system, and program

Info

Publication number: JP2023140974A
Application number: JP2022047070A
Authority: JP
Inventors: 淳秦野; Jun Hatano
Original assignee: Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2022-03-23
Filing date: 2022-03-23
Publication date: 2023-10-05

Abstract

To reduce the workload on a user when processing specific data among a plurality of pieces of static image data generated from video image data of a plurality of sheets of documents. SOLUTION: A management server 10 serving as an information processing apparatus includes a control unit 11. The control unit 11 includes: a video voice management unit 102 which manages, on the same time axis and in association with each other, video image data obtained by imaging a plurality of sheets of documents and voice data recorded in parallel with the imaging; a voice detection unit 103 which detects a voice command from the recorded voice data; an image generation unit 104 which generates static image data of the plurality of documents from the video image data; and a processing control unit 106 which performs control to execute predetermined processing on the static image data of the document that was being imaged at the time the voice command was issued. SELECTED DRAWING: Figure 4

Description

本発明は、情報処理装置、情報処理システム、およびプログラムに関する。 The present invention relates to an information processing device, an information processing system, and a program.

ユーザが、原稿となる用紙に形成された画像を電子化したい場合であっても、状況によっては画像を読み取る機能を有する装置（例えば、スキャナ装置や複合機等）を使用できないことがある。このような状況において、ユーザは、スマートフォン等の撮像機能を用いて原稿の動画像を撮像し、その動画像のデータから原稿の静止画像のデータを生成することがあり、関連する技術も存在する（例えば、特許文献１）。 Even if a user wants to digitize an image formed on a paper serving as a manuscript, depending on the situation, it may not be possible to use a device (for example, a scanner device, a multifunction device, etc.) that has a function of reading the image. In such situations, the user may capture a moving image of the document using the imaging function of a smartphone, etc., and generate still image data of the document from the data of the moving image, and related technologies also exist. (For example, Patent Document 1).

特開第２０１０－０９８６１５号公報 Japanese Patent Application Publication No. 2010-098615

しかしながら、原稿が複数枚である場合には、特定の原稿に対して他の原稿と異なる処理を行わなければならないことがある。このような場合、ユーザは、生成された複数の静止画像の中から対象となる原稿を探し、手動で処理を行うことになるが、生成された静止画像のデータが大量に存在する場合には、対象となる原稿を探すために時間を要することになるため、ユーザの作業負担が大きい。 However, when there are multiple documents, a particular document may need to be processed differently from the others. In such cases, the user must search the generated still images for the target document and process it manually. If a large amount of still image data has been generated, finding the target document takes time, which places a heavy workload on the user.

本発明の目的は、複数枚の原稿の動画像のデータから生成される複数の静止画像のデータのうち、特定のデータに対する処理を行う場合におけるユーザの作業負担を従来よりも軽減させることにある。 An object of the present invention is to reduce, compared with the conventional approach, the user's workload when processing specific data among a plurality of pieces of still image data generated from moving image data of a plurality of documents.

請求項１に記載された発明は、プロセッサを備え、前記プロセッサは、撮像された複数枚の原稿の動画像のデータと、当該撮像と並行して録音された音声のデータとを対応付けて同一の時間軸で管理し、前記音声のデータから予め定められた音声を検出し、前記動画像のデータから生成される前記複数枚の原稿の静止画像のデータのうち、前記予め定められた音声が発せられたタイミングで撮像対象とされた原稿の静止画像のデータに対し、予め定められた処理を実行する制御を行うことを特徴とする、情報処理装置である。
請求項２に記載された発明は、前記動画像のデータが、前記複数枚の原稿の各々に形成された画像が連続的に撮像された１の動画像のデータであることを特徴とする、請求項１に記載の情報処理装置である。
請求項３に記載された発明は、前記１の動画像のデータが、ユーザの第１の動作と第２の動作に基づいて撮像されたものであり、前記音声のデータが、当該ユーザから発せられた音声が録音されたものであることを特徴とする、請求項２に記載の情報処理装置である。
請求項４に記載された発明は、前記第１の動作が、前記複数枚の原稿を撮像する動作であり、前記第２の動作が、当該複数枚の原稿の各々を被写体とするための動作であることを特徴とする、請求項３に記載の情報処理装置である。
請求項５に記載された発明は、前記複数枚の原稿の各々を被写体とするための動作が、前記複数枚の原稿を１枚ずつめくる動作であることを特徴とする、請求項４に記載の情報処理装置である。
請求項６に記載された発明は、前記プロセッサは、前記予め定められた処理を実行する制御として、前記予め定められた音声に対応する処理を実行する制御を行うことを特徴とする、請求項１に記載の情報処理装置である。
請求項７に記載された発明は、前記予め定められた音声および前記予め定められた処理の各々が複数存在し、当該予め定められた音声および当該予め定められた処理の各々が、予め定められたデータベースにおいて対応付けられて記憶されていることを特徴とする、請求項６に記載の情報処理装置である。
請求項８に記載された発明は、前記予め定められた音声が、前記予め定められた音声であることを示す音声と、前記処理の対象となる原稿を指定するための音声と、当該処理の内容を示す音声と、録音された音声を削除するための音声とのうち少なくとも１の音声であることを特徴とする、請求項７に記載の情報処理装置である。
請求項９に記載された発明は、前記処理の対象となる原稿を指定するための音声、および録音された音声を削除するための音声には、見開きの複数枚の原稿のうち対象となる１の原稿を指定するための音声が含まれることを特徴とする、請求項８に記載の情報処理装置である。
請求項１０に記載された発明は、前記処理の内容を示す音声が、前記処理の対象となる前記静止画像のデータの出力形式、属性、および構成の各々を指定するための音声のうちいずれか１以上であることを特徴とする、請求項８に記載の情報処理装置である。
請求項１１に記載された発明は、前記静止画像のデータの出力形式を指定するための音声が、当該静止画像のデータのファイルの形式、色、および向きのうち、いずれか１以上を指定するための音声であることを特徴とする、請求項１０に記載の情報処理装置である。
請求項１２に記載された発明は、前記静止画像のデータの属性を指定するための音声が、当該静止画像のデータの印刷の可否、編集の可否、転記の可否、暗号化の有無、および文字認識時の言語のうち、いずれか１以上を指定するための音声であることを特徴とする、請求項１０に記載の情報処理装置である。
請求項１３に記載された発明は、前記静止画像のデータの構成を指定するための音声が、当該静止画像のデータにおける原稿の挿入および原稿の削除のうち、いずれか１以上を指定するための音声であることを特徴とする、請求項１０に記載の情報処理装置である。
請求項１４に記載された発明は、複数枚の原稿の動画像を撮像する撮像手段と、前記撮像と並行して音声を録音する録音手段と、撮像された前記動画像のデータと、録音された前記音声のデータとを取得し、当該動画像のデータと、当該音声のデータとを対応付けて同一の時間軸で管理する管理手段と、前記音声のデータから予め定められた音声を検出する検出手段と、前記動画像のデータから前記複数枚の原稿の静止画像のデータを生成する生成手段と、生成した前記静止画像のデータのうち、前記予め定められた音声が発されたタイミングで撮像対象とされた原稿の静止画像のデータに対し、予め定められた処理を実行する制御を行う処理実行制御手段と、を有することを特徴とする、情報処理システムである。
請求項１５に記載された発明は、コンピュータに、撮像された複数枚の原稿の動画像のデータと、当該撮像と並行して録音された音声のデータとを対応付けて同一の時間軸で管理する機能と、前記音声のデータから予め定められた音声を検出する機能と、前記動画像のデータから生成される前記複数枚の原稿の静止画像のデータのうち、前記予め定められた音声が発せられたタイミングで撮像対象とされた原稿の静止画像のデータに対し、予め定められた処理を実行する制御を行う機能と、を実現させるためのプログラムである。 The invention described in claim 1 is an information processing apparatus comprising a processor, wherein the processor manages, on the same time axis and in association with each other, moving image data of a plurality of imaged documents and audio data recorded in parallel with the imaging; detects a predetermined voice from the audio data; and performs control to execute predetermined processing on, among the still image data of the plurality of documents generated from the moving image data, the still image data of the document that was the imaging target at the timing when the predetermined voice was uttered.
The invention described in claim 2 is the information processing apparatus according to claim 1, wherein the moving image data is data of a single moving image in which the images formed on each of the plurality of documents are captured continuously.
The invention described in claim 3 is the information processing apparatus according to claim 2, wherein the single moving image data is captured based on a first action and a second action of a user, and the audio data is a recording of voice uttered by that user.
The invention described in claim 4 is the information processing apparatus according to claim 3, wherein the first action is an action of imaging the plurality of documents, and the second action is an action for making each of the plurality of documents the subject.
The invention described in claim 5 is the information processing apparatus according to claim 4, wherein the action for making each of the plurality of documents the subject is an action of turning over the plurality of documents one at a time.
The invention described in claim 6 is the information processing apparatus according to claim 1, wherein, as the control to execute the predetermined processing, the processor performs control to execute processing corresponding to the predetermined voice.
The invention described in claim 7 is the information processing apparatus according to claim 6, wherein a plurality of the predetermined voices and a plurality of the predetermined processes exist, and each predetermined voice and each predetermined process are stored in association with each other in a predetermined database.
The invention described in claim 8 is the information processing apparatus according to claim 7, wherein the predetermined voice is at least one of a voice indicating that it is the predetermined voice, a voice for designating the document to be processed, a voice indicating the content of the processing, and a voice for deleting recorded voice.
The invention described in claim 9 is the information processing apparatus according to claim 8, wherein the voice for designating the document to be processed and the voice for deleting recorded voice include a voice for designating one target document among a plurality of documents in a double-page spread.
The invention described in claim 10 is the information processing apparatus according to claim 8, wherein the voice indicating the content of the processing is one or more of voices for designating each of the output format, attributes, and configuration of the still image data to be processed.
The invention described in claim 11 is the information processing apparatus according to claim 10, wherein the voice for designating the output format of the still image data is a voice for designating one or more of the file format, color, and orientation of the still image data.
The invention described in claim 12 is the information processing apparatus according to claim 10, wherein the voice for designating the attributes of the still image data is a voice for designating one or more of whether printing is permitted, whether editing is permitted, whether transcription is permitted, whether encryption is applied, and the language used for character recognition.
The invention described in claim 13 is the information processing apparatus according to claim 10, wherein the voice for designating the configuration of the still image data is a voice for designating one or more of insertion of a document into and deletion of a document from the still image data.
The invention described in claim 14 is an information processing system comprising: imaging means for capturing a moving image of a plurality of documents; recording means for recording audio in parallel with the imaging; management means for acquiring the captured moving image data and the recorded audio data and managing them on the same time axis in association with each other; detection means for detecting a predetermined voice from the audio data; generation means for generating still image data of the plurality of documents from the moving image data; and process execution control means for performing control to execute predetermined processing on, among the generated still image data, the still image data of the document that was the imaging target at the timing when the predetermined voice was uttered.
The invention described in claim 15 is a program for causing a computer to realize: a function of managing, on the same time axis and in association with each other, moving image data of a plurality of imaged documents and audio data recorded in parallel with the imaging; a function of detecting a predetermined voice from the audio data; and a function of performing control to execute predetermined processing on, among the still image data of the plurality of documents generated from the moving image data, the still image data of the document that was the imaging target at the timing when the predetermined voice was uttered.

請求項１の本発明によれば、複数枚の原稿の動画像のデータから生成される複数の静止画像のデータのうち、特定のデータに対する予め定められた処理を行う場合におけるユーザの作業負担を従来よりも軽減させることを可能にする情報処理装置を提供できる。
請求項２の本発明によれば、複数枚の原稿を連続的に撮像した１の動画像のデータから生成される複数の静止画像のデータのうち、特定のデータに対する予め定められた処理を行う場合におけるユーザの作業負担を従来よりも軽減させることができる。
請求項３の本発明によれば、ユーザの２種類の動作により複数枚の原稿を連続的に撮像した１の動画像のデータから生成される複数の静止画像のデータのうち、特定のデータに対する予め定められた処理を行う場合におけるユーザの作業負担を従来よりも軽減させることができる。
請求項４の本発明によれば、ユーザの複数枚の原稿を撮像する動作と、複数枚の原稿の各々を被写体とするための動作とにより撮像した１の動画像のデータから生成される複数の静止画像のデータのうち、特定のデータに対する予め定められた処理を行う場合におけるユーザの作業負担を従来よりも軽減させることができる。
請求項５の本発明によれば、ユーザの複数枚の原稿を撮像する動作と、複数枚の原稿を１枚ずつめくる動作とにより撮像した１の動画像のデータから生成される複数の静止画像のデータのうち、特定のデータに対する予め定められた処理を行う場合におけるユーザの作業負担を従来よりも軽減させることができる。
請求項６の本発明によれば、複数枚の原稿の動画像のデータから生成される複数の静止画像のデータのうち特定のデータに対し、予め定められた音声に対応する処理を行う場合におけるユーザの作業負担を従来よりも軽減させることができる。
請求項７の本発明によれば、複数枚の原稿の動画像のデータから生成される複数の静止画像のデータのうち特定のデータに対し、予め定められた音声に対応する処理を行う場合におけるユーザの作業負担を従来よりも軽減させることができる。
請求項８の本発明によれば、処理の対象となる特定の原稿の静止画像のデータに対する処理の指定を、予め定められた音声であることを示す音声と、予め定められた処理の対象となる原稿を指定するための音声と、予め定められた処理の内容を示す音声と、録音された音声を削除するための音声とのうち少なくとも１の音声で行うことが可能となる。
請求項９の本発明によれば、処理の対象となる特定の原稿が見開きの複数枚の原稿に含まれる場合であっても、その原稿に対する処理の指定を音声で行うことが可能となる。
請求項１０の本発明によれば、処理の対象となる特定の原稿の静止画像のデータの出力形式、属性、および構成の各々の指定を音声で行うことが可能となる。
請求項１１の本発明によれば、処理の対象となる特定の原稿の静止画像のデータの出力形式としてのファイルの形式、色、および向きの各々の指定を音声で行うことが可能となる。
請求項１２の本発明によれば、処理の対象となる特定の原稿の静止画像のデータの属性としての印刷の可否、編集の可否、転記の可否、暗号化の有無、および文字認識時の言語の各々の指定を音声で行うことが可能となる。
請求項１３の本発明によれば、処理の対象となる特定の原稿の静止画像のデータの構成としての原稿の挿入および原稿の削除の各々の指定を音声で行うことが可能となる。
請求項１４の本発明によれば、複数枚の原稿の動画像のデータから生成される複数の静止画像のデータのうち、特定のデータに対する予め定められた処理を行う場合におけるユーザの作業負担を従来よりも軽減させることを可能にする情報処理システムを提供できる。
請求項１５の本発明によれば、複数枚の原稿の動画像のデータから生成される複数の静止画像のデータのうち、特定のデータに対する予め定められた処理を行う場合におけるユーザの作業負担を従来よりも軽減させることを可能にするプログラムを提供できる。 According to the invention of claim 1, it is possible to provide an information processing apparatus that can reduce, compared with the conventional approach, the user's workload when performing predetermined processing on specific data among a plurality of pieces of still image data generated from moving image data of a plurality of documents.
According to the invention of claim 2, the user's workload when performing predetermined processing on specific data among a plurality of pieces of still image data generated from data of a single moving image in which a plurality of documents are captured continuously can be reduced compared with the conventional approach.
According to the invention of claim 3, the user's workload when performing predetermined processing on specific data among a plurality of pieces of still image data generated from data of a single moving image, in which a plurality of documents are captured continuously through two types of user actions, can be reduced compared with the conventional approach.
According to the invention of claim 4, the user's workload when performing predetermined processing on specific data among a plurality of pieces of still image data generated from data of a single moving image, captured through the user's action of imaging a plurality of documents and action for making each of the documents the subject, can be reduced compared with the conventional approach.
According to the invention of claim 5, the user's workload when performing predetermined processing on specific data among a plurality of pieces of still image data generated from data of a single moving image, captured through the user's action of imaging a plurality of documents and action of turning over the documents one at a time, can be reduced compared with the conventional approach.
According to the invention of claim 6, the user's workload when performing processing corresponding to a predetermined voice on specific data among a plurality of pieces of still image data generated from moving image data of a plurality of documents can be reduced compared with the conventional approach.
According to the invention of claim 7, the user's workload when performing processing corresponding to a predetermined voice on specific data among a plurality of pieces of still image data generated from moving image data of a plurality of documents can be reduced compared with the conventional approach.
According to the invention of claim 8, the processing for the still image data of a specific document to be processed can be designated using at least one of a voice indicating that it is a predetermined voice, a voice for designating the document to be subjected to the predetermined processing, a voice indicating the content of the predetermined processing, and a voice for deleting recorded voice.
According to the invention of claim 9, even when the specific document to be processed is included among a plurality of documents in a double-page spread, the processing for that document can be designated by voice.
According to the invention of claim 10, each of the output format, attributes, and configuration of the still image data of a specific document to be processed can be designated by voice.
According to the invention of claim 11, each of the file format, color, and orientation, as the output format of the still image data of a specific document to be processed, can be designated by voice.
According to the invention of claim 12, each of whether printing is permitted, whether editing is permitted, whether transcription is permitted, whether encryption is applied, and the language used for character recognition, as attributes of the still image data of a specific document to be processed, can be designated by voice.
According to the invention of claim 13, each of insertion of a document and deletion of a document, as the configuration of the still image data of a specific document to be processed, can be designated by voice.
According to the invention of claim 14, it is possible to provide an information processing system that can reduce, compared with the conventional approach, the user's workload when performing predetermined processing on specific data among a plurality of pieces of still image data generated from moving image data of a plurality of documents.
According to the invention of claim 15, it is possible to provide a program that can reduce, compared with the conventional approach, the user's workload when performing predetermined processing on specific data among a plurality of pieces of still image data generated from moving image data of a plurality of documents.

本実施の形態が適用される情報処理システムの全体構成の一例を示す図である。 FIG. 1 is a diagram showing an example of the overall configuration of an information processing system to which this embodiment is applied.
本実施の形態が適用される情報処理装置としての管理サーバのハードウェア構成を示す図である。 FIG. 2 is a diagram showing the hardware configuration of a management server as an information processing apparatus to which this embodiment is applied.
本実施の形態が適用される情報処理装置としてのユーザ端末のハードウェア構成を示す図である。 FIG. 3 is a diagram showing the hardware configuration of a user terminal as an information processing apparatus to which this embodiment is applied.
管理サーバの制御部の機能構成を示す図である。 FIG. 4 is a diagram showing the functional configuration of the control unit of the management server.
ユーザ端末の制御部の機能構成を示す図である。 FIG. 5 is a diagram showing the functional configuration of the control unit of the user terminal.
ユーザ端末の処理の流れを示すフローチャートである。 FIG. 6 is a flowchart showing the flow of processing of the user terminal.
管理サーバの処理の流れを示すフローチャートである。 FIG. 7 is a flowchart showing the flow of processing of the management server.
ユーザによるユーザ端末の操作の具体例を示す図である。 FIG. 8 is a diagram showing a specific example of operation of the user terminal by a user.
図８の動画音声データから生成される静止画像のデータの具体例を示す図である。 FIG. 9 is a diagram showing a specific example of still image data generated from the video and audio data of FIG. 8.
動画音声データのうち動画像のデータから原稿を切り出して静止画像のデータを生成する処理の具体例を示す図である。 FIG. 10 is a diagram showing a specific example of processing that cuts out a document from the moving image data in the video and audio data to generate still image data.
見開きの２ページの原稿を分割して、２つの静止画像のデータを生成する処理の具体例を示す図である。 FIG. 11 is a diagram showing a specific example of processing that divides a two-page spread document to generate two pieces of still image data.
予め定められた処理の内容を示す音声コマンドの具体例を示す図である。 FIG. 12 is a diagram showing specific examples of voice commands indicating the content of predetermined processing.

以下、添付図面を参照して、本発明の実施の形態について詳細に説明する。
（情報処理システムの構成）
図１は、本実施の形態が適用される情報処理システム１の全体構成の一例を示す図である。
情報処理システム１は、管理サーバ１０と、ユーザ端末３０とがネットワーク９０を介して接続されることにより構成されている。ネットワーク９０は、例えば、ＬＡＮ（Local Area Network）、インターネット等である。 Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
(Configuration of information processing system)
FIG. 1 is a diagram showing an example of the overall configuration of an information processing system 1 to which this embodiment is applied.
The information processing system 1 is configured by a management server 10 and a user terminal 30 connected via a network 90. The network 90 is, for example, a LAN (Local Area Network), the Internet, or the like.

管理サーバ１０は、情報処理システム１を管理するサーバとしての情報処理装置である。例えば、管理サーバ１０は、複数枚の原稿が１枚ずつめくられていく様子の一部始終を撮像した動画像のデータと、その撮像と同時進行で録音されたユーザの音声のデータとを対応付けて同一の時間軸で管理する。そして、管理サーバ１０は、録音された音声のデータから予め定められた音声を検出し、動画像のデータから生成した複数枚の原稿の静止画像のデータのうち、検出した音声がユーザから発されたタイミングで撮像対象とされた原稿の静止画像のデータに対して予め定められた処理を実行する。 The management server 10 is an information processing apparatus serving as the server that manages the information processing system 1. For example, the management server 10 associates moving image data that captures, from beginning to end, a plurality of documents being turned over one at a time with audio data of the user's voice recorded concurrently with the imaging, and manages them on the same time axis. The management server 10 then detects a predetermined voice from the recorded audio data and, among the still image data of the plurality of documents generated from the moving image data, executes predetermined processing on the still image data of the document that was the imaging target at the timing when the detected voice was uttered by the user.
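The selection step described above can be sketched as a lookup on the shared time axis. This is only an illustrative sketch, not the method disclosed in the specification: the `PageImage` record, the `page_at` and `targets_of` functions, and the idea of storing a start and end time per page are all assumptions introduced here.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageImage:
    page_no: int    # order of the page in the bundle (hypothetical field)
    start_s: float  # time at which this page became the imaging target
    end_s: float    # time at which the turn to the next page began


def page_at(pages: List[PageImage], t: float) -> Optional[PageImage]:
    """Return the page being imaged at time t, or None (e.g. mid page turn)."""
    starts = [p.start_s for p in pages]  # pages are assumed sorted by start_s
    i = bisect_right(starts, t) - 1
    if i >= 0 and pages[i].start_s <= t < pages[i].end_s:
        return pages[i]
    return None


def targets_of(pages: List[PageImage], command_times: List[float]) -> List[int]:
    """Page numbers selected by voice commands uttered at the given times."""
    hits = (page_at(pages, t) for t in command_times)
    return [p.page_no for p in hits if p is not None]
```

A command uttered while no page is stably in view (for example, during a page turn) selects nothing in this sketch; how the actual system treats that case is not stated in this section.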

ここで、「原稿」とは、表面または表面および裏面の両方に文字や図形が形成された用紙のことをいう。「複数枚の原稿」とは、複数枚の原稿が束ねられている状態をいう。なお、ここでいう「束ねられている」とは、例えば、複数枚の原稿が製本された状態であってもよいし、一部がホチキス等で綴じられた状態であってもよい。また、ホチキス等で綴じられることなく個々の原稿が独立しているが１つの書類として揃えられている状態であってもよい。 Here, the term "manuscript" refers to a sheet of paper on which characters or figures are formed on the front or both the front and back sides. "Multiple originals" refers to a state in which multiple originals are bundled. Note that "bundled" here may mean, for example, a state in which a plurality of manuscripts are bound into a book, or a state in which some of the manuscripts are bound with staples or the like. Alternatively, individual manuscripts may be independent without being bound with staples or the like, but they may be arranged as one document.

また、「予め定められた音声」とは、後述する記憶部１３（図２参照）のデータベースにおいて、予め定められた処理を示す情報と対応付けられて記憶されている音声のことをいう。「予め定められた音声」としては、例えば、予め定められた音声であることを示す音声、予め定められた処理の対象となる原稿を指定するための音声、その処理の内容を示す音声、録音された音声を削除するための音声等が挙げられる。以下、「予め定められた音声」のことを「音声コマンド」と呼ぶ。なお、音声コマンドの具体例、および管理サーバ１０による上述の処理の詳細については後述する。 Furthermore, "predetermined voice" refers to voice that is stored in a database of the storage unit 13 (see FIG. 2), which will be described later, in association with information indicating a predetermined process. "Predetermined audio" includes, for example, a voice indicating that it is a predetermined voice, a voice for specifying a document to be subjected to predetermined processing, a voice indicating the content of the processing, and a recording. Examples include audio for deleting audio that has been recorded. Hereinafter, the "predetermined voice" will be referred to as a "voice command." Note that specific examples of voice commands and details of the above-mentioned processing by the management server 10 will be described later.
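As a rough illustration of how voice commands might be stored in association with information indicating processing, the following sketch uses an in-memory dictionary. The phrases, the `CommandKind` categories, and the parameter dictionaries are hypothetical examples chosen to mirror the four kinds of voice listed above; the actual contents of the database are not given in this section.

```python
from enum import Enum
from typing import Any, Dict, Optional, Tuple


class CommandKind(Enum):
    MARKER = "marker"              # voice indicating that a command follows
    TARGET = "target"              # voice designating the document to process
    PROCESS = "process"            # voice indicating the content of processing
    DELETE_AUDIO = "delete_audio"  # voice for deleting recorded voice


# Hypothetical entries: phrase -> (kind, processing information).
VOICE_COMMAND_DB: Dict[str, Tuple[CommandKind, Any]] = {
    "command": (CommandKind.MARKER, None),
    "this page": (CommandKind.TARGET, "current"),
    "left page": (CommandKind.TARGET, "left"),
    "grayscale": (CommandKind.PROCESS, {"color": "gray"}),
    "no printing": (CommandKind.PROCESS, {"printable": False}),
    "erase voice": (CommandKind.DELETE_AUDIO, None),
}


def lookup(utterance: str) -> Optional[Tuple[CommandKind, Any]]:
    """Normalize case and whitespace, then look the phrase up in the table."""
    key = " ".join(utterance.lower().split())
    return VOICE_COMMAND_DB.get(key)
```

An utterance that is not registered returns `None`, which corresponds to ordinary speech that is not a predetermined voice.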

ユーザ端末３０は、ユーザが操作するスマートフォン、タブレット端末等の情報処理装置である。例えば、ユーザ端末３０は、ユーザの撮像操作に基づき複数枚の原稿を連続的に撮像することで１つの動画像のデータを生成する。撮像操作を行うユーザは、片手でユーザ端末３０を持ち、複数枚の原稿をユーザ端末３０の表示部３６（図３参照）に表示させながら動画像を撮像する。そして、もう一方の手で複数枚の原稿を上から順番に１枚ずつめくる動作を行うことですべての原稿の表面および裏面の動画像の撮像を行う。例えば、ユーザは、複数枚の原稿を上から順番に１枚ずつめくる動作を右手で行いながら、左手に持ったユーザ端末３０で原稿を被写体とする動画像の撮像を行う。 The user terminal 30 is an information processing device such as a smartphone or a tablet terminal operated by a user. For example, the user terminal 30 generates one moving image data by continuously capturing images of a plurality of original documents based on the user's imaging operation. A user who performs an imaging operation holds the user terminal 30 with one hand and captures a moving image while displaying a plurality of documents on the display unit 36 of the user terminal 30 (see FIG. 3). Then, by sequentially turning over the plurality of documents one by one from the top with the other hand, moving images of the front and back sides of all the documents are captured. For example, the user uses the user terminal 30 held in his left hand to capture a moving image of the document as a subject while sequentially turning over a plurality of documents one by one from the top with his right hand.

ここで、撮像操作を行うユーザは、予め定められた処理の対象となる原稿が撮像の途中で表示部３６に表示されると、その原稿が撮像対象となっている時間帯のいずれかのタイミングで音声コマンドを発する。ユーザ端末３０は、ユーザから発せられた音声コマンドを録音し、その音声のデータと、撮像した複数枚の原稿の動画像のデータとを同一の時間軸で対応付けて記憶する。ユーザ端末３０は、対応付けて記憶した動画像のデータと音声のデータとの組み合わせのデータを管理サーバ１０に向けて送信する。なお、ユーザ端末３０によるこれらの処理の詳細については後述する。 Here, when a document to be subjected to predetermined processing appears on the display unit 36 during imaging, the user performing the imaging operation utters a voice command at any point within the period during which that document is the imaging target. The user terminal 30 records the voice command uttered by the user and stores the audio data in association with the captured moving image data of the plurality of documents on the same time axis. The user terminal 30 transmits the combined moving image and audio data, stored in association, to the management server 10. Details of this processing by the user terminal 30 are described later.
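One way to picture "the same time axis" is to convert both frame indices and audio sample indices to seconds from the start of recording. The sketch below assumes a constant frame rate and a constant sample rate with both streams starting at time zero; the `VideoAudioData` class and its fields are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoAudioData:
    fps: float        # video frames per second (assumed constant)
    sample_rate: int  # audio samples per second (assumed constant)
    n_frames: int     # number of video frames in the recording

    def frame_time(self, frame_idx: int) -> float:
        """Seconds from the start of recording for a video frame."""
        return frame_idx / self.fps

    def sample_time(self, sample_idx: int) -> float:
        """Seconds from the start of recording for an audio sample."""
        return sample_idx / self.sample_rate

    def frame_at_sample(self, sample_idx: int) -> int:
        """Index of the video frame on screen when the audio sample was recorded."""
        t = self.sample_time(sample_idx)
        return min(int(t * self.fps), self.n_frames - 1)
```

With this mapping, the position of a detected voice command in the audio stream can be translated into the frame, and hence the document, that was being imaged at that moment.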

なお、上述の情報処理システム１を構成する管理サーバ１０およびユーザ端末３０の各々の機能は一例であり、情報処理システム１全体として上述の処理を実現させる機能を備えていればよい。このため、上述の処理を実現させる機能のうち、一部または全部を情報処理システム１内で分担してもよいし協働してもよい。すなわち、管理サーバ１０の機能の一部または全部をユーザ端末３０の機能としてもよいし、ユーザ端末３０の機能の一部または全部を管理サーバ１０の機能としてもよい。さらに、情報処理システム１を構成する管理サーバ１０およびユーザ端末３０の各々の機能の一部または全部を、図示せぬ他のサーバや撮像装置等に移譲してもよい。これにより、情報処理システム１全体としての処理が促進され、また、処理を補完し合うことも可能となる。 Note that the functions of the management server 10 and the user terminal 30 constituting the information processing system 1 described above are merely examples; it suffices that the information processing system 1 as a whole has the functions for realizing the processing described above. Accordingly, some or all of the functions for realizing that processing may be shared or performed cooperatively within the information processing system 1. That is, some or all of the functions of the management server 10 may be functions of the user terminal 30, and some or all of the functions of the user terminal 30 may be functions of the management server 10. Furthermore, some or all of the functions of the management server 10 and the user terminal 30 may be delegated to another server, imaging device, or the like (not shown). This facilitates processing in the information processing system 1 as a whole and also allows the components to complement one another's processing.

（管理サーバのハードウェア構成）
図２は、本実施の形態が適用される情報処理装置としての管理サーバ１０のハードウェア構成を示す図である。
管理サーバ１０は、制御部１１と、メモリ１２と、記憶部１３と、通信部１４と、操作部１５と、表示部１６とを有している。これらの各部は、データバス、アドレスバス、ＰＣＩ（Peripheral Component Interconnect）バス等で接続されている。 (Hardware configuration of management server)
FIG. 2 is a diagram showing the hardware configuration of the management server 10 as an information processing device to which this embodiment is applied.
The management server 10 includes a control unit 11, a memory 12, a storage unit 13, a communication unit 14, an operation unit 15, and a display unit 16. These units are connected via a data bus, an address bus, a PCI (Peripheral Component Interconnect) bus, and the like.

制御部１１は、ＯＳ（基本ソフトウェア）やアプリケーションソフトウェア（応用ソフトウェア）等の各種ソフトウェアの実行を通じて管理サーバ１０の機能の制御を行うプロセッサである。制御部１１は、例えばＣＰＵ（Central Processing Unit）で構成される。メモリ１２は、各種ソフトウェアやその実行に用いるデータ等を記憶する記憶領域であり、演算に際して作業エリアとして用いられる。メモリ１２は、例えばＲＡＭ（Random Access Memory）等で構成される。 The control unit 11 is a processor that controls the functions of the management server 10 through execution of various software such as an OS (basic software) and application software. The control unit 11 is composed of, for example, a CPU (Central Processing Unit). The memory 12 is a storage area that stores various software and data used for its execution, and is used as a work area during calculations. The memory 12 is composed of, for example, a RAM (Random Access Memory).

記憶部１３は、各種ソフトウェアに対する入力データや各種ソフトウェアからの出力データ等を記憶する記憶領域である。記憶部１３は、例えばプログラムや各種設定データなどの記憶に用いられるＨＤＤ（Hard Disk Drive）やＳＳＤ（Solid State Drive）、半導体メモリ等で構成される。記憶部１３には、各種情報を記憶するデータベースとして、例えば、ユーザ端末３０から送信されてきた、動画像のデータと音声のデータとを組み合わせたデータが記憶された動画音声ＤＢ８０１と、動画像のデータから生成された複数枚の原稿の各々の静止画像のデータが記憶された静止画像ＤＢ８０２と、音声コマンドと予め定められた処理を示す情報とが対応付けられて記憶された音声コマンドＤＢ８０３等が格納されている。 The storage unit 13 is a storage area that stores input data for, and output data from, various software. The storage unit 13 is composed of, for example, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or semiconductor memory used to store programs, various setting data, and the like. As databases storing various information, the storage unit 13 holds, for example, a video and audio DB 801 storing the combined moving image and audio data transmitted from the user terminal 30, a still image DB 802 storing the still image data of each of the plurality of documents generated from the moving image data, and a voice command DB 803 storing voice commands in association with information indicating predetermined processing.

通信部１４は、ネットワーク９０を介してユーザ端末３０および外部との間でデータの送受信を行う。操作部１５は、例えばキーボード、マウス、機械式のボタン、スイッチで構成され、入力操作を受け付ける。操作部１５には、表示部１６と一体的にタッチパネルを構成するタッチセンサも含まれる。表示部１６は、例えば情報の表示に用いられる液晶ディスプレイや有機ＥＬ（＝Electro Luminescence）ディスプレイで構成され、画像やテキストのデータなどを表示する。 The communication unit 14 transmits and receives data between the user terminal 30 and the outside via the network 90. The operation unit 15 includes, for example, a keyboard, a mouse, mechanical buttons, and switches, and accepts input operations. The operation unit 15 also includes a touch sensor that integrally forms a touch panel with the display unit 16. The display unit 16 includes, for example, a liquid crystal display or an organic EL (Electro Luminescence) display used for displaying information, and displays images, text data, and the like.

（ユーザ端末のハードウェア構成）
図３は、本実施の形態が適用される情報処理装置としてのユーザ端末３０のハードウェア構成を示す図である。
ユーザ端末３０は、図２の管理サーバ１０の制御部１１、メモリ１２、記憶部１３、通信部１４、操作部１５、および表示部１６の各々に対応する、制御部３１、メモリ３２、記憶部３３、通信部３４、操作部３５、および表示部３６の各々を有しており、これらの構成に加えて、撮像部３７と、録音部３８とを有している。これらの各部は、データバス、アドレスバス、ＰＣＩバス等で接続されている。 (Hardware configuration of user terminal)
FIG. 3 is a diagram showing the hardware configuration of the user terminal 30 as an information processing device to which this embodiment is applied.
The user terminal 30 includes a control unit 31, a memory 32, a storage unit 33, a communication unit 34, an operation unit 35, and a display unit 36, which correspond respectively to the control unit 11, memory 12, storage unit 13, communication unit 14, operation unit 15, and display unit 16 of the management server 10 in FIG. 2. In addition to these components, it includes an imaging unit 37 and a recording unit 38. These units are connected via a data bus, an address bus, a PCI bus, and the like.

撮像部３７は、カメラ等で構成され、カメラのファインダとしても機能する表示部３６に表示された被写体としての複数の原稿を撮像して、動画像のデータとして取得する。録音部３８は、ユーザから発せられた音声を録音して、音声のデータとして取得する。 The imaging unit 37 is composed of a camera or the like. It images the plurality of documents that appear as the subject on the display unit 36, which also functions as the camera's viewfinder, and acquires the result as moving image data. The recording unit 38 records the voice uttered by the user and acquires it as audio data.

(Functional configuration of control unit of management server)
FIG. 4 is a diagram showing the functional configuration of the control unit 11 of the management server 10.
In the control unit 11 of the management server 10, an information acquisition unit 101, a video and audio management unit 102, an audio detection unit 103, an image generation unit 104, a command management unit 105, a processing control unit 106, and a transmission control unit 107 function.

The information acquisition unit 101 acquires information via the communication unit 14 (see FIG. 2). For example, the information acquisition unit 101 acquires data, transmitted from the user terminal 30, that combines moving image data and audio data. Specifically, it acquires the combination of moving image data that captures, from beginning to end, a plurality of documents being turned over one sheet at a time, and audio data of the voice uttered by the user during the imaging. Hereinafter, this combination of moving image data and audio data is referred to as "video audio data".

The video and audio management unit 102, as a management means, stores the video audio data acquired by the information acquisition unit 101 in the video audio DB 801 (see FIG. 2) of the storage unit 13 and manages it. The video audio data includes the appearance of each of the plurality of documents and the image formed on each of them. Here, a "formed image" means an image, such as characters and figures, printed on the printing surface of the sheet serving as the document. The "printing surface" may be the front side only or both sides (front and back).

As described above, the video and audio management unit 102 manages video audio data in which the moving image data and the audio data are associated on the same time axis. For example, if the user utters a voice command 10 seconds after imaging and recording start, information indicating that voice command is associated with information indicating the n-th document (n is an integer of 1 or more) that was being imaged at that time. A specific example of video audio data will be described later with reference to FIG. 9.
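The time-axis association described above can be illustrated with a minimal sketch. It assumes that the period during which each document is imaged is already known as a (start, end) interval in seconds from the start of recording. The names `PageInterval` and `page_at` and the sample timings are hypothetical and do not appear in this specification.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class PageInterval:
    label: str    # which document(s) are being imaged (hypothetical labels)
    start: float  # seconds from the start of recording, inclusive
    end: float    # seconds from the start of recording, exclusive


def page_at(intervals, t):
    """Return the label of the document imaged at time t, or None.

    `intervals` must be sorted by start time and must not overlap.
    """
    starts = [iv.start for iv in intervals]
    i = bisect_right(starts, t) - 1  # last interval starting at or before t
    if i >= 0 and t < intervals[i].end:
        return intervals[i].label
    return None  # t falls before the first interval or during a page turn


# Assumed sample timings, for illustration only.
intervals = [
    PageInterval("cover", 0.0, 4.0),
    PageInterval("pages 1-2", 5.0, 12.0),
    PageInterval("pages 3-4", 13.0, 20.0),
]

# A voice command uttered 10 seconds after recording started.
print(page_at(intervals, 10.0))
```

A command whose timestamp falls in a gap between intervals (while a page is being turned) yields `None` in this sketch. How such a case is resolved is not specified in the description.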

The audio detection unit 103, as a detection means, analyzes the audio data in the video audio data stored in the database and managed by the video and audio management unit 102, and detects voice commands. The method of detecting voice commands is not particularly limited, and conventional methods can be used. For example, feature quantities such as the intensity, frequency, and intervals of the sounds contained in the audio data are extracted, their degree of match with phoneme and word models stored in advance in a database (for example, the voice command DB 803 of the storage unit 13) is calculated, and the sounds are recognized as words.

The types of voice commands detected by the audio detection unit 103 include, for example, a voice command indicating that the utterance is a voice command, a voice command for designating the document to be subjected to predetermined processing, a voice command indicating the content of the predetermined processing, and a voice command for deleting a recorded voice command.

Of these types, the voice command for designating the document to be subjected to predetermined processing and the voice command for deleting a recorded voice command include, for example, a voice command for designating one target document among the plurality of documents in a two-page spread. The voice command indicating the content of the predetermined processing includes, for example, voice commands for designating the output format, the attributes, and the configuration of the still image data to be processed. Specific examples of voice commands indicating the content of the predetermined processing will be described later with reference to FIG. 12.

The image generation unit 104, as a generation means, generates still image data for each of the plurality of documents from the moving image data in the video audio data stored in the database and managed by the video and audio management unit 102. The still image data generated by the image generation unit 104 is stored and managed in the still image DB 802 (see FIG. 2) of the storage unit 13.

The command management unit 105 associates each voice command with information indicating predetermined processing, and stores and manages the association in the voice command DB 803. A specific example of the information stored in the voice command DB 803 will be described later with reference to FIG. 12.

The processing control unit 106, as a processing execution control means, performs control to execute predetermined processing on, among the still image data of the plurality of documents generated by the image generation unit 104, the still image data of the document that was being imaged when the voice command was uttered.

The transmission control unit 107 performs control to transmit various information to the user terminal 30 or to an external destination via the communication unit 14 (see FIG. 2). For example, the transmission control unit 107 performs control to transmit the still image data of each of the plurality of documents, generated by the image generation unit 104, to the user terminal 30.

(Functional configuration of control unit of user terminal)
FIG. 5 is a diagram showing the functional configuration of the control unit 31 of the user terminal 30.
In the control unit 31 of the user terminal 30, a display control unit 301, an imaging control unit 302, a recording control unit 303, a transmission control unit 304, and an information acquisition unit 305 function.

The display control unit 301 performs control to display various information on the display unit 36 (see FIG. 3). For example, the display control unit 301 performs control to display the plurality of documents to be imaged on the display unit 36. The display control unit 301 also performs control to display, on the display unit 36, still image data acquired by the information acquisition unit 305 described later.

The imaging control unit 302, as an imaging means, performs control to cause the imaging unit 37 (see FIG. 3) to capture a moving image whose subjects are the plurality of documents. Specifically, the imaging control unit 302 performs control to cause the imaging unit 37 to capture continuously, from beginning to end, the plurality of documents being turned over one sheet at a time. The recording control unit 303, as a recording means, performs control to cause the recording unit 38 to record the voice uttered by the user in parallel with the imaging by the imaging unit 37.

The transmission control unit 304 performs control to transmit various information to the management server 10 or to an external destination via the communication unit 34 (see FIG. 3). For example, the transmission control unit 304 performs control to transmit, to the management server 10, video audio data that combines the moving image data generated by the imaging unit 37 with the audio data generated by the recording unit 38.

The information acquisition unit 305 acquires various information via the communication unit 34 (see FIG. 3). For example, the information acquisition unit 305 acquires still image data transmitted from the management server 10. The still image data acquired by the information acquisition unit 305 includes still image data on which processing was executed in response to a voice command.

(User terminal processing)
FIG. 6 is a flowchart showing the flow of processing by the user terminal 30.
First, in response to the user's imaging operation, the user terminal 30 displays the plurality of documents to be imaged on the display unit 36 (step 601), and starts imaging the documents being turned over one sheet at a time and recording audio (step 602). As a result, voice commands uttered by the user during imaging and recording are stored in association with the moving image data.

When the capturing of the moving image and the recording of the audio are complete (YES in step 603), the user terminal 30 transmits the video audio data to the management server 10 (step 604). If the capturing and recording are not complete (NO in step 603), step 603 is repeated.

Thereafter, when still image data is transmitted from the management server 10 (YES in step 605), the user terminal 30 acquires the transmitted data (step 606) and displays it on the display unit 36 (step 607). If still image data has not been transmitted (NO in step 605), step 605 is repeated. The still image data acquired in step 606 includes still image data on which processing was executed in response to a voice command.

(Management server processing)
FIG. 7 is a flowchart showing the flow of processing by the management server 10.
When video audio data is transmitted from the user terminal 30 (YES in step 701), the management server 10 acquires the transmitted data (step 702), stores it in a database, and manages it (step 703). Specifically, the management server 10 stores and manages the video audio data in the video audio DB 801 (see FIG. 2) of the storage unit 13. If video audio data has not been transmitted (NO in step 701), step 701 is repeated until it is transmitted.

The management server 10 generates still image data for each of the plurality of documents from the moving image data in the video audio data (step 704), stores the generated data in a database, and manages it (step 705). Specifically, the management server 10 stores and manages the still image data in the still image DB 802 (see FIG. 2) of the storage unit 13.

The management server 10 analyzes the audio data in the video audio data stored in the database. When it detects a voice command (YES in step 706), it executes predetermined processing on, among the still image data of the plurality of documents generated in step 704, the still image data of the document that was being imaged when the detected voice command was uttered (step 707). If no voice command is detected (NO in step 706), the processing ends.

(Specific example)
FIG. 8 is a diagram showing a specific example of the user's operation of the user terminal 30.
Using the user terminal 30, the user simultaneously captures a moving image of a plurality of documents and records the user's own voice. The combination of the moving image data captured and the audio data recorded by the user terminal 30 is transmitted to the management server 10 as video audio data, and is stored and managed in the video audio DB 801. For example, suppose that the plurality of documents is a book, as shown in FIG. 8. In this case, the user captures the moving image while turning pages with the hand that is not holding the user terminal 30. A "page" here corresponds to one side of a document that has images such as characters and figures formed on both sides.

A user capturing a moving image of a book utters a voice command at any time while the document to be processed is the imaging target. For example, suppose the user wants predetermined processing executed on each of page 3 (the left page) and page 4 (the right page) of a spread. In this case, the user starts imaging before turning the cover. Turning the cover makes pages 1 and 2 of the spread the imaging target, and turning one more sheet makes pages 3 and 4 the imaging target. The user utters the voice command at any time from that point until turning one more sheet makes pages 5 and 6 the imaging target.

For example, suppose that, of pages 3 and 4 of the spread, the user wants page 3 output as an electronic file in CSV (Comma-Separated Values) format rather than the default PDF (Portable Document Format), and wants page 4 not to be output. In this case, the user utters in succession voice commands such as "command start", "left page", "CSV output", "command start", "right page", and "output prohibition". Here, "command start", "left page", and "CSV output" form one set of voice commands for page 3, and "command start", "right page", and "output prohibition" form one set of voice commands for page 4.

Among these voice commands, "command start" is a sound indicating that a predetermined utterance follows. That is, it is a voice command indicating that the utterances from this point on, including the current one, are voice commands. "Left page" is a voice command indicating that the document to be processed is the left page of the spread (page 3), and "right page" is a voice command indicating that the document to be processed is the right page of the spread (page 4). "CSV output" and "output prohibition" are both voice commands indicating the content of predetermined processing. When a voice command indicating the content of predetermined processing is uttered, one instruction is recognized as complete.
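The grouping of utterances into instructions can be sketched as a small state machine. This is only an illustration under stated assumptions: speech recognition has already produced a list of phrases, the phrase sets contain only the commands named in this example, and the function name `parse_instructions` is hypothetical.

```python
START = "command start"
TARGETS = {"left page": "left", "right page": "right"}
ACTIONS = {"CSV output", "output prohibition"}  # only the commands in this example


def parse_instructions(phrases):
    """Group recognized phrases into (target, action) instructions."""
    instructions = []
    active = False  # True once "command start" has been heard
    target = None   # page side designated within the current set
    for phrase in phrases:
        if phrase == START:
            active, target = True, None
        elif not active:
            continue  # speech outside a command set is ignored
        elif phrase in TARGETS:
            target = TARGETS[phrase]
        elif phrase in ACTIONS and target is not None:
            instructions.append((target, phrase))
            # The processing-content command completes one instruction.
            active, target = False, None
    return instructions


spoken = [
    "command start", "left page", "CSV output",
    "command start", "right page", "output prohibition",
]
print(parse_instructions(spoken))
```

In this sketch, a processing-content command uttered without a preceding target is ignored. The description does not state how such input is handled, so that choice is an assumption.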

FIG. 9 is a diagram showing a specific example of still image data generated from the video audio data of FIG. 8.
The left part of FIG. 9 is a conceptual diagram representing, on a time axis, the video audio data generated in the specific example of FIG. 8. The upper row is the moving image data, the lower row is the audio data, and the horizontal axis is time. "page1,2" indicates pages 1 and 2 of a spread, and "page3,4" indicates pages 3 and 4 of a spread. The dividing line between them indicates when the imaging target switched.

As described above, the user uttered voice commands for each of pages 3 and 4 of the spread. Therefore, the audio data recorded while pages 3 and 4 were the imaging target contains voice commands, whereas the audio data recorded while pages 1 and 2 were the imaging target contains none.

The management server 10 generates still image data by cutting out an image for each page from the moving image data in the video audio data. The generated still image data is stored and managed in the still image DB 802. Still image data may be generated uniformly for all pages, or the audio data in the video audio data may be analyzed so that still image data is generated only for pages for which a voice command was recorded. In the specific example of FIG. 9, still image data (in PDF format) is generated for each of pages 3 and 4, for which voice commands were recorded (see the right part of FIG. 9).

The management server 10 executes processing according to the voice command on the still image of each page for which a voice command was recorded. In the specific example of FIG. 9, the voice command for page 3 instructs output in CSV format, so the still image data generated in PDF format is converted to CSV format and then output (see the right part of FIG. 9). The voice command for page 4 instructs that the page not be output, so the still image data generated in PDF format is not output (see the right part of FIG. 9).
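The per-page handling in this example can be sketched as a function that decides, for each page, whether and in which format it is output. It assumes PDF as the default format, as in this example, and covers only the two commands used here. The name `plan_outputs` and the dictionary layout are hypothetical, and the actual format conversion is not shown.

```python
DEFAULT_FORMAT = "PDF"  # default output format in this example


def plan_outputs(pages, instructions):
    """Return {page: output format} for every page that will be output.

    pages: page numbers for which still image data was generated.
    instructions: {page: processing-content command} from detected voice commands.
    """
    plan = {}
    for page in pages:
        action = instructions.get(page)
        if action == "output prohibition":
            continue  # the page is not output at all
        if action == "CSV output":
            plan[page] = "CSV"  # converted from the default format before output
        else:
            plan[page] = DEFAULT_FORMAT  # no command: keep the default
    return plan


commands = {3: "CSV output", 4: "output prohibition"}
print(plan_outputs([3, 4], commands))
print(plan_outputs([1, 2, 3, 4], commands))
```

The second call corresponds to the variant in which still image data is generated uniformly for all pages, so pages without a voice command keep the default format.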

FIG. 10 is a diagram showing a specific example of processing that cuts out documents from the moving image data in the video audio data and generates still image data.
The management server 10 cuts out still image data at predetermined intervals from the moving image data in the video audio data, compares each pair of consecutive still images to extract their difference, and, based on whether the difference exceeds a predetermined threshold, identifies when a page was turned and when the page being imaged changed.

For example, as shown in FIG. 10, the management server 10 compares still image data G1 cut out at timing t1 with still image data G2 cut out at timing t2 and extracts the difference. Since there is no difference between G1 and G2, the management server 10 determines that the page was not turned between timing t1 and timing t2.

The management server 10 also compares still image data G2 cut out at timing t2 with still image data G3 cut out at timing t3 and extracts the difference. G2 was cut out just before the page started to be turned, and G3 was cut out while the page was being turned, so the difference between them is large and exceeds the predetermined threshold. In this case, the management server 10 determines that the page was being turned between timing t2 and timing t3.

The management server 10 also compares still image data G3 cut out at timing t3 with still image data G4 cut out at timing t4 and extracts the difference. G3 was cut out while the page was being turned, and G4 was cut out just after the page was turned, so the difference between them is large and exceeds the predetermined threshold. In this case, the management server 10 determines that the page finished turning at some point between timing t3 and timing t4.

Based on the results of these determinations, the management server 10 records, for each page imaged in the moving image data of the video audio data, timestamps indicating when imaging of that page started and when it ended. This makes it possible to generate still image data for each page.
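The difference-based determination for G1 to G4 can be sketched as follows. This is a simplified illustration, not the method of the specification: frames are assumed to be flat lists of grayscale values, the difference is assumed to be a mean absolute difference, and a frame with a large difference on both sides (such as G3) is treated as mid-turn and discarded. The function names, pixel values, and threshold are hypothetical.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def page_intervals(samples, threshold):
    """Return (start, end) timestamps of periods in which the view was stable.

    samples: list of (timestamp, frame) cut out at regular intervals.
    A run of at least two consecutive samples whose differences stay at or
    below the threshold is treated as one page being imaged.
    """
    intervals = []
    if not samples:
        return intervals
    run_start = run_end = samples[0][0]
    run_len = 1
    for (_, prev), (t_cur, cur) in zip(samples, samples[1:]):
        if mean_abs_diff(prev, cur) > threshold:
            # Large difference: a page turn started, continues, or just ended.
            if run_len >= 2:
                intervals.append((run_start, run_end))
            run_start, run_len = t_cur, 1
        else:
            run_len += 1
        run_end = t_cur
    if run_len >= 2:
        intervals.append((run_start, run_end))
    return intervals


# Assumed sample data: G1 and G2 show one spread, G3 is mid-turn,
# G4 and G5 show the next spread.
g1 = g2 = [10, 10, 10, 10]
g3 = [10, 200, 200, 10]
g4 = g5 = [90, 90, 90, 90]
samples = [(1.0, g1), (2.0, g2), (3.0, g3), (4.0, g4), (5.0, g5)]
print(page_intervals(samples, threshold=20))
```

With these assumed values, the first spread is stable from t1 to t2 and the next from t4 onward, which matches the determinations described for FIG. 10. A page visible in only a single sample would be dropped by this sketch, so the sampling interval would need to be short relative to how long each page is shown.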

FIG. 11 is a diagram showing a specific example of processing that divides a two-page spread document to generate two pieces of still image data.
When the document imaged in the moving image data of the video audio data is a two-page spread, it is divided into the left page and the right page, and each is stored as still image data in a database (for example, the still image DB 802 of the storage unit 13 in FIG. 2).

For example, as in the upper part of FIG. 11, when a page bearing the letter "A" (the left page) and a page bearing the letter "B" (the right page) are stored as moving image data of a two-page spread, the spread is divided to generate still image data of the page bearing "A" and still image data of the page bearing "B". Likewise, in the specific example of FIG. 10 described above, the two-page spread bearing the letters "A" and "B" and the two-page spread bearing the letters "C" and "D" are each divided, generating four pieces of still image data bearing the letters "A" to "D" respectively.
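The division of a spread can be sketched with an image represented as a list of pixel rows. The sketch assumes that the boundary between the two pages lies at the horizontal center of the frame, which the description does not state. The name `split_spread` is hypothetical, and letters stand in for pixel values.

```python
def split_spread(image):
    """Split a spread image (a list of pixel rows) into left and right pages.

    Assumes the page boundary lies at the horizontal center of the frame.
    """
    mid = len(image[0]) // 2
    left = [row[:mid] for row in image]
    right = [row[mid:] for row in image]
    return left, right


# A toy 2x4 "image": the left half belongs to page A, the right half to page B.
spread = [list("AABB"), list("AABB")]
left, right = split_spread(spread)
print(left)
print(right)
```

In practice the page boundary would have to be located in the captured frame, for example when the book is not centered. That step is outside this sketch.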

FIG. 12 is a diagram showing specific examples of voice commands indicating the content of predetermined processing.
As shown in FIG. 12, voice commands indicating the content of predetermined processing include, for example, commands for designating the output format of the still image data to be processed, commands for designating its attributes, and commands for designating its configuration at output. Of these, voice commands for designating the output format of still image data include, for example, commands that designate the file format, the color, and the orientation of the still image data.

Specific examples of voice commands for designating the file format of still image data include "output in PDF format" and "output in JPEG format". "Output in PDF format" is a voice command for having the generated still image data output in PDF format. "Output in JPEG format" is a voice command for having the generated still image data output in JPEG (Joint Photographic Experts Group) format.

Specific examples of voice commands for designating the color of still image data include "full color" and "black and white". "Full color" is a voice command for having the generated still image data output in full color. "Black and white" is a voice command for having the generated still image data output in black and white. A specific example of a voice command for designating the orientation of still image data is "rotate 90 degrees right", a voice command for having the generated still image data rotated 90 degrees to the right and output.

Specific examples of voice commands for designating the attributes of still image data include "output prohibition", "editing prohibition", "transcription prohibition", "password setting (yes/no)", and "OCR language setting". "Output prohibition" is a voice command for preventing the generated still image data from being output (for example, printed or converted to a predetermined file format). "Editing prohibition" is a voice command for preventing the generated still image data from being edited (for example, having its content changed).

"Transcription prohibition" is a voice command for preventing at least part of the content of the generated still image data from being copied and reused. "Password setting (yes/no)" is a voice command for designating whether a password is set on the generated still image data. "OCR language setting" is a voice command for setting the language when the documents include one whose language differs from that of the others.

Voice commands for designating the configuration of still image data at output include, for example, "insert blank page" and "delete page". "Insert blank page" is a voice command for inserting a blank page into the generated pieces of still image data. "Delete page" is a voice command for preventing still image data from being generated.

Although not illustrated, a voice command for deleting recorded voice is also provided to handle the case where the user has issued a wrong voice command. For example, assume that a voice command issued while the imaging target was the third and fourth pages of a double-page spread was wrong. In this case, while the imaging target is still the third and fourth pages of the spread, the user who issued the wrong voice command issues, in order, a voice command preset as the command for canceling voice commands and a voice command specifying the page whose command is to be canceled.

For example, to cancel a voice command directed at the third page (the left page), the user issues the voice commands "Command clear" and "Left page". To cancel a voice command directed at the fourth page (the right page), the user issues "Command clear" and "Right page". To cancel all voice commands directed at both the third and fourth pages, the user issues "Command clear" and "All pages".
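
The cancellation sequence can be sketched as a small function that discards the commands pending for one side of the current spread. This is a minimal Python sketch, assuming that pending commands are kept per side in a dictionary; the function name `clear_commands` and the `"left"`/`"right"` keys are hypothetical.

```python
from typing import Dict, List

# Spoken page designation -> sides of the spread it covers.
TARGET_SIDES = {
    "Left page": ["left"],
    "Right page": ["right"],
    "All pages": ["left", "right"],
}


def clear_commands(pending: Dict[str, List[str]],
                   target: str) -> Dict[str, List[str]]:
    """Return a copy of `pending` with the targeted side(s) emptied.

    Called after "Command clear" has been detected; `target` is the page
    designation spoken next.
    """
    if target not in TARGET_SIDES:
        raise ValueError(f"unknown page designation: {target}")
    cleared = {side: list(cmds) for side, cmds in pending.items()}
    for side in TARGET_SIDES[target]:
        cleared[side] = []
    return cleared
```

Returning a copy rather than mutating the input keeps the originally recorded commands available, which is a design choice of this sketch and not something the publication specifies.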

Although the present embodiment has been described above, the present invention is not limited to this embodiment. The effects of the present invention are likewise not limited to those described in the embodiment. For example, the configuration of the information processing system 1 shown in FIG. 1, the hardware configuration of the management server 10 shown in FIG. 2, and the hardware configuration of the user terminal 30 shown in FIG. 3 are all merely examples for achieving the object of the present invention and are not particularly limiting.

The functional configuration of the management server 10 shown in FIG. 4 and the functional configuration of the user terminal 30 shown in FIG. 5 are also merely examples and are not particularly limiting. It suffices that the information processing system 1 of FIG. 1 as a whole has functions capable of executing the processing described above; the functional configuration used to realize those functions is not limited to the examples of FIGS. 4 and 5.

The order of the processing steps of the management server 10 and the user terminal 30 shown in FIGS. 6 and 7, respectively, is also merely an example and is not particularly limiting. The processing need not be performed chronologically in the illustrated order of steps; steps may also be performed in parallel or individually. The specific examples shown in FIGS. 8 to 12 are likewise only examples and are not particularly limiting.

The voice commands in the embodiment described above are words that humans can recognize as meaningful in themselves, but any form that the information processing device can distinguish and recognize suffices. They can therefore be replaced with other utterances. For example, because the voice command "Command start" is long, simply "Command" or the shortened "Start" may be set instead. Similarly, voice commands such as "Prohibit output", "Prohibit editing", and "Prohibit transcription" may be replaced with symbols such as "Prohibit A", "Prohibit B", and "Prohibit C". Allowing voice commands to be replaced with shortened words or symbols in this way reduces the burden on the user who issues them.
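
One way to picture this substitution is a lookup table that maps each shortened utterance back to the command it stands for. The Python sketch below is illustrative only: the table `ALIASES` and the function `canonical_command` are hypothetical names, and the table registers both "Command" and "Start" for convenience, whereas the text describes setting one or the other.

```python
# Hypothetical alias table: shortened utterance -> full voice command.
ALIASES = {
    "Command": "Command start",
    "Start": "Command start",
    "Prohibit A": "Prohibit output",
    "Prohibit B": "Prohibit editing",
    "Prohibit C": "Prohibit transcription",
}


def canonical_command(spoken: str) -> str:
    """Resolve a recognized utterance to its full command name.

    Utterances without a registered alias are returned unchanged, so the
    full command names keep working alongside the short forms.
    """
    key = spoken.strip()
    return ALIASES.get(key, key)
```

Resolving aliases before the command lookup means the rest of the processing only has to handle the full command names.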

DESCRIPTION OF SYMBOLS: 1... Information processing system, 10... Management server, 11... Control unit, 30... User terminal, 90... Network, 101... Information acquisition unit, 102... Video voice management unit, 103... Voice detection unit, 104... Image generation unit, 105... Command management unit, 106... Processing control unit, 107... Transmission control unit, 301... Display control unit, 302... Imaging control unit, 303... Recording control unit, 304... Transmission control unit, 305... Information acquisition unit

Claims

An information processing device comprising a processor,
wherein the processor:
manages moving image data obtained by imaging a plurality of documents and voice data recorded in parallel with the imaging, in association with each other on the same time axis;
detects a predetermined voice from the voice data; and
performs control to execute predetermined processing on, among the still image data of the plurality of documents generated from the moving image data, the still image data of the document that was the imaging target at the timing when the predetermined voice was uttered.

The information processing device according to claim 1, wherein the moving image data is data of a single moving image in which the images formed on each of the plurality of documents are captured continuously.

The information processing device according to claim 2, wherein the data of the single moving image is captured based on a first action and a second action of a user, and the voice data is a recording of voice uttered by the user.

The information processing device according to claim 3, wherein the first action is an action of imaging the plurality of documents, and the second action is an action for making each of the plurality of documents the subject.

The information processing device according to claim 4, wherein the action for making each of the plurality of documents the subject is an action of turning over the plurality of documents one sheet at a time.

The information processing device according to claim 1, wherein the processor performs, as the control to execute the predetermined processing, control to execute processing corresponding to the predetermined voice.

The information processing device according to claim 6, wherein a plurality of the predetermined voices and a plurality of the predetermined processes exist, and each of the predetermined voices and each of the predetermined processes are stored in association with each other in a predetermined database.

The information processing device according to claim 7, wherein the predetermined voice is at least one of: a voice indicating that it is the predetermined voice, a voice for specifying the document to be processed, a voice indicating the content of the processing, and a voice for deleting recorded voice.

The information processing device according to claim 8, wherein the voice for specifying the document to be processed and the voice for deleting recorded voice each include a voice for specifying one target document among a plurality of documents in a double-page spread.

The information processing device according to claim 8, wherein the voice indicating the content of the processing is any one or more of voices for specifying, respectively, the output format, the attributes, and the configuration of the still image data to be processed.

The information processing device according to claim 10, wherein the voice for specifying the output format of the still image data is a voice for specifying any one or more of the file format, the color, and the orientation of the still image data.

The information processing device according to claim 10, wherein the voice for specifying the attributes of the still image data is a voice for specifying any one or more of: whether printing of the still image data is permitted, whether editing is permitted, whether transcription is permitted, whether encryption is applied, and the language used for character recognition.

The information processing device according to claim 10, wherein the voice for specifying the configuration of the still image data is a voice for specifying any one or more of insertion of a document and deletion of a document in the still image data.

An information processing system comprising:
imaging means for capturing a moving image of a plurality of documents;
recording means for recording voice in parallel with the imaging;
management means for acquiring the captured moving image data and the recorded voice data, and managing the moving image data and the voice data in association with each other on the same time axis;
detection means for detecting a predetermined voice from the voice data;
generation means for generating still image data of the plurality of documents from the moving image data; and
processing execution control means for performing control to execute predetermined processing on, among the generated still image data, the still image data of the document that was the imaging target at the timing when the predetermined voice was uttered.

A program for causing a computer to realize:
a function of managing moving image data obtained by imaging a plurality of documents and voice data recorded in parallel with the imaging, in association with each other on the same time axis;
a function of detecting a predetermined voice from the voice data; and
a function of performing control to execute predetermined processing on, among the still image data of the plurality of documents generated from the moving image data, the still image data of the document that was the imaging target at the timing when the predetermined voice was uttered.