CN116230087B

CN116230087B - Method and device for optimizing culture medium components

Info

Publication number: CN116230087B
Application number: CN202211538905.5A
Authority: CN
Inventors: 陈亮; 张博睿; 陈红; 胡志鹏; 梁国龙
Original assignee: Shenzhen Taili Biotechnology Co ltd
Current assignee: Shenzhen Taili Biotechnology Co ltd
Priority date: 2022-12-02
Filing date: 2022-12-02
Publication date: 2024-05-14
Anticipated expiration: 2042-12-02
Also published as: CN116230087A

Abstract

The invention provides a method and a device for optimizing culture medium components. The method comprises: taking each component of a culture medium as an input feature and the daily response as the target value, establishing a machine learning model, and calculating correlation coefficients; calculating the feature importance of each input feature and classifying the top k features by importance score into a first set; marking all components whose correlation coefficient with the daily response is negative as negative factors and classifying all negative factors into a second set; taking the intersection of the first set and the second set to obtain a negative factor set; and removing the components of one or more negative factor sets from the culture medium component set to obtain optimized culture medium components. According to the invention, the importance of each culture medium component is calculated through machine learning, which takes the interaction among different components in the culture medium into account, and the negatively correlated components are removed by means of correlation analysis. The accuracy of component screening is thereby improved, optimization with a small sample size is achieved, the research and development period is shortened, and repeated tests are reduced.

Description

Method and device for optimizing culture medium components

Technical Field

The invention relates to the field of screening of effective components of a culture medium, in particular to a method and a device for optimizing the components of the culture medium.

Background

Orthogonal test design is a test design method for studying multiple factors at multiple levels. A subset of representative points is selected from the comprehensive (full factorial) test according to orthogonality, and these representative points are uniformly dispersed and neatly comparable. The main tool of orthogonal test design is the orthogonal table: the tester looks up the orthogonal table that matches the number of factors, the number of levels per factor, the interactions and other requirements of the test, and tests the representative points selected according to the orthogonality of the table, so that a result equivalent to that of a large number of comprehensive tests can be achieved with the minimum number of tests.

The existing culture medium component screening method is based on the experience of biological research personnel, refers to related documents, and adds components useful for cell growth and expression into the culture medium. The final composition is then determined by orthogonal experimental design and single factor experimental analysis. However, in orthogonal test designs, when the factors involved in the test are 3 or more and there is an interaction between the factors, the test effort becomes large or even difficult to implement.

Machine learning is a branch of artificial intelligence. Machine learning theory is mainly concerned with designing and analyzing algorithms that enable a computer to learn automatically, and includes methods such as support vector regression (SVR), decision trees, gradient boosting decision trees (GBDT, a boosting-type algorithm), random forests and multi-layer perceptrons (MLP). Machine learning takes the individual components of the medium into account and fits the data by building different models. Its greatest advantage is that a model can be built from existing test results without expert experience and can give accurate predictions. After sufficient training, the optimal culture medium formulation and the importance of each feature (component) can be calculated through machine learning. However, prediction by machine learning relies on a large amount of test data (the number of samples must be at least 10 times the number of features, and in deep learning the number of samples is usually in the tens of thousands), and collecting such data in actual production incurs relatively high time and economic costs.

Therefore, the existing screening method of the effective components of the culture medium has the problems of high cost and low efficiency caused by neglecting interaction among various factors in the culture medium or needing a large number of culture tests.

Disclosure of Invention

The invention mainly aims to provide a method and a device for optimizing culture medium components, which are used for solving the problems of long culture medium optimizing time period and high cost in the prior art.

In order to achieve the above object, according to one aspect of the present invention, there is provided a method for optimizing a medium composition, the method comprising: taking each component of the culture medium as an input characteristic, taking the daily response as a target value, establishing a machine learning model, and calculating a correlation coefficient; calculating the feature importance of each input feature, picking out the first k items of the feature importance scores, and classifying the first k items into a first set; all components with negative correlation coefficients for the daily response are marked as negative factors, and all the negative factors are classified into a second set; taking the intersection of the first set and the second set to obtain a negative factor set; and removing one or more negative factor sets from the culture medium component set to obtain an optimized culture medium component.

Further, the daily response comprises at least one of: cell expression level, cell density and cell viability; preferably, the machine learning model is a regression analysis model, and preferably, the method for calculating the correlation coefficient includes a partial least square method or a Pearson correlation coefficient.

Further, recording all components having negative correlation coefficients for the daily response as negative factors, and classifying all negative factors into a second set includes: negative factors of which the negative correlation coefficients monotonically decrease along with time are selected and classified into a second set.

Further, the sample data set is divided into a first response volume data set and a second response volume data set, wherein the daily response volume of the first response volume data set is higher than the corresponding daily response volume of the second response volume data set; calculating a correlation coefficient matrix of each input characteristic and daily response in the first response data set and the second response data set; screening out all components with the correlation coefficient smaller than 0 and monotonically decreasing along with time according to the correlation coefficient matrix, and marking the components as a second set; preferably, the sample data with the first 20% -30% of the response is divided into a first response data set, and the rest is divided into a second response data set.

Further, after obtaining the optimized culture medium components, the method further comprises a step of experimental verification; preferably, the negative factor set is removed from the culture medium component set in at least one of the following ways: 1) deleting all components of the negative factor set from the culture medium component set; 2) deleting the components in the negative factor set from the culture medium components one by one; 3) removing known essential components from the negative factor set according to known information to obtain an updated negative factor set, and deleting all components in the updated negative factor set from the culture medium component set.

In order to achieve the above object, according to a second aspect of the present invention, there is provided an apparatus for optimizing a medium composition, comprising: the model building module is used for building a machine learning model by taking the culture medium component set as an input characteristic and the daily response as a target value and calculating a correlation coefficient; the important feature selection module is used for calculating the feature importance of each input feature, picking out the first k items of the feature importance scores and classifying the first k items into a first set; the negative factor selecting module is used for marking all components with the negative correlation coefficient of the daily response as negative factors and classifying all the negative factors into a second set; the intersection module is set to take the intersection of the first set and the second set to obtain a negative factor set; and the rejecting module is used for rejecting the components in one or more negative factor sets from the culture medium component set to obtain the optimized culture medium component.

Further, the negative factor selection module includes: a data set dividing module configured to divide the sample data set into a first response volume data set and a second response volume data set, wherein a daily response volume of the first response volume data set is higher than a corresponding daily response volume of the second response volume data set; a correlation coefficient matrix calculation module configured to calculate a correlation coefficient matrix of each input feature and the daily response in the first response data set and the second response data set; the screening module is arranged for screening out all components with the correlation coefficient smaller than 0 and monotonically decreasing along with time according to the correlation coefficient matrix, and recording the components as a second set; preferably, the sample data with the first 20% -30% of the daily response is divided into a first response data set, and the rest is divided into a second response data set.

Further, the apparatus further comprises an experiment verification module configured to perform a biological experiment on the optimized medium composition after the negative factor set is removed from the medium composition set, and to measure a daily response of the optimized medium composition, thereby determining a final negative factor.

Further, the culling module comprises at least one of: a culling sub-module 1 arranged to delete all components of the negative factor set from the culture medium component set; a culling sub-module 2 arranged to delete the components of the negative factor set from the culture medium components one by one; and a culling sub-module 3 arranged to remove known essential components from the negative factor set according to known information to obtain an updated negative factor set, and to delete all components in the updated negative factor set from the culture medium component set.

According to a third aspect of the present invention there is provided a computer readable storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform any one of the above methods for optimizing culture medium components.

According to a fourth aspect of the present invention there is provided a processor for running a program, wherein the program, when run, performs any one of the above methods for optimizing culture medium components.

By applying the technical scheme of the application, the components with negative correlation effect on the culture effect are removed by combining the correlation method of correlation analysis on the basis of calculating the characteristic importance of each culture medium component by machine learning, and in addition, the interaction among different components in the culture medium can be considered by the characteristic importance screening in the machine learning model, so that the accuracy of component screening can be improved, and the better effect can be obtained in a small sample scene. By adopting the machine learning method to model and analyze the data, the research and development period of culture medium optimization is shortened, and repeated tests are reduced.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:

FIG. 1 shows a schematic flow diagram of a method for medium optimization in a preferred embodiment according to the invention;

FIG. 2 is a schematic view showing the construction of a culture medium optimizing apparatus according to a preferred embodiment of the present invention;

FIG. 3 is a block diagram showing the hardware structure of a method for optimizing a medium in a preferred embodiment according to the present invention.

Detailed Description

It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The present application will be described in detail with reference to examples.

As mentioned in the background art, in order to solve the problems of long time period and high cost in the optimization of the culture medium components in the prior art, the inventor tries to improve the existing machine learning method, and finds that the method can shorten the research and development period of the culture medium optimization and reduce repeated experiments, so as to put forward a series of protection schemes of the application.

In a first exemplary embodiment of the present application, a method for optimizing a culture medium is provided, the method comprising the steps of:

S1, taking a culture medium component set as an input characteristic, taking a daily response as a target value, establishing a machine learning model, and calculating a correlation coefficient;

S2, calculating the feature importance of each input feature, picking out the first k items of the feature importance scores, and classifying the first k items into a first set;

S3, marking all components with negative correlation coefficients corresponding to the response quantity as negative factors, and classifying all the negative factors into a second set;

S4, taking an intersection of the first set and the second set to obtain a negative factor set;

S5, removing one or more negative factor sets from the culture medium component set to obtain the optimized culture medium component.
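Steps S2 to S5 reduce to a few set operations once the feature importance scores and correlation coefficients are available. The following is a minimal sketch under stated assumptions: the function names, the component names "A" to "D" and all numeric scores are hypothetical illustrations, not values from the application, and the model training of step S1 is assumed to have been done elsewhere.

```python
def select_negative_factors(importance, correlation, k):
    """Return components that are both top-k important and negatively correlated."""
    # S2: first set = the k components with the highest feature importance scores.
    ranked = sorted(importance, key=importance.get, reverse=True)
    first_set = set(ranked[:k])
    # S3: second set = components whose correlation coefficient is negative.
    second_set = {name for name, r in correlation.items() if r < 0}
    # S4: the negative factor set is the intersection of the two sets.
    return first_set & second_set


def remove_components(medium, negative_factors):
    """S5: drop the negative factors from the component list, keeping the order."""
    return [c for c in medium if c not in negative_factors]


# Hypothetical scores for illustration only (not data from the application).
importance = {"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10}
correlation = {"A": -0.5, "B": 0.2, "C": -0.1, "D": -0.7}

negative = select_negative_factors(importance, correlation, k=3)
optimized = remove_components(["A", "B", "C", "D"], negative)
print(sorted(negative), optimized)  # ['A', 'C'] ['B', 'D']
```

In this sketch component "D" is negatively correlated but falls outside the top k, so it is kept; only components satisfying both criteria are removed.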

In the method for optimizing the culture medium, the feature importance of each culture medium component is calculated by machine learning, and components that are negatively correlated with the culture effect are then removed by means of correlation analysis. In addition, the feature importance screening in the machine learning model takes the interaction among different components in the culture medium into account (during modeling, the model accounts for the interactions among all features, specifically by means such as conditional probability and feature column sampling), so that the accuracy of component screening can be improved and a better effect can be obtained in a small sample scenario. By adopting the machine learning method to model and analyze the data, the research and development period of culture medium optimization is shortened, and repeated tests are reduced.

In the above method, the daily response varies depending on the specific microorganism species or cell type to be cultured; specifically, the daily response includes, but is not limited to, the cell expression level and/or the cell density. The machine learning model can be reasonably selected from existing machine learning methods, such as support vector regression (SVR), decision trees, gradient boosting decision trees (GBDT, a boosting-type algorithm), random forests and multi-layer perceptrons (MLP). In the present application, a regression analysis model is preferred, and the method for calculating the correlation coefficient preferably includes the partial least squares method or the Pearson correlation coefficient.

In step S3 above, the components having negative correlation coefficients with respect to the response are selected as negative factors so that they can be removed from the medium components to obtain an optimized medium. There are several ways of selecting negative factors according to the negative correlation coefficient. To increase the stability of the method, in a preferred embodiment of the present application, marking all components having negative correlation coefficients for the daily response as negative factors and classifying all negative factors into the second set includes: selecting the negative factors whose negative correlation coefficients decrease monotonically with time and classifying them into the second set.

Monotonic decrease describes how a function value changes with x over an interval: if a function f(x) is monotonically decreasing over an interval D, its value keeps decreasing as x increases over D, rather than alternating between increasing and decreasing. By selecting such monotonically decreasing negative factors, the candidate components can be picked out relatively directly and quickly for subsequent biological experimental verification. That is, the candidate medium formulation can be optimized quickly.

In order to screen negative factors relatively more accurately, in a more preferred embodiment, step S3 comprises: dividing the sample data set into a first response data set and a second response data set, wherein the response of the first response data set is higher than the corresponding response of the second response data set; calculating a correlation coefficient matrix of each input feature and the daily response in the first response data set and in the second response data set; and, according to the correlation coefficient matrices, screening out all components whose correlation coefficient is smaller than 0 and decreases monotonically with time as the second set. Negative factors that can be screened out from both the high-response and the low-response data sets give a relatively more accurate indication of a negative effect on the daily response.

The first response volume data set is a data set with high response volume, and the second response volume data set is a data set with low response volume, wherein the threshold value of the response volume is optionally determined. Preferably, the sample data with the first 20% -30% of the response is divided into a first response data set, and the rest is divided into a second response data set.
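The division into a high-response and a low-response data set can be sketched as a simple ranking followed by a cut. This is an illustrative sketch only: the function name `split_by_response`, the sample identifiers and the response values are hypothetical, and ranking each sample by a single response value (for example the final-day response) is an assumption, since the text leaves the threshold open.

```python
def split_by_response(samples, top_fraction=0.25):
    """Split (formula_id, response) pairs into high- and low-response sets."""
    if not 0 < top_fraction < 1:
        raise ValueError("top_fraction must be between 0 and 1")
    # Rank samples from highest to lowest response.
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    # The top fraction forms the first (high-response) set; keep at least one sample.
    n_high = max(1, int(len(ranked) * top_fraction))
    return ranked[:n_high], ranked[n_high:]


# Hypothetical responses for eight formulas (illustration only).
samples = [("f1", 3.0), ("f2", 9.0), ("f3", 5.0), ("f4", 7.0),
           ("f5", 1.0), ("f6", 8.0), ("f7", 2.0), ("f8", 6.0)]
high, low = split_by_response(samples, top_fraction=0.25)
print([s[0] for s in high])  # ['f2', 'f6']
```

Any `top_fraction` between 0.20 and 0.30 corresponds to the preferred range stated above.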

According to the method, machine learning is combined with correlation analysis to remove the negative factors, and the culture effect of the medium formulation optimized in this way is then verified by biological experiments, so that the experimental period can be shortened. Specific ways of removing negative factors for biological experimental verification include, but are not limited to, at least one of the following: 1) deleting all components in the negative factor set from the culture medium component set; 2) deleting the components in the negative factor set from the culture medium components one by one; 3) removing known essential components from the negative factor set according to known information to obtain an updated negative factor set, and deleting all components in the updated negative factor set from the culture medium component set.

The three negative factors eliminating modes can be reasonably selected according to actual needs. Biological experiment verification can also be carried out one by one in three ways.
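The three removal modes can be expressed as three ways of generating candidate formulations for experimental verification. The sketch below is illustrative: the function name `culling_schemes` and the component names are hypothetical, and "one by one" is read here as one candidate formulation per deleted negative factor, which is an interpretation rather than something the text states explicitly.

```python
def culling_schemes(medium, negative_factors, essential=()):
    """Build candidate formulations for the three removal modes."""
    negative_factors = set(negative_factors)
    essential = set(essential)
    # Negative factors in the order in which they appear in the medium.
    neg = [c for c in medium if c in negative_factors]
    # Mode 1: delete every negative factor at once.
    mode1 = [[c for c in medium if c not in neg]]
    # Mode 2: one candidate per negative factor, each deleting a single component.
    mode2 = [[c for c in medium if c != n] for n in neg]
    # Mode 3: keep known essential components, delete the remaining negative factors.
    updated = [n for n in neg if n not in essential]
    mode3 = [[c for c in medium if c not in updated]]
    return mode1, mode2, mode3


# Hypothetical component names (illustration only).
mode1, mode2, mode3 = culling_schemes(
    ["A", "B", "C", "D"], negative_factors={"A", "C"}, essential={"A"}
)
print(mode1)  # [['B', 'D']]
print(mode2)  # [['B', 'C', 'D'], ['A', 'B', 'D']]
print(mode3)  # [['A', 'B', 'D']]
```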

Example 2

The embodiment provides an improved culture medium component screening method based on a machine learning model, which is shown in the attached figure 1, and comprises the following steps:

1) Establish a sample formula database (namely a database of different specific formulas formed by combining multiple components), with components such as L-tryptophan, L-cysteine, L-glycine, L-alanine, manganese sulfate monohydrate, cobalt chloride hexahydrate, pyridoxal hydrochloride, ethanolamine, sodium bicarbonate and Poloxamer 188 (a nonionic linear copolymer with surfactant properties that exhibits antithrombotic, anti-inflammatory and cytoprotective activity in various tissue injury models). The machine learning model is trained on the existing formula database. The input features are X, the formula components, i.e., the collection of the components of all formulas; for example, if the database has 200 formulas and each formula consists of 100 components, X is a 200 × 100 matrix (200 rows, 100 columns). The target value is Y (final cell density or final cell expression level).

2) The coefficient of determination R² is calculated for the machine learning model obtained by training on about 1600 samples (namely 1600 specific formulas) from the culture medium formula database. In statistics, R² measures the proportion of the variability of the dependent variable that is explained by the independent variables, and is used to judge the explanatory power of the regression model. The calculation formula is as follows:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ is the observed response of formula sample $i$, $\hat{y}_i$ is the model prediction for that formula sample, and $\bar{y}$ is the mean response of the formula samples. When R² > 0.80, the machine learning model is considered to predict the formula response with sufficient accuracy. The machine learning model in this embodiment is GBDT, with R² = 0.81.
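As a worked illustration of the acceptance check, the following sketch computes R² under the standard definition of the coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the observed and predicted values are hypothetical; only the 0.80 threshold comes from the text.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: squared error of the model predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # Total sum of squares: squared deviation of the observations from their mean.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical responses and predictions (illustration only).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, predicted)
model_accepted = r2 > 0.80  # threshold stated in the text
print(round(r2, 4), model_accepted)  # 0.98 True
```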

3) The feature importance of each feature is calculated, and the top k features by importance score are picked out. Feature importance is a means of scoring input features according to how useful they are in predicting the target variable. The relative scores highlight which features are most relevant to the target and, conversely, which features are least relevant. The calculation method of feature importance differs between machine learning models. In this embodiment, a GBDT model is used, and the feature importance is calculated based on the average gain of the splits made on each feature.
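An average-gain importance can be sketched as follows, assuming the trained tree ensemble can be flattened into a list of (feature, gain) records, one per split. That record format, the function names and the gain values are hypothetical illustrations; real GBDT libraries expose split gains through their own interfaces.

```python
from collections import defaultdict


def average_gain_importance(splits):
    """Average the split gain per feature over all splits that use the feature."""
    total = defaultdict(float)
    count = defaultdict(int)
    for feature, gain in splits:
        total[feature] += gain
        count[feature] += 1
    return {feature: total[feature] / count[feature] for feature in total}


def top_k(importance, k):
    """Names of the k features with the highest importance scores."""
    return sorted(importance, key=importance.get, reverse=True)[:k]


# Hypothetical split records (illustration only).
splits = [("L-glycine", 4.0), ("L-alanine", 1.0),
          ("L-glycine", 2.0), ("ethanolamine", 5.0)]
importance = average_gain_importance(splits)
print(top_k(importance, 2))  # ['ethanolamine', 'L-glycine']
```

Features that never appear in a split receive no score in this sketch and therefore never enter the top k.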

4) The data set is divided into a high-response data set and a low-response data set. In this embodiment, the samples with the top 25% of the response values are divided into high-response data sets, and the remaining samples are divided into low-response data sets. The following steps are performed in the two data sets, respectively:

A Pearson correlation coefficient matrix of each input feature and the daily response is calculated. The correlation coefficient measures the degree of linear correlation between the studied variables. The larger the absolute value of the correlation coefficient, the stronger the correlation between the two variables. The calculation formula is as follows:

$$r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where $r(X, Y)$ is the correlation coefficient between variable X and variable Y, $\mathrm{Cov}(X, Y)$ is the covariance between variable X and variable Y, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of variable X and variable Y, respectively. Taking daily cell density as the daily response as an example, the correlation coefficient between each input feature $x_1, x_2, x_3, \ldots, x_m$ and each daily response $y_1, y_2, y_3, \ldots, y_n$ is calculated, where m is the total number of culture medium components and n is the number of days of cell growth. This yields the correlation coefficient matrix $R_{m,n}$:

$$R_{m,n} = \begin{pmatrix} r(x_1, y_1) & r(x_1, y_2) & \cdots & r(x_1, y_n) \\ r(x_2, y_1) & r(x_2, y_2) & \cdots & r(x_2, y_n) \\ \vdots & \vdots & \ddots & \vdots \\ r(x_m, y_1) & r(x_m, y_2) & \cdots & r(x_m, y_n) \end{pmatrix}$$

For row i, if the elements in the row are all less than 0, this means that the component has a negative effect on cell density.
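The matrix and the row-wise sign check can be sketched with the standard library alone. The function names, the feature names "x1" and "x2" and the numbers are hypothetical; each feature is a column of values over the samples, and each daily response is a list of values over the same samples.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient: Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    # The common 1/n factors of covariance and standard deviations cancel.
    return cov / (sd_x * sd_y)


def correlation_matrix(features, daily_responses):
    """Row per component, column per day: r(x_i, y_j)."""
    return {name: [pearson(col, day) for day in daily_responses]
            for name, col in features.items()}


def all_negative_rows(matrix):
    """Components whose coefficients are negative on every day."""
    return [name for name, row in matrix.items() if all(r < 0 for r in row)]


# Hypothetical data: 4 samples, 2 components, 2 measurement days.
features = {"x1": [1.0, 2.0, 3.0, 4.0], "x2": [4.0, 3.0, 2.0, 1.0]}
daily_responses = [[4.0, 3.0, 2.0, 1.0], [8.0, 5.0, 4.0, 1.0]]
matrix = correlation_matrix(features, daily_responses)
print(all_negative_rows(matrix))  # ['x1']
```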

If r(x_i, y_1) > r(x_i, y_2) > … > r(x_i, y_n), i.e., the correlation coefficient of component i decreases monotonically with time, and component i is among the top k features by feature importance score, then the component is determined to be a negative factor. The intersection of the negative factors found in the low-response and high-response data sets is then taken, and the components in this intersection may be deleted in subsequent experiments. All negative factors meeting the above conditions form the negative factor set. The daily response in this example is the cell density in the medium on day 3, day 5 and day 7, respectively.
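The combined criterion (all coefficients negative, strictly decreasing over the days, present in both data sets, and within the top k features) can be sketched as follows. The function names and the coefficient values, given for days 3, 5 and 7, are hypothetical illustrations.

```python
def is_negative_and_decreasing(row):
    """True if every coefficient is below 0 and each day is lower than the previous one."""
    return all(r < 0 for r in row) and all(a > b for a, b in zip(row, row[1:]))


def negative_factor_set(high_matrix, low_matrix, top_k_features):
    """Components qualifying in both data sets that are also top-k important."""
    high = {n for n, row in high_matrix.items() if is_negative_and_decreasing(row)}
    low = {n for n, row in low_matrix.items() if is_negative_and_decreasing(row)}
    return high & low & set(top_k_features)


# Hypothetical correlation rows for days 3, 5 and 7 (illustration only).
high_matrix = {
    "x1": [-0.1, -0.3, -0.5],   # negative and decreasing
    "x2": [-0.2, -0.1, -0.4],   # negative but not monotonic
    "x3": [-0.2, -0.3, -0.6],   # negative and decreasing
    "x4": [0.1, -0.2, -0.3],    # positive on the first day
}
low_matrix = {
    "x1": [-0.2, -0.4, -0.6],
    "x2": [-0.1, -0.2, -0.3],
    "x3": [-0.1, -0.2, -0.3],
    "x4": [-0.1, -0.2, -0.3],
}
result = negative_factor_set(high_matrix, low_matrix, top_k_features=["x1", "x2", "x4"])
print(result)  # {'x1'}
```

Here "x3" passes the correlation test in both data sets but is not among the top k features, so it is not reported as a negative factor.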

5) Designing an experimental scheme for eliminating negative factors, wherein the experimental scheme for eliminating the negative factors comprises three types:

Scheme 1: all components of the negative factor collection were deleted from the media formulation.

Scheme 2: the components in the negative factor set are deleted from the medium formulation one by one.

Scheme 3: according to known experience, if components that cannot be deleted, such as essential amino acids, are present among the negative factors, these components are removed from the negative factor set and the negative factor set is updated. All components in the updated negative factor set are then deleted from the culture medium formula.

6) Biological experiments are carried out again on the culture medium from which the negative factor components have been removed, the response value of the culture medium is measured, the effect of the removal is checked, and the components that can be removed are finally determined. In this example, 7 negative factors were screened out; biological experimental verification showed that removing 4 of these components increased the cell density in the medium by approximately 10%.

Table 1 shows the results of the component deletion experiments.

In Table 1, experiment 8 is the control group, i.e., a medium without any component deleted. One negative factor was deleted in each of experiments 1, 2, 3 and 4. It can be seen that the average culture effect of the medium improved by about 13% after deletion of components X1, X2, X3 and X4.

Experiments 5-7 in the above table also show that the method of machine learning plus correlation analysis cannot guarantee with 100% certainty that every selected negative factor is in fact negative. Experimental verification is therefore still required. In addition, since the cell density of experiment 7 differed from that of experiment 8 (control group) by less than 10%, the cell density was considered not to decrease significantly after deletion, and the component could therefore be deleted.

Table 2 shows a correlation coefficient matrix

Description: the relationship of the daily response is omitted from the table, and only the correlation coefficient relationship of the response of the last day is intercepted.

As can be seen from table 2 above, each negative factor is inversely related to Y (cell density).

Further description is provided below in connection with alternative embodiments.

Example 3

This embodiment provides a device for optimizing the composition of a culture medium, as shown in fig. 2, the device comprising: a model building module 10, an important feature selection module 20, a negative factor selection module 30, an intersection module 40, and a culling module 50, wherein,

A model building module 10 configured to build a machine learning model with the set of medium components as input features and the daily response as a target value, and calculate a correlation coefficient, wherein the response includes a cell expression amount and/or a cell density;

An important feature selection module 20 configured to calculate feature importance of each input feature and pick the top k terms of the feature importance score, categorized as a first set;

A negative factor selection module 30 configured to record all components for which the response is a negative correlation coefficient as negative factors, and to classify all negative factors into a second set;

An intersection module 40 arranged to take the intersection of the first set and the second set to obtain a negative factor set;

A culling module 50 is arranged to cull components of the one or more negative factor sets from the set of media components to obtain an optimized media component.

It should be noted that the machine learning model may be reasonably selected according to various known machine learning algorithms, and is preferably a regression analysis model in the present application. Preferably, the calculation method of the correlation coefficient includes, but is not limited to, a partial least square method or a Pearson correlation coefficient.

Optionally, the negative factor selection module includes: a data set dividing module configured to divide the sample data set into a first response volume data set and a second response volume data set, wherein the response volume of the first response volume data set is higher than the corresponding response volume in the second response volume data set; a correlation coefficient matrix calculation module configured to calculate a correlation coefficient matrix of each input feature and the daily response in the first response data set and the second response data set; and the screening module is arranged for screening out all components with the correlation coefficient smaller than 0 and monotonically decreasing along with time according to the correlation coefficient matrix, and recording the components as a second set.

Optionally, sample data with the first 20% -30% of the response is divided into a first response data set, and the rest is divided into a second response data set.

Optionally, the apparatus further comprises an experiment verification module configured to perform a biological experiment on the optimized medium composition after the negative factor set is removed from the medium composition set, and measure the response of the optimized medium composition, thereby determining the final negative factor.

Optionally, the culling module includes at least one of: a culling sub-module 1 arranged to delete all components of the negative factor set from the culture medium component set; a culling sub-module 2 arranged to delete the components of the negative factor set from the culture medium components one by one; and a culling sub-module 3 arranged to remove known essential components from the negative factor set according to known information to obtain an updated negative factor set, and to delete all components in the updated negative factor set from the culture medium component set.

Example 4

This embodiment provides a computer readable storage medium comprising a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute any one of the above methods for optimizing a culture medium.

A processor is also provided for running a program, wherein the program, when run, performs any one of the above methods for optimizing a culture medium.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required for the present invention.

From the above description of the embodiments, it will be clear to those skilled in the art that the present application may be implemented by means of software together with hardware devices such as detection devices. With such understanding, portions of the data processing in the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, magnetic disk, optical disk, etc., and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods of the various embodiments or portions of embodiments of the application.

The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The method provided by the application can be executed in a terminal, a computer terminal, or a similar computing device. Taking execution on a terminal as an example, FIG. 3 is a block diagram of the hardware structure of a terminal for the method for optimizing culture medium components according to an embodiment of the present application. As shown in FIG. 3, the terminal may include one or more processors 102 (only one is shown in FIG. 3) and a memory 104 for storing data, and, optionally, a transmission device 106 for communication functions and an input-output device 108. The processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA). It will be appreciated by those skilled in the art that the structure shown in FIG. 3 is merely illustrative and is not intended to limit the structure of the terminal. For example, the terminal may include more or fewer components than shown in FIG. 3, or have a different configuration from that shown in FIG. 3.

The memory 104 may be used to store computer programs, such as software programs and modules of application software, for example computer programs corresponding to the model building, feature importance calculation, and negative factor selection methods in the embodiments of the present invention. The processor 102 executes the computer programs stored in the memory 104 to perform various functional applications and data processing, i.e., to implement the methods described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.

It will be apparent to those skilled in the art that some of the modules or steps of the application described above may be implemented on a general-purpose computing device. They may be concentrated on a single computing device or distributed across a network of computing devices. Alternatively, they may be implemented in program code executable by a computing device, so that they can be stored in a memory device and executed by the computing device; or they may be fabricated as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.

From the above description, it can be seen that the above embodiments of the present application achieve the following technical effects: the feature importance calculated by machine learning for each feature is combined with correlation analysis. Interactions among different components in the culture medium are thereby taken into account and the accuracy of component screening is improved, so that good results can be obtained even with small sample sizes. Modeling and analyzing the data with machine learning shortens the research and development cycle of culture medium optimization and reduces repeated experiments.
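The combination of feature importance and correlation analysis summarized above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the model used in the embodiments: the importance score is approximated by the absolute standardized coefficients of an ordinary least-squares regression, the correlation is the Pearson coefficient against a single response vector, and all function names are hypothetical. Every component column is assumed to vary across samples (a constant column would give a zero standard deviation).

```python
import numpy as np

def feature_importance(X, y):
    """Importance proxy (assumption): |standardized least-squares coefficient|."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # requires non-constant columns
    ys = y - y.mean()
    coef, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return np.abs(coef)

def pearson_with_response(X, y):
    """Pearson correlation of each component column with the response."""
    return np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

def negative_factor_set(X, y, names, k):
    """Intersect the top-k important components with the negatively correlated ones."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    importance = feature_importance(X, y)
    first_set = {names[j] for j in np.argsort(-importance)[:k]}
    corr = pearson_with_response(X, y)
    second_set = {names[j] for j in range(len(names)) if corr[j] < 0}
    return first_set & second_set
```

A component is returned only if it is both among the k most important features and negatively correlated with the response, so an influential but beneficial component and a negatively correlated but unimportant one are both left in the medium.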

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of medium composition optimization, the method comprising:

Taking each component of the culture medium as an input characteristic, taking the daily response as a target value, establishing a machine learning model, and calculating a correlation coefficient;

Calculating the feature importance of each input feature, picking out the first k items of the feature importance scores, and classifying the first k items into a first set;

Marking all components with negative correlation coefficients for the daily response as negative factors, and classifying all the negative factors into a second set;

taking the intersection of the first set and the second set to obtain a negative factor set;

Removing one or more components in the negative factor set from the culture medium component set to obtain optimized culture medium components;

Wherein marking all components with negative correlation coefficients for the daily response as negative factors and classifying all the negative factors into a second set comprises: selecting the negative factors whose negative correlation coefficients monotonically decrease over time, and classifying them into the second set;

Wherein the selecting of the negative factors includes:

Dividing a sample data set into a first response volume data set and a second response volume data set, wherein the daily response volume of the first response volume data set is higher than the corresponding daily response volume of the second response volume data set;

Calculating a correlation coefficient matrix of each of the input features and the daily response in the first response data set and the second response data set;

and screening out all components with the correlation coefficient smaller than 0 and monotonically decreasing with time according to the correlation coefficient matrix, and marking the components as the second set.

2. The method of claim 1, wherein the daily response comprises at least one of: cell expression level, cell density and cell viability.

3. The method of claim 1, wherein the machine learning model is a regression analysis model.

4. The method of claim 1, wherein the correlation coefficient calculation method includes a partial least squares method or a Pearson correlation coefficient.

5. The method of claim 1, wherein the top 20%-30% of the sample data, ranked by daily response, is assigned to the first response data set, and the remainder is assigned to the second response data set.

6. The method of claim 1, wherein after obtaining the optimized medium composition, the method further comprises the step of experimental validation.

7. The method of claim 6, wherein the negative factor set is removed from the culture medium component set in at least one of the following ways:

1) Deleting all components in the negative factor set from the culture medium component set;

2) Deleting components in the negative factor set from the culture medium components one by one;

3) Removing known essential components from the negative factor set according to known information to obtain an updated negative factor set, and deleting all components in the updated negative factor set from the culture medium component set.

8. An apparatus for optimizing the composition of a culture medium, the apparatus comprising:

The model building module is used for building a machine learning model by taking the culture medium component set as an input characteristic and the daily response as a target value and calculating a correlation coefficient;

The important feature selection module is used for calculating the feature importance of each input feature, picking out the first k items of the feature importance scores and classifying the first k items into a first set;

a negative factor selection module configured to record all components having negative correlation coefficients for the daily responses as negative factors, and to classify all the negative factors into a second set;

An intersection module configured to take an intersection of the first set and the second set to obtain a negative factor set;

A rejecting module configured to reject one or more components of the negative factor set from the culture medium component set to obtain an optimized culture medium component;

wherein, the negative factor selection module includes:

A data set dividing module arranged to divide a sample data set into a first and a second response volume data set, wherein the daily response volume of the first response volume data set is higher than the corresponding daily response volume in the second response volume data set;

A correlation coefficient matrix calculation module configured to calculate a correlation coefficient matrix for each of the input features and the daily response in the first response data set and the second response data set;

and the screening module is arranged for screening out all components with the correlation coefficient smaller than 0 and monotonically decreasing along with time according to the correlation coefficient matrix, and recording the components as the second set.

9. The apparatus of claim 8, wherein the daily response comprises at least one of: cell expression level, cell density and cell viability.

10. The apparatus of claim 8, wherein the machine learning model is a regression analysis model.

11. The apparatus of claim 8, wherein the correlation coefficient calculation method includes a partial least squares method or a Pearson correlation coefficient.

12. The apparatus of claim 8, wherein the top 20%-30% of the sample data, ranked by daily response, is assigned to the first response data set, and the remainder is assigned to the second response data set.

13. The apparatus of claim 8, further comprising an experiment verification module configured to determine a final negative factor by performing a biological experiment on the optimized medium composition after the negative factor set is removed from the medium composition set, and measuring the daily response of the optimized medium composition.

14. The apparatus of claim 8, wherein the rejecting module comprises at least one of the following:

A rejecting sub-module 1 configured to delete all components of the negative factor set from the culture medium component set;

A rejecting sub-module 2 configured to delete the components of the negative factor set from the culture medium component set one by one;

A rejecting sub-module 3 configured to remove known essential components from the negative factor set according to known information to obtain an updated negative factor set, and to delete all components in the updated negative factor set from the culture medium component set.

15. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein the program when run controls a device in which the storage medium is located to perform the method of optimizing the composition of a medium according to any one of claims 1 to 7.

16. A processor for running a program, wherein the program, when run, performs the method of optimizing the composition of a medium according to any one of claims 1 to 7.