The Software-Landscape in (Prote)Omic Research

Understanding the complex mechanisms in (micro)organisms has developed into a major challenge, known as systems biology. The associated “-omic” research fields have received wide attention in recent years. New terms were coined, such as genome, transcriptome, proteome and metabolome; in total, more than 140 [1] and 250 [2] terms ending in the suffix “-ome” or “-omics” have been counted so far. Progress in analytical technologies, and the resulting possibility of obtaining more sensitive, accurate and robust results, was a prerequisite for gaining in-depth insight into organisms.

Not only are organisms complex; so are the analytical hardware used to obtain data and the software landscape that supports data interpretation. Figure 1 illustrates how complex the options for analysing the "-omes" and processing their data can be [3]. To analyse the proteome of a microorganism, for example, proteins can be separated by liquid chromatography (LC) or 2D electrophoresis (2DE). Using software such as Melanie or 2DHunt, the results can be interpreted and further conclusions about the proteome can be drawn. An alternative approach employs the fragmentation of proteins (i.e. their tryptic peptides) in mass spectrometry (i.e. LC-MS), using software such as MASCOT, PEPSEA or Peptidemass to obtain information about the proteome of the microorganism (as shown in Figure 1).
A good example of the integration of new informatics and technologies in "-ome" research fields is given in a recently published review [4]. The review describes how metabolomics research depends on bioinformatics for streamlining data acquisition (e.g. data alignment, automated metabolomics and cloud-based metabolomics), feature analysis (e.g. mass spectral annotation, statistical analysis and targeted validation), pathway analysis and the biological context. It appears that bioinformatics helps researchers to identify metabolite features from LC-MS data and to describe their biological roles by identifying their involvement in chemical pathways.
Using the example of mass spectrometric technologies and their associated software tools in proteomics, we want to demonstrate the general complexity of the software landscape in "-omics" research up to systems biology. The term proteome first appeared in 1994, and since then the number of publications has risen sharply. In 2014, almost 7,400 publications with the keyword "proteomic" were listed in PubMed (Figure 2a), about 2,800 with the keywords "proteomic" and "mass spectrometry", about 350 with "proteomic" and "software", and about 200 including all three terms "proteomic", "mass spectrometry" and "software". In all categories the number of publications has doubled since 2004. This clearly shows that the combination of analytical methods for proteomics and software development has evolved into an important field of research, resulting in a large number of available software tools.

Abstract
(Bio)informatics plays a major role in (prote)omic research experiments and applications. Analysis of an entire proteome, including protein identification, protein quantification, detection of biological pathways, metabolite identification and more, is not possible without software solutions for analyzing the resulting huge data sets. In the last decade, many software tools, platforms and databases have been developed by vendors of analytical hardware, as well as by freeware developers and the open source software community. Some of these software packages are highly specialized for one (omic) topic, for example genomics, proteomics, interactomics or metabolomics. Other software tools and platforms can be applied in a more general manner, e.g. for generating workflows, for data conversion and data management, or for statistics. Nowadays the main problem is not to find a way to analyze the experimental data, but to identify the most suitable software for this purpose in the vast software landscape. This review focuses on the following question: how complex is the link between biology, analysis and (bio)informatics, and how complex is the variety of software tools to be used for scientific investigations, starting from microorganisms up to the detection of a proteome? The main emphasis is on the variety of software for (LC-)MS(/MS) proteomics. On the World Wide Web, sites like ExPASy show extensive lists of proteomics software, leaving it to the users to identify which software actually serves their purposes.
First we consider the huge variability of software in the field of proteomics research. Then we take a closer look at the variability of MS data and the resulting incompatibilities of software tools. We give an overview of commonly used software technologies and finally end with the question of whether open source software would add more value to this field.
... research field. Usually there is an initial need for the construction of databases and the archiving of results derived from proteomic assays. Furthermore, there is a demand for tools for protein identification and quantification, for software that models predictions for reactions, and much more besides. Overall there are many open problems and challenges for software development in proteomics. Vendors of analytical instruments were the first to develop appropriate software platforms. Later on, individual proteomic research groups developed or asked for further methods or tools that were not in the classical focus of these initial platforms. Nowadays an increasing number of bioinformaticians are engaged in the development of software for the '-omics' research fields, and this community may grow further in the coming years. In addition to this heterogeneity of developers, one has to consider that software development technologies are subject to rapid change. Typically, vendors develop closed source software, which is often not compatible with other platforms, whereas researchers and the bioinformatics community mostly develop freeware or open source software. This currently leads to a large number of software tools, high variability, an unfortunate incompatibility of analytical data, and analytical software and tools that are hard to exchange for one another. Furthermore, interoperability of software is not always achieved, so the use of different programming languages, different development platforms and different development philosophies needs to be considered.

Variability and Incompatibility of Analytical MS data
In mass spectrometry, several open data formats have been established so far, in response to the often proprietary formats used by commercial instrument manufacturers and software platforms.
Figure 2: a) PubMed search with the keywords "Proteomic" (blue), "Mass Spectrometry" (red) and "Software" (green), and with an AND connection between "Proteomic" and the other keywords (violet), in "All Fields"; b) PubMed search with the keywords "Enzyme", "Software" and "Mass Spectrometry" combined with AND in "Title/Abstract".
In recent years, the Proteomics Standards Initiative (PSI) of the Human Proteome Organization (HUPO) has developed various standards (e.g. mzData; see Table 1, "1. Data formats") based on the "Extensible Markup Language" (XML), a text-based language for describing hierarchical data. XML formats are very well suited for data exchange because of the structured composition of these text documents (i.e. clearly defined elements with markers for beginning and end, respectively).
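The structured, hierarchical nature of XML described above can be illustrated with a minimal sketch. The document below is a deliberately simplified, hypothetical example; its element and attribute names are assumptions chosen for readability and do not follow the real mzData or mzML schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified document; NOT the real mzData or mzML schema.
DOCUMENT = """
<run instrument="example-ms">
  <spectrum id="1" msLevel="1">
    <peak mz="445.12" intensity="1200.0"/>
    <peak mz="446.12" intensity="310.5"/>
  </spectrum>
  <spectrum id="2" msLevel="2">
    <peak mz="175.11" intensity="88.0"/>
  </spectrum>
</run>
"""


def read_spectra(xml_text):
    """Return {spectrum id: [(mz, intensity), ...]} from the toy document."""
    root = ET.fromstring(xml_text)
    spectra = {}
    # Every element has an explicit start and end marker, so nested data
    # can be walked without knowing which software wrote the file.
    for spectrum in root.findall("spectrum"):
        peaks = [(float(p.get("mz")), float(p.get("intensity")))
                 for p in spectrum.findall("peak")]
        spectra[spectrum.get("id")] = peaks
    return spectra


spectra = read_spectra(DOCUMENT)
for spectrum_id, peaks in spectra.items():
    print(spectrum_id, peaks)
```

Because the element boundaries are explicit, any program with a standard XML parser can read such a file, which is what makes XML-based formats attractive for exchange between otherwise incompatible tools.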
Nowadays, several mass spectrometer manufacturers offer software solutions for the evaluation of proteomic experiments in combination with their devices (Table 1, "2. Commercial software"). Commercial software solutions generally work with proprietary file formats. However, under pressure from the large "proteomics community", many commercial manufacturers and distributors offer an exchange format based on XML (see above) and/or deliver their software together with a so-called data converter. This necessity arose from the fact that many laboratories and institutes kept developing new methods for analysing MS or MS/MS data that the standard software could not yet support. New algorithms and specific software tools were developed in this way. The proprietary data formats had to be decoded in order to test the new algorithms with existing data. Through constant development, free data converters were created, such as ReAdW (.raw converter), mzWiff (.wiff converter), MassWolf (.raw converter) and Trapper (converter for .d data directories). These were all merged into the software msconvert, which is part of the ProteoWizard libraries [5] (Table 1). There are also other open source solutions, such as OpenChrom [6], that support additional manufacturer-specific formats.
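The relation between proprietary file types and the free converters named above can be written down as a small lookup. This is only an illustrative sketch: the mapping restates the pairs given in the text, the function name `converters_for` is hypothetical, and no converter is actually invoked.

```python
from pathlib import PurePath

# File extension -> free converters named in the text for that extension.
# Note that ".raw" is used by more than one vendor, hence two candidates.
CONVERTERS = {
    ".raw": ["ReAdW", "MassWolf"],
    ".wiff": ["mzWiff"],
    ".d": ["Trapper"],
}


def converters_for(path):
    """Return the candidate converters for a data file or data directory."""
    suffix = PurePath(path).suffix.lower()
    return CONVERTERS.get(suffix, [])


print(converters_for("run01.RAW"))
print(converters_for("sample.d"))
print(converters_for("notes.txt"))
```

The ambiguous ".raw" entry shows why a single extension is not enough to identify a proprietary format, and why a unified converter is convenient for users.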
The demand for new software is often driven by the needs of users from individual research groups. They work on very specific topics and therefore look for automated solutions to evaluate their data. Usually these users do not wish to evaluate the data manually with a spreadsheet program such as Microsoft Excel. Purpose-built software tools and algorithms are therefore developed. An example of very specialized software is Achroma, which was developed for the evaluation of continuous flow LC-MS enzymatic assays in the field of functional proteomics [7,8]. Combining well-known software and algorithms with such specific ones could, however, be extremely valuable. This is a big challenge for the future, especially with regard to enabling interoperability between these software platforms and tools.
To some extent, interoperability is already provided by data exchange formats (Table 1, "1. Data formats"). But how can this be applied to the software tools and platforms? For this purpose it makes sense to examine the programs from a software development point of view. Because of the large number of different solutions, it is not possible to unite them all in one platform. On the other hand, there are of course many solutions that use the same or similar approaches for evaluating data. They resemble one another in their algorithms or overlap in their solution approaches. Among this multitude of solutions, some act as a "pipeline", such as the Trans-Proteomic Pipeline (TPP) software [9] (Figure 3) or TOPP [10]. These "pipelines" specialise in connecting existing tools into a processing chain that is executed automatically in sequence.
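The pipeline idea, i.e. chaining independent tools so that the output of one step becomes the input of the next, can be sketched in a few lines. The three processing steps below are hypothetical placeholders and are not taken from TPP or TOPP; only the chaining mechanism is the point.

```python
from functools import reduce


def remove_noise(peaks, threshold=10.0):
    """Drop peaks whose intensity is below the threshold."""
    return [(mz, i) for mz, i in peaks if i >= threshold]


def normalise(peaks):
    """Scale intensities so that the largest peak has intensity 1.0."""
    top = max((i for _, i in peaks), default=0.0)
    if top == 0.0:
        return peaks
    return [(mz, i / top) for mz, i in peaks]


def pick_top(peaks, n=2):
    """Keep the n most intense peaks, most intense first."""
    return sorted(peaks, key=lambda p: p[1], reverse=True)[:n]


def run_pipeline(data, steps):
    """Feed the data through each step in order."""
    return reduce(lambda current, step: step(current), steps, data)


peaks = [(100.0, 5.0), (200.0, 50.0), (300.0, 100.0), (400.0, 20.0)]
result = run_pipeline(peaks, [remove_noise, normalise, pick_top])
print(result)
```

Each step only needs to agree on the data passed between steps, here a list of (m/z, intensity) pairs. In real pipelines this agreement is typically provided by a shared exchange format.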

Large Number and High Variability of Software Tools for Data Evaluation
With regard to the tools depicted in Figure 1, a distinction must be made between entire software systems and "small" software tools for particular tasks, such as individual specific algorithms, for example search algorithms for protein databases. Many companies offer complete software packages that provide an easy-to-use graphical user interface and enable the execution of several steps within a single software system. To keep the figure clear, only a few tools are depicted. In reality these relations are far more complex.
The proteomics website of ExPASy [11] gives a good overview of the diversity of the software landscape in proteomics. Under the heading of proteomics, it lists 31 databases and 240 software tools in eight categories: 1) protein sequence identification, 2) mass spectrometry and 2-DE data, 3) protein characterisation and function, 4) families, patterns and profiles, 5) post-translational modifications, 6) protein structure, 7) protein-protein interaction, and 8) similarity search/alignment. However, even the full ExPASy list does not cover the complete range. For example, the proteomic tools site [12] of the Seattle Proteome Center contains 32 software tools and the software site [13] of the Pacific Northwest National Laboratory lists 52. Besides the software systems belonging to manufacturers of instruments such as mass spectrometers, there is by now a large open source community and many freely accessible software tools (freeware), which often originate in research institutes or universities. Most of these tools are not listed by ExPASy, since they are highly specific or have not yet reached the level of popularity required for a link. In addition, the free online encyclopaedia Wikipedia maintains several specific lists of proteomics software. An example is the "List of mass spectrometry software" [14]. It contains 76 commercial and free software systems, categorized into three main groups ("Proteomics software", "MS/MS peptide quantification" and "Other software"). Another such list is "ms-utils" [15] (229 software tools listed). The software lists in Wikipedia are not a reliable scientific source, but they display the variety of the software landscape. Experience shows that many user searches start from Google and Wikipedia.
Many software tools have been developed in the area of protein/peptide identification and quantification, the two main topics in proteomics. Enumerating all of them would go beyond the scope of this paper. However, Table 1 gives an extensive, though not exhaustive, list of relevant commercial and free software platforms and tools. Only part of the software listed there can be found in ExPASy, so the table complements it and further illustrates the diversity of software solutions for evaluating proteomic experiments. The main focus of this list is on the proteomics field as well as on the programming language and platform.
The demand for software to evaluate specific, possibly atypical MS data is still high. Particularly in functional proteomics, for the data output resulting from batch or continuous flow enzyme assays [16], there is still a great need to optimise the evaluation. A PubMed search with the keywords "enzyme", "software" and "mass spectrometry", for example, delivered 56 results overall since 2002 (Figure 2b). The highest level was reached in 2014 with 13 publications, and 39 of the 56 publications have been registered since 2009. This search also shows, however, that the development of software for specialized topics like "MS based enzyme assays" is still a very new field of research, which is not in the focus of mainstream proteomics research.
Finally, on the World Wide Web one can find plenty of lists of application-specific software, as well as platforms that list software tools and systems, but no platform that helps researchers and analytical scientists to identify the most suitable software for their problems (without searching the internet for hours). One idea for solving this problem would be a web-based search platform in which researchers describe their analytical evaluation tasks in keywords and, based on this description, the platform returns a set of currently relevant software tools collected from several sites on the World Wide Web.
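The proposed keyword-based search can be sketched as a simple overlap ranking between the user's description and keywords attached to each tool. The tool names come from this review, but the keyword sets are our own illustrative assumptions, and the function `search` is hypothetical; a real platform would collect and curate such metadata from several sites.

```python
# Tool name -> descriptive keywords (illustrative assumptions only).
CATALOGUE = {
    "Melanie": {"2-de", "gel", "proteome"},
    "MASCOT": {"ms", "peptide", "identification", "proteome"},
    "msconvert": {"ms", "data", "conversion", "converter"},
    "OpenChrom": {"gc/ms", "lc/ms", "chromatography", "ms"},
    "Achroma": {"lc-ms", "enzyme", "assay", "functional", "proteomics"},
}


def search(description, catalogue=CATALOGUE):
    """Rank tools by the number of keywords shared with the description."""
    words = set(description.lower().split())
    hits = [(len(words & tags), name) for name, tags in catalogue.items()]
    hits = [h for h in hits if h[0] > 0]
    # Highest overlap first; ties are broken alphabetically by tool name.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [name for _, name in hits]


print(search("enzyme assay LC-MS"))
print(search("ms data conversion"))
```

Exact keyword matching is the simplest possible choice. It already shows the weakness of the approach (for instance, "lc-ms" and "lc/ms" do not match), which is why a controlled vocabulary would probably be needed in practice.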

The Applied Software Languages and Development Platforms
The preceding sections show that there is already a large diversity of software solutions for proteomics. They range from data exchange formats, through appropriate converters and solutions for the management of proteomic data, to solutions developed specifically for proteomics and software that combines several software solutions into a processing sequence (Table 1). However, users cannot easily combine these programs, since in many cases they are not compatible because of the different technologies used for their development. To understand this, it is necessary to take a closer look at software architecture and the methods used for software development.
First of all, the underlying programming language has to be considered. Two main groups of programming languages are popular for the development of such software solutions: on the one hand the languages of the C family (C, C++ and C#, of which C++ is the most widespread), and on the other hand the programming language Java.
For historical reasons the majority of commercial systems are programmed in C and C++, since these systems generally originate from the instrument manufacturers and from the classical study of mass spectrometers. C and C++ were for a long time the standard languages for device-oriented application development and are still used in that area today. The source code is compiled into machine code, which means that the programs depend strongly on specific hardware features. Many of these applications therefore have the disadvantage that they run exclusively on specific platforms, i.e. hardware combined with its operating system (like Microsoft Windows or (exclusively) Linux).
Next, one has to consider the specific approach to software development. A large body of free or even open source software was developed with the emergence of the "proteomics community". In the case of open source, the entire source code is publicly available, so many different people can participate and help in development and support. These freely accessible software solutions are frequently platform independent or at least run on both Linux and Windows operating systems. As a result, in many cases these tools are used for development and research. Table 1, "3. Free and open source platforms and software tools", shows clearly that in this area the programming language Java and its auxiliaries (sixteen software solutions are listed) is becoming the standard, although C++ (ten entries) is still widespread. Particularly among new software solutions developed since 2010, however, the majority use Java. The largest advantage of Java is that, thanks to its use of a virtual machine, the software can be run independently of the underlying operating system. Python (seven entries) is another programming language that is not compiled into machine code but interpreted, and it is by now widely used in the natural sciences, since Python is well adapted to batch processing and therefore very suitable for programming pipelines. Table 1 (third part) also lists web-based solutions with nine entries, which may gain more interest from the bioinformatics and proteomics community in the future, particularly with the continued worldwide development of web-based programming (Web 2.0) and cloud solutions.
Furthermore, when we consider the extendability and compatibility of software, we have to take into account the underlying technologies for building a software architecture. Many so-called "Integrated Development Environments" (IDEs), e.g. Eclipse [17] or NetBeans [18], provide an elegant mechanism for extending an application. The main software package then contains just the core functionality. Optional features can be made available using so-called plug-in technology. A plug-in is an encapsulated piece of software with a well-defined interface that can be plugged into an existing software application in order to enhance its functionality (Figure 4A). This enables customization of applications without the need to update the full application. A further benefit of plug-in technology is the possibility of re-using plug-ins in other software projects. This works best if the underlying technology is the same, e.g. when using Eclipse and the Eclipse Marketplace. In this way the programming environment already provides most of the parts necessary to create a modular project. A tool of this kind, which allows programmers to easily integrate further software components and in which most of the processing is executed on the client side, is called a Rich Client Platform (RCP).
Figure 4 illustrates this using the Eclipse IDE platform as an example. The left-hand side (Figure 4A) schematically shows that the entire application is composed of plug-ins. Even the Eclipse core, which constitutes the starting point for application development, is itself a plug-in, the so-called entry point of the application. Figure 4B shows the concept of an analytical software platform called openMASP (open Modular Analytical Software Platform) [19]: the green area represents the layer that interacts with the users, i.e. the freely programmable functional plug-ins in connection with the graphical user interface (the UI). These functions are based on the implementation of various data structures. The blue area in the middle comprises the modules that are responsible for the connection to the core plug-ins belonging to Eclipse (or another RCP platform) and to the "Java Virtual Machine", which executes the Java bytecode. The "Java Virtual Machine" provides an execution environment that is independent of the underlying operating system. In this way, so-called "cross-platform" applications are achieved.
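The plug-in principle described above (a small core plus encapsulated extensions behind a well-defined interface) can be sketched independently of Eclipse. The following Python sketch is only an analogy under our own assumptions: it does not use Eclipse RCP or OSGi, and all class names are hypothetical.

```python
from abc import ABC, abstractmethod


class Plugin(ABC):
    """The well-defined interface every plug-in has to implement."""

    name = "unnamed"

    @abstractmethod
    def process(self, values):
        """Process a list of signal values and return the result."""


class Application:
    """Core application: it only knows the interface, not the plug-ins."""

    def __init__(self):
        self._plugins = {}

    def register(self, plugin):
        self._plugins[plugin.name] = plugin

    def available(self):
        return sorted(self._plugins)

    def run(self, name, values):
        return self._plugins[name].process(values)


class BaselineSubtraction(Plugin):
    name = "baseline"

    def process(self, values):
        lowest = min(values)
        return [v - lowest for v in values]


class PeakMaximum(Plugin):
    name = "peak-max"

    def process(self, values):
        return max(values)


app = Application()
app.register(BaselineSubtraction())
app.register(PeakMaximum())
print(app.available())
print(app.run("baseline", [3, 5, 9]))
```

New functionality is added by registering another `Plugin` subclass; the `Application` class itself stays unchanged, which mirrors the customization without a full update mentioned in the text.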
A good example of such an extendable application is OpenChrom [6], a software for chromatography and mass spectrometry. It is built with Eclipse RCP technology. Its main focus is on handling mass spectrometry files. For example, OpenChrom can handle many vendor formats natively and can be used to analyse GC/MS and LC/MS data. It contains a growing number of processing and analysis procedures. Another example is Maui, which is built on top of the NetBeans RCP technology. It provides a user interface for Maltcms (Modular Application Toolkit for Chromatography Mass-Spectrometry) [20], a framework mainly intended for developers in the domain of bioinformatics for metabolomics and proteomics. It offers integrated functions for handling different file formats, such as mzXML, mzData and netCDF. The data are processed using a freely configurable processing queue. Maui gives users a visual interface to control the Maltcms framework and to display its processing results. ... for free and adapt it to personal needs, provided that the technical skills are at hand. It is possible to handle files and data using the project's infrastructure and to integrate further software modules for analysis easily. The standard analysis modules of the applications can thus be enriched through customization, giving users more potential in data analysis and allowing individually customized solutions. Both projects are part of the Eclipse Science Working Group [21].
Besides local and server-based platforms, online databases and tools are commonly used in proteomics. These are available via the internet and most of them are free to use. Some of them provide access only through a browser, e.g. www.chemicalize.org. Others offer a so-called Web Service. This is an interface designed for machine-to-machine communication. The service provider defines the interface and decides on the communication protocol. Any computer can connect to such a Web Service via the protocol and request the service, for example a search in a database. The result is sent back to the requesting computer, which must be capable of interpreting it. In this way it is possible to use data from online services in desktop applications. When using web-based applications or databases, one has to consider license issues. Many free data sets are provided under a specific license, e.g. Creative Commons, which might restrict data usage to non-commercial applications. Users are sometimes not sufficiently aware of these issues.
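The request and response cycle of a Web Service can be sketched with a toy service running on the local machine. Everything here is a hypothetical stand-in: the `/mass` endpoint, the JSON reply layout and the two-entry "database" (average molecular masses in g/mol) are assumptions for illustration and do not describe any real online service.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse
from urllib.request import ProxyHandler, build_opener

# Hypothetical database held by the service provider (g/mol).
MASSES = {"glycine": 75.07, "alanine": 89.09}


class MassService(BaseHTTPRequestHandler):
    """Provider side: defines the interface (GET /mass?name=...)."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        name = query.get("name", [""])[0]
        body = json.dumps({"name": name, "mass": MASSES.get(name)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def lookup(name):
    """Client side: send a request and interpret the JSON result."""
    server = HTTPServer(("127.0.0.1", 0), MassService)  # any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/mass?name={name}"
        opener = build_opener(ProxyHandler({}))  # bypass any system proxy
        with opener.open(url, timeout=5) as response:
            return json.loads(response.read().decode())
    finally:
        server.shutdown()
        server.server_close()


print(lookup("glycine"))
```

The client never sees how the provider stores its data; it only relies on the agreed protocol (HTTP) and on being able to interpret the reply (JSON), which is the point made in the paragraph above.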
In recent years, cloud computing has come into public focus. The National Institute of Standards and Technology (NIST) [22] defines it as follows: "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." Using a cloud in modern analytical research may offer the advantage that data can be processed in powerful external processing centres and easily shared with other researchers in research institutions worldwide.
The most interesting service model for cloud computing is so-called SaaS (Software as a Service, Figure 5). The resources for heavy calculations are in the cloud, which is built from the servers and the network of the specific provider. Users can use the high computing power of the cloud service provider to execute their algorithms. The client machines would be used only for configuring and starting computing pipelines, while the execution itself would be carried out on the server machines. After processing, only the results are transferred to the client, which means that such a scenario would require neither a particularly high bandwidth nor a high-end client computer. Analysis processes can be accelerated by the power of the cloud in particular when the algorithms can be scheduled in parallel. Such a service can be used through many different clients, such as a web browser or a program on the client computer. The user has no control over the underlying cloud components.
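What "algorithms that can be scheduled in parallel" means can be shown with a small sketch in which each spectrum is processed independently of the others. The local thread pool below merely stands in for cloud workers, and the summed intensity is a hypothetical placeholder for a real analysis step.

```python
from concurrent.futures import ThreadPoolExecutor


def total_ion_current(spectrum):
    """Placeholder analysis step: sum of all intensities in one spectrum."""
    return sum(spectrum)


def analyse_in_parallel(spectra, workers=4):
    """Process independent spectra concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(total_ion_current, spectra))


spectra = [[1.0, 2.0, 3.0], [10.0, 20.0], [0.5, 0.5]]
print(analyse_in_parallel(spectra))
```

The scheme works because no spectrum depends on the result of another. For a real speed-up of CPU-bound work, the thread pool would have to be replaced by separate processes or by remote machines, which is exactly the resource a cloud provider offers.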
A critical examination of cloud computing may easily lead to serious concerns about data security. Data shared in a public cloud may be accessible to other users or organisations. On the other hand, cloud computing can improve the flow of knowledge through data sharing and can speed up analysis processes. In a community cloud, i.e. a cloud with restricted access as opposed to a public cloud, data can be shared among trusted and cooperating laboratories, so that a substantial benefit may be achieved.

Open Source, Freeware or Commercial Software - What is the Best Way?
Table 1 and the sections above show that many very useful software solutions are available, but that they are often not interoperable. For that reason, researchers usually have to use many different software tools to evaluate proteomics data. In our opinion, a big challenge for the future is to bring commercial, freeware and open source software together and to make them interoperable. A first step in this direction was the founding of the Science Working Group of the Eclipse-based platforms [21]. This group is currently a consortium of 23 open source projects in the field of natural sciences, 6 proprietary projects, 4 miscellaneous projects and 10 software companies. Their aim is to make their products interoperable and interchangeable. This would add value for the users of these products within the Eclipse community.
It remains open, however, how other software tools can be integrated. Our suggestion is to develop an open platform for collecting all relevant software tools in the field of proteomics or biochemistry, similar to the Science WG of Eclipse, but open to all solutions, independent of the programming language or programming platform used. To achieve this, the vendors of analytical hardware would first of all have to open their own data formats for more interoperability. Furthermore, standard interfaces need to be defined and promoted to achieve the required interchangeability between the various software solutions. In the end, research in proteomics could become easier, much more structured and much more reproducible.
It cannot be decided in general which of the two approaches, open source or proprietary, is better or worse. One has to choose the right tool for the right purpose. In highly dynamic areas where requirements change rapidly, open source software has proven to be a good choice, since existing software based on open source can be extended or adapted to new situations without requiring too many expensive resources (such as money or manpower). Furthermore, developing or extending an open source application results in substantially better options for sharing and developing new and innovative algorithms and workflows. The open source approach often leads to the establishment of new communities, which can further extend the software according to new perspectives and ways of working.

Conclusion
We have shown that the software landscape in proteomics research, as an example of the omic fields, is very heterogeneous. Plenty of tools and platforms are available, and it is up to the users to choose the best software for analyzing their proteomics data. Data exchange and interoperability between software tools are problematic and, last but not least, the software is developed with different programming languages and different development strategies. Three conclusions can be drawn to improve this situation. Firstly, the proteomics community should bring researchers and commercial, freeware and open source developers together to make the software solutions interoperable and interchangeable. A good step in this direction was the founding of the Science Working Group of the Eclipse Foundation. Secondly, the definition of standards and interfaces for interoperability between the software solutions should be the next step. The definition of data exchange formats, like mzML, is not enough. Thirdly, the development of a web-based platform (as well as pipelines) that helps researchers to find the best software among the variety of software solutions could help to improve data evaluation.

Figure 1: Depiction of the relations between various analysis techniques and the individual software tools and databases, partially adopted from [3]. The software tools are listed in ExPASy or in Table 1 and Table 2. Further information can be retrieved from there.

Figure 4: A) Plug-in technology illustrated by the example of the Eclipse platform; schematic representation of the connection between plug-ins. B) Example openMASP (open Modular Analytical Software Platform), a demonstration project to show the advantages of Eclipse RCP technology for programming modular analytical software that can be extended at any time; the base is formed by the Eclipse core plug-ins and OSGi plug-ins on top of the Java Virtual Machine; the next layer consists of the data structure plug-ins, and the top layer consists of the functional plug-ins and the user interface.

Figure 5: CDSclipse: Chromatography Data System based on Eclipse, a software project study for a SaaS chromatography data system based on Eclipse RCP technology. The server delivers all functionality to the clients (workstations, LC/MS systems or tablets). A) The left side shows the Eclipse Rich Client and its functions (data import/export, visualization) and its independence from any system (web client, Eclipse plug-in, local client, external software); the right side shows the server and the necessary/possible modules. B) Visualisation of the analytical data flow (green line) and the hardware control flow (yellow line), and some examples of possible hardware setups under control of CDSclipse.

Table 1: List of software tools and platforms for the evaluation of proteomic data obtained by mass spectrometric detection.