Julie Bourbeillon

Post Doc INRIA

Topics

Research topics

In a context where new technologies and equipment allow for the mass treatment of samples and where research teams share more and more of the data they acquire, scientists are facing a major data exploitation problem. More precisely, using this data through data mining tools, or placing it back into a classical experimental approach, requires a preliminary grasp on the information space in order to direct the process. But acquiring this grasp on the data is a complex activity which is seldom supported by current software tools. My research aims at assisting researchers in these difficult information gathering, compilation and analysis tasks.

ProteomeBinders:

Project overview:

ProteomeBinders is a European consortium proposing to establish a comprehensive infrastructure resource of binding molecules for detection of the human proteome, together with tools for their use and applications in studying proteome function and organisation.

Currently there is no pan-European platform for the systematic development and quality control of these essential reagents. The consortium aims to provide a set of consistently characterised binders, required to detect all the relevant human proteins in tissues and fluids in health and disease. As the size of the human proteome is at least an order of magnitude greater than the ~24,000 protein-coding genes known to date, and as several binders against each target are needed for many applications, the scale of our project is potentially immense.

The project will coordinate a European resource by integrating existing infrastructures, reviewing technologies and high-throughput production methods, standardising tools and applications, and establishing a database.

Activities

In the context of this project, I'm participating as an INRIA post-doc in the NA5 networking activities. The aim of this work package is to provide bioinformatics resources for the project. In particular, I'm involved in the NA5.1 networking activity, which is focused on standards, ontology and database schema for ligand binder information.

To facilitate the sharing of ligand binder information, data must be standardised to provide unambiguous descriptions of binders and targets. The objective is to formalise such descriptions in an ontology of binder properties and a set of binder requirements for data presentation and exchange. In addition, the database schema for a central repository of binders against the human proteome has to be developed, providing basic information on each binder/target pair, with links to distributed resources, such as databases maintained by partners.
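As a rough illustration of what "basic information on each binder/target pair, with links to distributed resources" could look like, here is a minimal Python sketch. It is not the ProteomeBinders schema: every class name, field name and identifier below is a hypothetical placeholder chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class ExternalLink:
    """Pointer to a record held in a distributed resource (e.g. a partner database)."""
    resource: str   # hypothetical resource name
    accession: str  # identifier of the record inside that resource


@dataclass
class BinderTargetPair:
    """Basic information on one binder/target pair (all field names are assumptions)."""
    binder_id: str
    binder_type: str  # free text here; an ontology term in a real schema
    target_id: str    # identifier of the targeted protein
    links: List[ExternalLink] = field(default_factory=list)

    def resources(self) -> Set[str]:
        """Names of the distributed resources that hold more detail on this pair."""
        return {link.resource for link in self.links}


# Made-up example record, for illustration only.
pair = BinderTargetPair(
    binder_id="B-0001",
    binder_type="antibody",
    target_id="T-0001",
    links=[ExternalLink("partner_db_a", "A123"), ExternalLink("partner_db_b", "B456")],
)
print(sorted(pair.resources()))
```

Keeping the central record small and delegating detail to links mirrors the idea of a central repository that points to databases maintained by partners, rather than copying their content.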

More precisely, my work is focused on adapting the IntAct data model for ProteomeBinders and defining the ligand-binder ontology, as a basis for reasoning on the large amount of data which should soon be available. The medium-term goal is to develop web services to assist binder producers and users in their everyday tasks, such as planning binder production according to the importance of molecules in molecular pathways or as missing elements in a high-throughput experimental design, choosing the most suitable binder or binders for particular experimental settings and experimental goals, etc.

Links

A few external resources on this topic are worth a look:

Information Synthesis:

Origin of the research

Nowadays information access remains a critical issue for scientists because of the ever rising number of publications, the multiplication of open access data repositories and the increasing number of technologies permitting the mass acquisition of data. A typical example is Tissue Microarray (TMA) technology, which is increasingly used in oncology research and allows for the mass treatment of hundreds of micro-samples on a single histological slide. However, this kind of technology poses two problems:

  • the design of the experiment and in particular the choice of the samples to include in order to answer precise biological questions
  • the use of data acquired during previous experiments to study new biological problems or extract relevant information.

In particular, the second issue, data exploitation, is becoming a classical topic for experimental sciences, in a context where the costs in time and materials of each experiment are exploding and where reusing data generated during previous experiments or by other teams in a new context is becoming common.

Reusing data, however, makes it genuinely difficult for researchers to get a grasp on the data sets, because the data were produced by other teams or were acquired outside the context of validating a particular scientific hypothesis in a precisely delimited experimental area. Yet this preliminary understanding of the data set is a mandatory stage before any more advanced exploitation. Data mining tools have to be directed, and this cannot be done without a minimum knowledge of the data space. Similarly, the data set can be used to pursue studies following a more classical experimental approach, by validating hypotheses on an extract of the data set. The researcher then has to check whether the available information is sufficient to validate a given hypothesis, which again requires a preliminary grasp on the data.

Approach and proposed solution

Getting this grasp on the data, in the perspective considered in my work, involves solving a set of complex problems:

  • search and extraction of the data of interest for a particular study, using potentially multiple and distant data sources,
  • aggregation of the items of interest in a single information pool,
  • organisation of the relevant elements inside a structure facilitating data grasping,
  • presentation of the relevant elements and of their structural organisation.
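
The four problems above can be read as successive stages of a pipeline. The following Python sketch is only an illustration of that reading under simplifying assumptions, not the model or prototype developed in the thesis: records are plain dictionaries, the study goal is reduced to a relevance predicate, and all function names and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def retrieve(sources: Iterable[List[Record]],
             is_relevant: Callable[[Record], bool]) -> List[List[Record]]:
    """Stage 1: search each source and extract the records relevant to the study."""
    return [[r for r in source if is_relevant(r)] for source in sources]


def aggregate(extracts: Iterable[List[Record]]) -> List[Record]:
    """Stage 2: merge the extracts into a single pool, keeping one record per id."""
    pool: Dict[str, Record] = {}
    for extract in extracts:
        for record in extract:
            pool.setdefault(record["id"], record)
    return list(pool.values())


def organise(pool: List[Record], key: str) -> Dict[str, List[Record]]:
    """Stage 3: structure the pool by grouping records on one attribute."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in pool:
        groups[record[key]].append(record)
    return dict(groups)


def present(groups: Dict[str, List[Record]]) -> str:
    """Stage 4: render the groups and their members as plain text."""
    lines = []
    for name in sorted(groups):
        ids = ", ".join(sorted(r["id"] for r in groups[name]))
        lines.append(f"{name}: {ids}")
    return "\n".join(lines)


def synthesise(sources: Iterable[List[Record]],
               is_relevant: Callable[[Record], bool],
               key: str) -> str:
    """Chain the four stages; the goal of the study drives the first one."""
    return present(organise(aggregate(retrieve(sources, is_relevant)), key))


# Two made-up sources that share one sample.
source_a = [{"id": "s1", "tissue": "breast", "status": "tumour"},
            {"id": "s2", "tissue": "colon", "status": "normal"}]
source_b = [{"id": "s1", "tissue": "breast", "status": "tumour"},
            {"id": "s3", "tissue": "colon", "status": "tumour"}]

summary = synthesise([source_a, source_b], lambda r: r["status"] == "tumour", "tissue")
print(summary)
```

Passing the relevance predicate in from the outside is a deliberate choice in this sketch: it keeps the goal of the study explicit instead of burying it in the retrieval code.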

Given the complexity of these problems, there is an increasing need for computerised assistance to help researchers solve them. The proposed solution is a notion of synthesis, which federates the activities of information retrieval and extraction, aggregation, organisation and presentation of the data that underlie the data grasping problem. Inspired by Information Retrieval principles, this synthesis is based on a model intermediate between classical Information Retrieval and an information behaviour point of view. This model gives a central role to the goal of the mining or the hypothesis to test, by defining a task-oriented Information Retrieval.

In my thesis, the model underlying this synthesis concept allows information synthesis to be operationalised through a prototype. The prototype that has been developed is validated by case studies and a user study. It opens interesting prospects for extending the model or applying it to other application domains.

The system has been illustrated in the medical field, and in particular with Tissue Microarray (TMA) technology, a new technique which is already frequently used in oncology research. Along with global molecular studies, it allows for quick in situ visualisation of molecular targets (DNA or RNA sequences or proteins) in thousands of tissue samples.

Links

A few external resources on this topic are worth a look: