Blame: README.md - EpistasisLab/tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.


Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`# TPOT`
init 2023-03-23 18:59:13 -07:00
Add logo on readme 2024-09-24 19:39:37 -07:00			`<center>`
			`<img src="https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-logo.jpg" width=300 />`
			`</center>`

			`<br>`

refactoring tpot2 to tpot 2024-12-23 11:11:12 -08:00			`![Tests](https://github.com/EpistasisLab/tpot/actions/workflows/tests.yml/badge.svg)`
			`[![PyPI Downloads](https://img.shields.io/pypi/dm/tpot?label=pypi%20downloads)](https://pypi.org/project/TPOT)`
			`[![Conda Downloads](https://img.shields.io/conda/dn/conda-forge/tpot?label=conda%20downloads)](https://anaconda.org/conda-forge/tpot)`
Added gh-actions for test automation (#4) 2023-04-14 11:18:45 -07:00
update readme with info/license 2024-10-24 11:54:08 -07:00			`TPOT stands for Tree-based Pipeline Optimization Tool. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. Consider TPOT your Data Science Assistant.`
Update README.md 2023-05-25 12:05:58 -07:00
update readme with info/license 2024-10-24 11:54:08 -07:00			`## Contributors`

			TPOT recently went through a major refactoring. The package was rewritten from scratch to improve efficiency and performance, support new features, and fix numerous bugs. New features include genetic feature selection, a significantly expanded and more flexible method of defining search spaces, multi-objective optimization, a more modular framework allowing for easier customization of the evolutionary algorithm, and more. While in development, this new version was referred to as "TPOT2" but we have now merged what was once TPOT2 into the main TPOT package. You can learn more about this new version of TPOT in our GPTP paper titled "TPOT2: A New Graph-Based Implementation of the Tree-Based Pipeline Optimization Tool for Automated Machine Learning."

			`Ribeiro, P. et al. (2024). TPOT2: A New Graph-Based Implementation of the Tree-Based Pipeline Optimization Tool for Automated Machine Learning. In: Winkler, S., Trujillo, L., Ofria, C., Hu, T. (eds) Genetic Programming Theory and Practice XX. Genetic and Evolutionary Computation. Springer, Singapore. https://doi.org/10.1007/978-981-99-8413-8_1`

Update contributors list format in readme. 2025-01-27 14:04:08 -08:00			`The current version of TPOT was developed at Cedars-Sinai by:`
			`- Pedro Henrique Ribeiro (Lead developer - https://github.com/perib, https://www.linkedin.com/in/pedro-ribeiro/)`
			`- Anil Saini (anil.saini@cshs.org)`
			`- Jose Hernandez (jgh9094@gmail.com)`
			`- Jay Moran (jay.moran@cshs.org)`
			`- Nicholas Matsumoto (nicholas.matsumoto@cshs.org)`
			`- Hyunjun Choi (hyunjun.choi@cshs.org)`
Update readme citations 2025-09-11 14:29:35 -07:00			`- Gabriel Ketron (gabriel.ketron@cshs.org)`
Update contributors list format in readme. 2025-01-27 14:04:08 -08:00			`- Miguel E. Hernandez (miguel.e.hernandez@cshs.org)`
			`- Jason Moore (moorejh28@gmail.com)`

			`The original version of TPOT was primarily developed at the University of Pennsylvania by:`
			`- Randal S. Olson (rso@randalolson.com)`
			`- Weixuan Fu (weixuanf@upenn.edu)`
			`- Daniel Angell (dpa34@drexel.edu)`
			`- Jason Moore (moorejh28@gmail.com)`
			`- and many more generous open-source contributors`
init 2023-03-23 18:59:13 -07:00
			`## License`

refactoring tpot2 to tpot 2024-12-23 11:11:12 -08:00			`Please see the [repository license](https://github.com/EpistasisLab/tpot/blob/main/LICENSE) for the licensing and usage information for TPOT.`
Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`Generally, we have licensed TPOT to make it as widely usable as possible.`
init 2023-03-23 18:59:13 -07:00
update readme with info/license 2024-10-24 11:54:08 -07:00			`TPOT is free software: you can redistribute it and/or modify`
			`it under the terms of the GNU Lesser General Public License as`
			`published by the Free Software Foundation, either version 3 of`
			`the License, or (at your option) any later version.`

			`TPOT is distributed in the hope that it will be useful,`
			`but WITHOUT ANY WARRANTY; without even the implied warranty of`
			`MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the`
			`GNU Lesser General Public License for more details.`

			`You should have received a copy of the GNU Lesser General Public`
			`License along with TPOT. If not, see <http://www.gnu.org/licenses/>.`

documentation 2023-08-16 12:56:09 -07:00			`## Documentation`

refactoring tpot2 to tpot 2024-12-23 11:11:12 -08:00			`[The documentation webpage can be found here.](https://epistasislab.github.io/tpot/)`
documentation 2023-08-16 12:56:09 -07:00
edit 2023-08-16 12:56:49 -07:00			`We also recommend looking at the Tutorials folder for Jupyter notebooks with examples and guides.`
init 2023-03-23 18:59:13 -07:00
			`## Installation`

Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`TPOT requires a working installation of Python.`
init 2023-03-23 18:59:13 -07:00
			`### Creating a conda environment (optional)`

Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`We recommend installing TPOT in a conda environment, though it works equally well when installed without one.`
init 2023-03-23 18:59:13 -07:00
			`[More information on creating conda environments can be found here.](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)`

			```
Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`conda create --name tpotenv python=3.10`
			`conda activate tpotenv`
init 2023-03-23 18:59:13 -07:00			```

fixes, passing tests 2024-04-18 14:20:49 -07:00			`### Packages Used`

feat: Add preliminary support for Python 3.13 2025-04-17 23:51:22 +01:00			`python version >=3.10, <3.14`
fixes, passing tests 2024-04-18 14:20:49 -07:00			`numpy`
			`scipy`
			`scikit-learn`
			`update_checker`
			`tqdm`
			`stopit`
			`pandas`
			`joblib`
			`xgboost`
			`matplotlib`
			`traitlets`
			`lightgbm`
			`optuna`
			`jupyter`
add license information 2024-10-17 16:03:25 -07:00			`networkx`
fixes, passing tests 2024-04-18 14:20:49 -07:00			`dask`
			`distributed`
			`dask-ml`
			`dask-jobqueue`
			`func_timeout`
			`configspace`

			`Many of the hyperparameter ranges used in our configspaces were adapted from either the original TPOT package or the AutoSklearn package.`

init 2023-03-23 18:59:13 -07:00			`### Note for M1 Mac or other Arm-based CPU users`

Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`You need to install the lightgbm package directly from conda using the following command before installing TPOT.`
init 2023-03-23 18:59:13 -07:00
			`This is to ensure that you get the version that is compatible with your system.`

			```
			`conda install --yes -c conda-forge 'lightgbm>=3.3.3'`
			```

updating readme to include installation of extras 2023-10-12 14:25:40 -07:00			`### Installing Extra Features with pip`

Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			If you want to utilize the additional features provided by TPOT along with `scikit-learn` extensions, you can install them using `pip`. The command to install TPOT with these extra features is as follows:
updating readme to include installation of extras 2023-10-12 14:25:40 -07:00
			```
Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`pip install tpot[sklearnex]`
updating readme to include installation of extras 2023-10-12 14:25:40 -07:00			```

			`Please note that while these extensions can speed up scikit-learn packages, there are some important considerations:`

			`These extensions may not be fully developed and tested on Arm-based CPUs, such as M1 Macs. You might encounter compatibility issues or reduced performance on such systems.`

			`We recommend using Python 3.9 when installing these extra features, as it provides better compatibility and stability.`


typos 2023-03-24 09:37:32 -07:00			`### Developer/Latest Branch Installation`
init 2023-03-23 18:59:13 -07:00

			```
Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`pip install -e /path/to/tpotrepo`
init 2023-03-23 18:59:13 -07:00			```

Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`If you downloaded with git clone, then the repository folder will be named TPOT. (Note: this is the folder that contains setup.py, not the folder of the same name inside it.)`
			`If you downloaded as a zip, the folder may be called tpot-main.`
init 2023-03-23 18:59:13 -07:00

cleanup 2023-04-11 10:07:04 -07:00			`## Usage`

			`See the Tutorials Folder for more instructions and examples.`

			`### Best Practices`

			`#### 1`
Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			TPOT uses dask for parallel processing. When Python code is parallelized, each module is imported within each process. It is therefore important to protect all code within an `if __name__ == "__main__":` block when running TPOT from a script. This is not required when running TPOT from a notebook.
cleanup 2023-04-11 10:07:04 -07:00
			`For example:`

			```
			`#my_analysis.py`

Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`import tpot`
cleanup 2023-04-11 10:07:04 -07:00			`if __name__ == "__main__":`
			`    X, y = load_my_data()`
Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`    est = tpot.TPOTClassifier()`
cleanup 2023-04-11 10:07:04 -07:00			`    est.fit(X, y)`
			`    #rest of analysis`
			```

			`#### 2`

			`When designing custom objective functions, avoid the use of global variables.`

			`Don't Do:`
			```
edited global variables example 2023-04-26 10:46:28 -07:00			`global_X = [[1,2],[4,5]]`
			`global_y = [0,1]`
			`def foo(est):`
			`    return my_scorer(est, X=global_X, y=global_y)`

cleanup 2023-04-11 10:07:04 -07:00			```

			`Instead, use a partial:`

			```
			`from functools import partial`

edited global variables example 2023-04-26 10:46:28 -07:00			`def foo_scorer(est, X, y):`
			`    return my_scorer(est, X, y)`
cleanup 2023-04-11 10:07:04 -07:00
edited global variables example 2023-04-26 10:46:28 -07:00			`if __name__=='__main__':`
			`    X = [[1,2],[4,5]]`
			`    y = [0,1]`
			`    final_scorer = partial(foo_scorer, X=X, y=y)`
cleanup 2023-04-11 10:07:04 -07:00			```
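
The partial pattern above can be exercised end to end without TPOT. The following is a minimal sketch, not TPOT's own scoring code: `my_scorer`, `threshold_est`, and `main` are hypothetical stand-ins introduced only for illustration. It shows that `partial` binds `X` and `y` up front, so the resulting callable needs only the estimator.

```python
from functools import partial


def my_scorer(est, X, y):
    # Hypothetical stand-in scorer: fraction of rows where est(row) equals the label.
    hits = sum(1 for row, label in zip(X, y) if est(row) == label)
    return hits / len(y)


def foo_scorer(est, X, y):
    # Same shape as the README example: the data arrives as arguments, not globals.
    return my_scorer(est, X, y)


def threshold_est(row):
    # Toy "estimator" (an assumption for this sketch): predicts 1 when the
    # first feature is greater than 2.
    return int(row[0] > 2)


def main():
    X = [[1, 2], [4, 5]]
    y = [0, 1]
    # Bind the data once; final_scorer now takes only the estimator.
    final_scorer = partial(foo_scorer, X=X, y=y)
    return final_scorer(threshold_est)


if __name__ == "__main__":
    print(main())
```

Because the data travels inside the partial object rather than in module-level variables, the scorer carries everything it needs with it.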

			`The same applies when using lambda functions.`

			`Don't Do:`

			```
			`def new_objective(est, a, b):`
			`    ...  #definition`

			`a = 100`
			`b = 20`
			`bad_function = lambda est : new_objective(est=est, a=a, b=b)`
			```

			`Do:`
			```
			`def new_objective(est, a, b):`
			`    ...  #definition`

			`a = 100`
			`b = 20`
			`good_function = lambda est, a=a, b=b : new_objective(est=est, a=a, b=b)`
			```
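
One way to see the difference between the two lambdas, independent of TPOT, is Python's name lookup: a lambda that reads `a` and `b` from the enclosing scope looks them up when it is called, whereas default arguments are evaluated once, when the lambda is defined. The sketch below uses a stand-in `new_objective` (hypothetical, not a TPOT function) and is only meant to illustrate that general Python behavior.

```python
def new_objective(est, a, b):
    # Stand-in objective for illustration only: simply adds its inputs.
    return est + a + b


a = 100
b = 20

# Looks up a and b in the enclosing (global) scope each time it is called.
late_bound = lambda est: new_objective(est=est, a=a, b=b)

# Captures the current values of a and b as default arguments at definition time.
early_bound = lambda est, a=a, b=b: new_objective(est=est, a=a, b=b)

# Rebinding the globals afterwards affects only the late-bound lambda.
a = 0
b = 0

late_result = late_bound(1)    # computed with a=0, b=0
early_result = early_bound(1)  # computed with a=100, b=20
print(late_result, early_result)
```

The default-argument form keeps the values it was created with, which makes the function's behavior independent of whatever the surrounding variables hold later.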

			`### Tips`

Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			TPOT will not check if your data is correctly formatted. It will assume that you have passed in operators that can handle the type of data that was passed in. For instance, if you pass in a pandas dataframe with categorical features and missing data, then you should also include in your configuration operators that can handle those features of the data. Alternatively, if you pass in `preprocessing = True`, TPOT will impute missing values, one-hot encode categorical features, then standardize the data. (Note that this is currently fitted and transformed on the entire training set before splitting for CV. Later there will be an option to apply per fold, and have the parameters be learnable.)
cleanup 2023-04-11 10:07:04 -07:00

			Setting `verbose` to 5 can be helpful during debugging, as it will print out the errors generated by failing pipelines.
init 2023-03-23 18:59:13 -07:00

Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`## Contributing to TPOT`
init 2023-03-23 18:59:13 -07:00
Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.`
init 2023-03-23 18:59:13 -07:00
update readme with info/license 2024-10-24 11:54:08 -07:00			`## Citing TPOT`
init 2023-03-23 18:59:13 -07:00
update readme with info/license 2024-10-24 11:54:08 -07:00			`If you use TPOT in a scientific publication, please consider citing at least one of the following papers:`
init 2023-03-23 18:59:13 -07:00
Update readme citations 2025-09-11 14:29:35 -07:00			`Hernandez, J. G., Saini, A. K., Ghosh, A., & Moore, J. H. (2025). [The tree-based pipeline optimization tool: Tackling biomedical research problems with genetic programming and automated machine learning](https://www.cell.com/patterns/fulltext/S2666-3899(25)00162-X). Patterns, 6(7).`
init 2023-03-23 18:59:13 -07:00
update readme with info/license 2024-10-24 11:54:08 -07:00			`BibTeX entry:`
add license information 2024-10-17 16:03:25 -07:00
Update readme citations 2025-09-11 14:29:35 -07:00			```bibtex
			`@article{hernandez2025tree,`
			`title={The tree-based pipeline optimization tool: Tackling biomedical research problems with genetic programming and automated machine learning},`
			`author={Hernandez, Jose Guadalupe and Saini, Anil Kumar and Ghosh, Attri and Moore, Jason H},`
			`journal={Patterns},`
			`volume={6},`
			`number={7},`
			`year={2025},`
			`publisher={Elsevier}`
update readme with info/license 2024-10-24 11:54:08 -07:00			`}`
			```
add license information 2024-10-17 16:03:25 -07:00
Update readme citations 2025-09-11 14:29:35 -07:00			`Ribeiro, P., Saini, A., Moran, J., Matsumoto, N., Choi, H., Hernandez, M., & Moore, J. H. (2024). [TPOT2: A New Graph-Based Implementation of the Tree-Based Pipeline Optimization Tool for Automated Machine Learning](https://link.springer.com/chapter/10.1007/978-981-99-8413-8_1). In Genetic programming theory and practice XX (pp. 1-17). Singapore: Springer Nature Singapore.`

			`BibTeX entry:`

			```bibtex
			`@incollection{ribeiro2024tpot2,`
			`title={TPOT2: A New Graph-Based Implementation of the Tree-Based Pipeline Optimization Tool for Automated Machine Learning},`
			`author={Ribeiro, Pedro and Saini, Anil and Moran, Jay and Matsumoto, Nicholas and Choi, Hyunjun and Hernandez, Miguel and Moore, Jason H},`
			`booktitle={Genetic programming theory and practice XX},`
			`pages={1--17},`
			`year={2024},`
			`publisher={Springer}`
			`}`
			```
add license information 2024-10-17 16:03:25 -07:00
update readme with info/license 2024-10-24 11:54:08 -07:00			`Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). [Automating biomedical data science through tree-based pipeline optimization](http://link.springer.com/chapter/10.1007/978-3-319-31204-0_9). Applications of Evolutionary Computation, pages 123-137.`
add license information 2024-10-17 16:03:25 -07:00
update readme with info/license 2024-10-24 11:54:08 -07:00			`BibTeX entry:`
add license information 2024-10-17 16:03:25 -07:00
update readme with info/license 2024-10-24 11:54:08 -07:00			```bibtex
			`@inbook{Olson2016EvoBio,`
			`author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},`
			`editor={Squillero, Giovanni and Burelli, Paolo},`
			`chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},`
			`title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},`
			`year={2016},`
			`publisher={Springer International Publishing},`
			`pages={123--137},`
			`isbn={978-3-319-31204-0},`
			`doi={10.1007/978-3-319-31204-0_9},`
			`url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}`
			`}`
			```
add license information 2024-10-17 16:03:25 -07:00
update readme with info/license 2024-10-24 11:54:08 -07:00			`Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). [Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science](http://dl.acm.org/citation.cfm?id=2908918). Proceedings of GECCO 2016, pages 485-492.`

			`BibTeX entry:`

			```bibtex
			`@inproceedings{OlsonGECCO2016,`
			`author = {Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. and Moore, Jason H.},`
			`title = {Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},`
			`booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference 2016},`
			`series = {GECCO '16},`
			`year = {2016},`
			`isbn = {978-1-4503-4206-3},`
			`location = {Denver, Colorado, USA},`
			`pages = {485--492},`
			`numpages = {8},`
			`url = {http://doi.acm.org/10.1145/2908812.2908918},`
			`doi = {10.1145/2908812.2908918},`
			`acmid = {2908918},`
			`publisher = {ACM},`
			`address = {New York, NY, USA},`
			`}`
			```
init 2023-03-23 18:59:13 -07:00
Update readme citations 2025-09-11 14:29:35 -07:00			`## Related Papers`

			`Trang T. Le, Weixuan Fu and Jason H. Moore (2020). [Scaling tree-based automated machine learning to biomedical big data with a feature set selector](https://academic.oup.com/bioinformatics/article/36/1/250/5511404). Bioinformatics, 36(1): 250-256.`

			`BibTeX entry:`

			```bibtex
			`@article{le2020scaling,`
			`title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},`
			`author={Le, Trang T and Fu, Weixuan and Moore, Jason H},`
			`journal={Bioinformatics},`
			`volume={36},`
			`number={1},`
			`pages={250--256},`
			`year={2020},`
			`publisher={Oxford University Press}`
			`}`
			```


Skip test_tpot_estimator_predict temporarily 2024-12-11 15:34:33 -08:00			`## Support for TPOT`
init 2023-03-23 18:59:13 -07:00
Update TPOT2 in readme 2024-12-11 12:24:51 -08:00			`TPOT was developed in the [Artificial Intelligence Innovation (A2I) Lab](http://epistasis.org/) at Cedars-Sinai with funding from the [NIH](http://www.nih.gov/) under grants U01 AG066833 and R01 LM010098. We are incredibly grateful for the support of the NIH and Cedars-Sinai during the development of this project.`
init 2023-03-23 18:59:13 -07:00
typos 2023-03-24 09:37:32 -07:00			`The TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.`