We are thrilled to announce that pyOpenSci has received 2 years of funding to cover core operations from the Chan Zuckerberg Initiative (CZI). These CZI funds will be used to continue critical pyOpenSci work that:
Open and reproducible science builds trust and accelerates research and discovery. Open science supports scientific research that is both transparent and reusable. Free and open source software is critical to open science because it ensures that the analyses of research data are broadly accessible. To build truly open research workflows, scientists need to use free and open source software (FOSS). FOSS removes the barriers that licenses and other fees may create, making diverse participation more accessible.
Broad inclusion of underrepresented and historically excluded individuals is critical to pyOpenSci’s mission. The open source ecosystem has been found to be even less diverse than the broader tech community. To help remedy this, pyOpenSci empowers everyone to participate in both the development and use of open source software. This empowerment enables open science to reach its full potential.
Despite the importance of open source software to fundamental open science principles, open source maintainers do not get the support they need. Maintainers need both institutional and community support to learn open source software development, to maintain their tools, and to engage with the broad user base that may begin to use those tools in support of open science.
pyOpenSci is pushing for change. We envision a world where:
The three programs pyOpenSci runs are:
Peer review of scientific Python software - The pyOpenSci software peer review process helps scientists find the vetted and trusted tools they need to build reproducible open science workflows. We also empower our community with critical open science skills that support contributing to open source software.
Community partnerships with domain-specific scientific Python communities. Domain-specific communities partner with pyOpenSci to leverage pyOpenSci’s peer review process as a way to track vetted, high-quality tools. Communities also support the development of Python packaging guidelines in an effort to streamline packaging recommendations across the scientific Python ecosystem.
Training and resources to help scientists develop and maintain high-quality, accessible, open source software. Our community developed the Python packaging guide, which provides resources and tutorials that help scientists navigate a complex Python packaging ecosystem. The Python packaging guide also makes recommendations for community accepted best practices.
pyOpenSci is excited to grow a more inclusive and supportive scientific Python community in 2024!
We are always looking for volunteers to support our community programs. If you are interested in getting involved with pyOpenSci, learn more here.
I’m thrilled to be joining pyOpenSci (pyOS) as the Community Manager, and bringing my experiences as a researcher, educator, and developer advocate to support the creation and maintenance of free and open Python tools for processing scientific data. I’ve spent most of my adult life building online communities, from video games to programming to data science, and it’s exciting to see so many familiar faces in my first weeks at pyOS.
One of the things I’m most excited to bring to pyOS is the creation of engaging, approachable, and accessible technical content around some of the thornier Python issues, like packaging! Whether it’s blog posts, social media content, or videos, I want to ensure that our educational materials are available to everyone, regardless of geography.
Another area that I’ll be particularly focused on is the pyOS Peer Review program. The Peer Review program with pyOS supports scientists in getting credit for the efforts they’ve invested in open source Python tools, while also supporting the standardization of packaging and improved package visibility. And as the pyOS Community Manager, I want to help celebrate all of the hard work that authors, maintainers, and editors put into package development by sharing behind-the-scenes stories, package announcements, and the contributions of every individual with the broader community.
When I’m not helping build equitable, diverse, and accessible technical communities, I’m chilling with my two cats, Jinx and Luna, or riding my bike on the incredible gravel trails in the Chicago area.
I first got started in scientific programming during my time as a graduate student in Immunology and Infectious Diseases. Learning R opened up a world of possibility for me, and I eventually used that early experience to build a career in data science. Out of all the learning resources available, I found the community to be the most critical to my success. While I’m no longer spending my days coding, I wanted to continue working with supportive, welcoming, and educational online communities focused on helping their members solve technical challenges.
pyOpenSci meets all of those criteria, and the more I learned about its philosophy around community building, as well as all of the ways for people of all levels of technical knowledge to get involved, the more I knew I wanted to be a part of things. I’m still within my first month at the organization, but I can already tell that it’s a special place to be. Keep an eye out, because I’ve already started planning content and activities to help grow our community!
I’ll also be managing the social media accounts for pyOpenSci, and there are so many fantastic places for us to connect and have a conversation. pyOpenSci is currently active on LinkedIn, BlueSky, and Mastodon, where we’ll be sharing all kinds of news, updates, and community spotlights. Be sure to follow us so you don’t miss out on anything! And join our growing Discourse community, which is a great place to connect with other pyOpenSci community members, learn more about pyOpenSci, and get answers to all of your burning Python packaging and code questions.
I can’t wait to connect and help build the pyOpenSci community with you!
In October 2023, the United States Research Software Engineer Association (US-RSE), funded by the Sloan Foundation, held its first-ever meeting.
I attended this meeting and led a community session around our peer review process and Python packaging. Key TL;DR takeaways were:
The good news: the exact pain points described to me at this meeting are the ones that pyOpenSci is taking on. We are currently working on an end-to-end packaging tutorial that we hope will shed light on and demystify a complex but vibrant ecosystem of Python packaging tools and options.
So, what is a Research Software Engineer (RSE)? While the position has existed for some time, an RSE can be loosely defined as someone who uses code regularly to do research. Many RSEs also work on developing software, often in Python.
As a side note, this is a position that traditionally hasn’t been supported in academic environments, even though this work is critical to research in the open science space. The RSE position should be a clearly defined, funded, respected, and supported career path in all academic institutions that want to embrace open science as a way of doing research. My opinion: academia needs to extend the definition of academic products to include not only publications but other outputs, such as software, that are equally valuable and critically important. RSE roles should be associated with clear academic career paths.
Why? Because software is DRIVING open science. Without open source code, you can’t make a workflow truly open and reproducible. This is why peer review is so important to pyOpenSci. We want to support maintainers in both developing and maintaining the critical software that is driving science. We also want them to get credit for their work which is why we partner with the Journal of Open Source software.
The RSE position should be a clearly defined, funded, respected and supported career path in all academic institutions that want to embrace open science as a way of doing research.
OK, I’ll casually step off my soapbox now… back to our regularly scheduled programming…
I led a pyOpenSci Birds of a Feather (BoF) session at the Chicago RSE meeting. BoF sessions are informal community gatherings around a specific topic. BoFs provide a chance for the community to engage with each other, ask questions, provide input, and even get involved in an effort.
I spent most of my time in this 1.5 hour session talking about pyOpenSci and the work we are doing related to:
We had a lively discussion around packaging - more on that below.
I used Mentimeter to drive an engaging and interactive session. You can check out the slides below:
Mentimeter allowed me to capture audience feedback in several forms, both verbally and via phones and computers.
I’ll try to summarize it all for you here.
In our BoF, I introduced the three core programs that pyOpenSci currently runs which are:
One of our community members, Isabel, suggested a great icebreaker question: how long has it been since you last had a broken Python environment?
It is no surprise that most Pythonistas regularly deal with environment challenges.
So if you’ve been in this boat too, you are not alone!
Full disclosure: the one person who voted for 0 days admitted that they hadn’t used Python in the past month. :)
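If broken environments plague you too, one common mitigation (a sketch, not something covered in the session itself) is to give every project its own virtual environment, so one project’s installs can’t clobber another’s. The `demo-venv` name below is illustrative:

```shell
# Create and use an isolated, per-project virtual environment.
python3 -m venv demo-venv                 # create the environment
. demo-venv/bin/activate                  # activate it (bash/zsh; Scripts\activate on Windows)
python -c "import sys; print(sys.prefix)" # now points inside demo-venv
deactivate                                # leave the environment when done
```

Deleting the `demo-venv` directory resets the project to a clean slate, which is usually faster than untangling a broken shared environment.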
In the BoF I asked some questions related to Python packaging so we could have a discussion around some of the challenges. This is useful to pyOpenSci as we develop our packaging guide and associated resources to guide the scientific community through the process of creating a package.
Below you can see a word cloud generated from the question
“What Python packaging tools have you used?”
A few things that popped out to me included:
The broad takeaway from this graphic is that there are a LOT of tools available for scientists to use. Becoming familiar with all of these is a big ask for a scientist who just wants their code to be reusable by others.
We have some work to do at pyOpenSci to demystify this ecosystem!
The other telling question was “What are your biggest challenges in the Python packaging ecosystem?”
The responses were varied but can be grouped into several themes.
It’s no surprise that there were a handful of responses related to the sheer volume of packaging options, which makes it hard to figure out which path to take when creating a package. Not only that, but the options have changed over time as standards in the ecosystem have evolved.
I think the comment below summarizes this well:
too many options, and tutorials feel like consensus documents rather than making strong recommendations for One Best Way
The good news is that this is exactly the pain point that pyOpenSci is working on. You can check out our community-driven packaging guide which presents an overview of the ecosystem with recommendations for best practices. This guide has been reviewed by dozens of Pythonistas in our ecosystem including those who built and maintain core packaging tools.
Currently, we are developing packaging tutorials that answer the most fundamental question:
How do I create a (pure) Python package?
All of our packaging content is community driven, created using a robust review process that draws on packaging experts from across our ecosystem. We feel confident that we will be able to shed some light on this complex and evolving ecosystem.
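To make that question concrete: a modern pure Python package needs little more than a single `pyproject.toml` file declaring a build backend and project metadata. The sketch below is illustrative only; the package name is made up, and hatchling is just one of several available backends:

```toml
# Minimal pyproject.toml for a pure Python package (illustrative sketch).
[build-system]
requires = ["hatchling"]          # hatchling is one of several build backends
build-backend = "hatchling.build"

[project]
name = "my-science-package"       # hypothetical package name
version = "0.1.0"
description = "An example scientific Python package"
requires-python = ">=3.9"
dependencies = ["numpy"]          # runtime dependencies go here
```

With this file plus a directory of source code, running `python -m build` produces installable wheel and sdist files.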
Admittedly, many of the questions that I received in this session were about packaging. The community is generally frustrated, sometimes turning to tools such as ChatGPT to ask questions about packaging.
As someone who has spent a large amount of time testing ChatGPT with packaging questions, I can tell you with certainty that it can lead you in a confusing direction, as it doesn’t have its packaging facts straight.
Proceed with caution!!
We also got some other questions about our peer review process, how we interact with JOSS (Journal of Open Source Software) and how our partnership programs work.
I’ll save those questions for a follow-up post on pyOpenSci programs.
My key takeaways from the US-RSE meeting are:
As for me, I hope to attend the 2024 RSE meeting and to continue this important conversation around Python packaging and peer review of software.
IMPORTANT: we are no longer accepting applications for this position. Thank you for your interest.
pyOpenSci is accepting applications for a Community Manager. The Community Manager supports growth and development of an inclusive pyOpenSci community. Our vibrant community is dedicated to supporting high quality Python open source software that drives open science.
While our organization is global, we can only accept applications from candidates who are eligible for employment in the United States.
pyOpenSci is a diverse scientific open source / open science community that:
pyOpenSci’s core program is open peer review of scientific Python software. Through peer review we enforce community-defined packaging standards while improving usability, documentation, and package quality. Further, we are creating a catalog of vetted, actively maintained scientific Python tools for scientists to use.
Core to our mission is increasing participation of groups that have been historically underrepresented in, or excluded from, the open source and open science community. We do this through mentorship, training, and strategic partnerships with organizations such as MetaDocencia and Open Science Labs.
pyOpenSci is a community-owned organization, fiscally sponsored by Community Initiatives. We are grateful to be funded by the Sloan Foundation.
We are looking for someone with experience managing, engaging with and inspiring large, diverse, online communities. This person will promote pyOpenSci’s mission to the broader scientific and tech communities. They will also help develop educational content around Python packaging and open science workflows.
We are looking for someone that is excited about:
We are also looking for someone who is committed to DEIA work and has experience working on programs that aim to increase diversity in the scientific open source software community. Ideally, we’d like someone who is familiar with both the Python programming language and the goals of open science.
This position provides the opportunity for someone to drive the pyOpenSci mission of building a diverse and supportive community forward.
This is a full-time position with benefits with an ideal start date around November 2023 but no later than January 1, 2024. This position will report to the pyOpenSci Executive Director. Applicants must be eligible for employment in the United States.
Review of applications will begin on September 1, 2023 and will be ongoing until the position is filled. All positions in our organization are grant funded. We currently have funding for this position through at least June 2025. Assuming continued project success, we will continue to seek additional funding to extend the work beyond June 2025.
This position is fully remote. We also prefer that you live in the United States to simplify asynchronous remote work. IMPORTANT: While our reach is global, we can only hire someone who is either a resident or citizen of the United States, or a resident with a valid work permit.
All work that you do should exemplify our community Code of Conduct.
Community Initiatives is an equal opportunity employer and gives consideration for employment to qualified applicants without regard to age, race, color, religion, creed, sex, sexual orientation, gender identity or expression, national origin, marital status, disability or protected veteran status, or any other status or characteristic protected by federal, state, or local law.
Posting Contact Information
For questions about the job, please email: admin at pyopensci.org
I was so excited for SciPy this year.
I wanted to spread the word about pyOpenSci’s core mission - supporting the scientific open source Python community. I wanted to get more people involved.
pyOpenSci represents everything that matters most to me:
I am not used to going into a meeting with no specific plans or obligations. While pyOpenSci didn’t get a talk or a community session / BoF this year, we did get a lightning talk! It was a randomized selection, and I threw my name into the bucket (literally) with fingers crossed that I’d get a lightning talk.
And on the final day of the meeting, I was selected to present!
@pyOpenSci got the cutest slides at the lighting talk @SciPyCon #SciPy2023 pic.twitter.com/ZXleLpdkqB
— Cheuk Ting Ho (@cheukting_ho) July 14, 2023
Let me give you the backstory on lightning talks at SciPy. It’s known that moderators will often “play” with those presenting.
Puns are always pervasive and community embraced!
This year there was a “sea” theme featuring sharks and crab claws. 😂 Watch below as the session kicks off with a crab claw pun from Paul, followed by a shark attack on yours truly from Madicken. You will also learn about the pyOpenSci mission and vision.
A sprint, in the tech world, is a short time period where people on a team work together to complete something on a technical project. At conferences, there are often open sprints. The idea here is that people, often some of whom are new to a project, get together in person and work on things that the project needs.
In our open source world we also have mentored sprints. The term mentored sprints was coined by an amazing team of people including Tania Allard (whose passion for open source and open data resonates with my own). Mentored sprints focus on supporting those who are new to sprinting, and to platforms such as GitHub, in making their first contribution to open source.
Given pyOpenSci’s core values around diversity, equity, and inclusion, every sprint we hold is a mentored sprint as far as I’m concerned!
This was the second sprint that I’ve led, with the first being at pyCon US 2023.
My friend, colleague and esteemed pyOpenSci advisory council member, Inessa Pawson taught me that:
I went into our SciPy 2023 sprint with a more organized pyOpenSci help-wanted board. This board has been a great way to keep track of things that we need help with.
GitHub PROTIP: I struggled at PyCon with assigning people who didn’t belong to a repository or our organization to specific issues. Now, I know that if someone comments on an issue first, I can then assign it to them (many thanks to Thomas Fan for the tip!!).
I am absolutely blown away by and profoundly grateful for the support that pyOpenSci received at this year’s SciPy sprints!
We had over 20 pull requests emerge from this sprint - WOW! Two sprinters also submitted their first ever contributions!!
Info: a pull request, known as a “PR”, represents a set of suggested changes to code or text. In the GitHub.com interface you can view the suggested changes and comment on them, in the same way that you might comment on suggested changes in a Google Doc.
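Under the hood, a pull request is built from an ordinary git branch. A minimal local sketch of that workflow, with hypothetical repository and file names:

```shell
# Create a toy repository with one commit, then propose a change on a branch.
git init demo-repo && cd demo-repo
git config user.email "you@example.com"
git config user.name "Demo User"
echo "Hello" > README.md
git add README.md && git commit -m "Initial commit"

git switch -c fix-typo                 # feature branch holding the proposed change
echo "Hello, world" > README.md
git commit -am "Fix typo in README"

# On GitHub you would now push the branch and open the pull request, e.g.:
#   git push -u origin fix-typo
#   gh pr create --title "Fix typo in README"
```

The pull request itself is then just GitHub’s review interface layered over the diff between `fix-typo` and the default branch.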
Some of the contributions included:
In case you are curious, most of the pull requests submitted during the sprint this year are listed below:
I left before day two of the sprints. However, that did not stop the community from continuing to sprint and contribute to pyOpenSci! People continued to work on additional website fixes that were still open on our project board.
I learned a lot this year from SciPy.
Sometimes the best moments are the unexpected ones. I had the chance to connect with amazing individuals and share pyOpenSci’s impactful mission that I care about so deeply.
And the best part? Our pyOpenSci community continues to grow, attracting more wonderful Pythonistas who share our vision. Together, I’m confident that we will make a positive impact on the scientific open source Python community.
That’s what truly matters.
And I gave out a lot of pyOpenSci stickers too!
My approach to participating in SciPy was so much better than that at pyCon.
I learned some valuable lessons about taking care of both my work and my mental well-being. As an introvert in a busy meeting filled with awesome colleagues, it’s easy to get burnt out.
Here’s what I did to make sure I left the meeting feeling refreshed and energized:
🌟 Embraced breaks: During the meeting, I consciously took short breaks to unwind. Whether it was chilling in my hotel room or going for a stroll outside, giving my brain a breather made a world of difference. And guess what? I slept better at night too!
In the end, I may have missed a bit of the action, but the payoff was totally worth it. I left the meeting feeling way better than I did after PyCon.
So, fellow introverts, remember this little secret weapon called “recovery time” at your next big event! It’s a game-changer!
Back in March 2023, I made a bold decision to leave a toxic academic environment and fully dedicate myself to building and growing pyOpenSci—an amazing, community-focused organization.
Let me tell you, taking that leap of faith was pretty intimidating. The academic setting had taken a toll on me, shattering my confidence and even affecting my health. But I knew in my heart that I wanted to channel all my energy into community work, collaborating with people who respected and appreciated me as much as I respected them.
And guess what? This journey has been beyond my wildest dreams! Not only has the pyOpenSci community thrived and made a remarkable impact in just its first year, but it has also turned out to be the kind of inclusive, supportive community I always envisioned.
It’s incredible how not only is pyOpenSci helping others, but it’s also been a source of support and healing for me. I couldn’t be more grateful for this vibrant and uplifting environment that we’ve created together.
I’ll keep pushing forward, knowing that this beautiful journey is just the beginning.
Thank you, SciPy, for supporting me and reinforcing the fact that I made the right decision! And I’d be remiss if I didn’t also thank the pyOpenSci community that is truly bringing pyOpenSci’s vision to life.
And that is all I have to say about SciPy 2023! It was an incredible experience. If you are reading this and we connected at SciPy this year, or if you contributed to pyOpenSci this year, I just want to say thank you.
From the bottom of my heart. I see change coming in the upcoming years. pyOpenSci wants to be a part of and to drive that change!!
We can’t achieve that without your help!
This year was my first time attending pyCon US! I was intimidated to attend such a big Python meeting. For years I’ve attended science meetings such as AGU (American Geophysical Union), ESRI (GIS) user conferences, and ESA (Ecological Society of America). I’ve been to and led data science hackathons and been to the annual SciPy meeting. But I’ve never been to a pure tech conference.
Even after teaching data science using R and Python for the past 10 years I STILL feel like an imposter sometimes.
What’s up with that?
But I went and had a fantastic time. Getting to talk to people all day, every day about all things Python felt like how I might imagine a trip to Disneyworld feels for an 8 year-old… (minus the cotton candy, costumes and the upside down rides).
I felt energized, excited. I learned so much and met SO MANY incredible people.
A few highlights of the meeting are below.
Did I mention packaging? No?
Ok well I spent a lot of the meeting talking to people about Python packaging.
I even got to present in the maintainers summit (see the video below) on… guess what?
PYTHON PACKAGING!
It was my first time recording myself talking formally about packaging and using OBS studio. And I have to say I have some sort of shifty eye syndrome going on. I think I still have a bit to learn from those YouTubers on video creation!
This presentation echoed the sentiment that I shared in this blog post about Python packaging.
If you’re short on time, the takeaways of the talk were:
I spent a bit of time in that video talking about how we create our guides.
Another part of our packaging guide review process is getting input from packaging experts in the community. These experts come from the core Python community, the packaging community, the scientific community, and even maintainers of core packaging tools.
Leave no stone unturned (my motto when doing most things).
If you haven’t already guessed, I have a deep-seated love for peer review. In my mind, anything produced that requires a lot of technical knowledge will only be improved when vetted by lots of people.
Sure, it takes a lot of extra work to produce a guide that way. And it slows down the process. But, the end product will be worth it.
As such, we not only review Python packages at pyOpenSci; we also make sure that all of our content is heavily reviewed.
So I have to say this. While pyCon was the most wonderful experience, I did have a few awkward moments. For one, I was often one of the few female-identifying people in the room. This was particularly true at the packaging summit, where I gave a small presentation related to PyPA and packaging tools!
I did feel welcomed and included. BUT, I can say that it is interesting to walk into a room and know that you are different. I noted the things that made me feel super welcome and comfortable.
My 1st #PyCon2023 is FUN. As a female in STEM, tech & #opensource i'm walking into & presenting in rooms full of men. it's been supportive but i've found comfort in ppl actively welcoming me. taking notes 3 developing @pyopensci how 2 implement inclusion strategies #PyConUS2023
— Leah Wasser @leahawasser@fosstodon.org 🦉 (@LeahAWasser) April 22, 2023
I can’t even begin to highlight ALL OF THE AMAZING PEOPLE I met at pyCon. In some cases I met folks who I had been interacting with online, such as C.A.M. from the Python core team, who I “met” on the Python Discourse. Or Pradyun, another Python core dev who has been involved with pyOpenSci, providing guidance on the packaging space for months. Pradyun is also a part of our advisory council. His expertise is invaluable to our organization.
I got to meet Erik, whose Python package, python-graphblas, is going through our peer review process right now.
I met Chase, CEO of a really cool company called Million Concepts and some of the folks from the Python Heliophysics community.
Finally, I got to meet and hang out with the amazing Inessa Pawson. If you haven’t heard of Inessa’s work, it’s extraordinary. Inessa has been working as a contributor lead for the NumPy project and is also a project manager for Open Teams.
There are so many other people who I got to know and build working relationships with at this meeting.
pyOpenSci also led two sprints at pyCon! On Sunday we led a mentored sprint. If you haven’t heard of mentored sprints, they are an amazing format that allows those who are newer to contributing to open source to get support in the contribution process.
At the mentored sprints I was grateful to have Luiz Irber and David Nicholson supporting the (larger-than-expected) group! Luiz served as editor for pyOpenSci years ago during one of our very first reviews. And David is now our editor in chief of pyOpenSci.
We had a full table plus an overflow table of people who wanted to contribute! And each of them was able to contribute (many for their very first time!!). It was awesome.
The people at the sprint were not the people who I expected. Many of them had significant technical skills and backgrounds. But many of them also had never committed to an open source project. Perhaps they had used Subversion but not git / GitHub. Perhaps they knew Python well but didn’t know where to start in terms of contributing.
In total we ended up with 8 pull requests submitted during the sprints and 2 others submitted after. Every pull request was from a new contributor to pyOpenSci. And also they were mostly made by those new to contributing in general.
If you want to check any of them out - please click on any of the links below!
Contributions to open source tools and communities can come in all shapes and sizes. Note that some of the items below are small fixes (which are a huge help). And others are a bit more involved.
These PRs are highlighted here because it’s important to know that not all contributions need to be highly technical. They can be text fixes that are equally important to projects.
In the spirit of understanding that not all contributions need to be code: the PRs below, submitted by Jeremy, identified lots of typos in our packaging guide. This was SO SO helpful to us!!
All of the contributors are now listed on our website. And we are grateful for each and every one of them!
I’m sure I don’t have to tell you the answer to that question.
heck-yea I would!
The people I met at that meeting even sitting in the lunch room were incredible. And many of them have become colleagues that I am still in touch with and who are now involved with pyOpenSci.
One of the other highlights of the meeting that I can’t forget to mention is David getting up on the ginormous speaker stage in the main ballroom and giving a lightning talk about his experience submitting Crowsetta, a Python package, to pyOpenSci. Check out a blog on his talk and watch the 5 minute presentation here. It was awesome!
David Nicholson, our pyOpenSci Editor in Chief, gave a fantastic lightning talk this year at pyCon US 2023. This year’s pyCon was held in Salt Lake City, Utah in April. David braved the expansive keynote room stage - talking to a gigantic room full of Pythonistas. He spoke about his experience going through our scientific Python software peer review process.
Just a few months prior, David had submitted a package he’d been developing called Crowsetta, which helps scientists work with annotations for animal vocalization and bioacoustics data. Given that he is the Editor in Chief of our peer review process, wonderful volunteers from our editorial team stepped in to run the review to ensure it wasn’t in any way biased.
In his talk (which was NOT about ChatGPT in case you were wondering :) ), David talked about who was involved and what the process and his experience was like. Check out the video below to learn more!
The talk itself is about 5 minutes long but you can always keep it running to see the other lightning talks posted by the pyCon organizers. All of the pyCon 2023 US talks are online if you want to check out some of the others! You can also check out our pyOpenSci presentation about Python Packaging and experiences with the sprints here if you’d like.
Crowsetta went through our pyOpenSci peer review process in spring 2023. If you want to check out the review for Crowsetta, click here. David also took advantage of our partnership with the Journal of Open Source Software (JOSS), which allowed Crowsetta to become both a vetted pyOpenSci tool and to get a Crossref-enabled citation from JOSS. Through this partnership, JOSS accepts our review as theirs and only reviews the submitted paper.
David’s talk can be found in our pyOpenSci Zenodo community, here.
I’ve spent the last few months working on creating a Python packaging guide. This guide seeks to help those creating new scientific Python packages select a packaging tool and workflow. This guide also supports the pyOpenSci peer review process.
Below, I provide a brief overview of our content development process, given that the packaging tools chapter of the guide has been published! Yay!
There are a few key takeaways from this post:
The packaging chapter of our guide is online now! Stay tuned for more content on environments, CI and testing!
In the Fall of 2022, in support of my new role as Executive Director of pyOpenSci, I began to explore Python packaging tools in an effort to update our guidebook in support of our package peer review program.
I saw significant community confusion around how to create a Python package. But, in my mind, it wouldn’t be that big of a challenge to create a guidebook.
I just needed to find the combination of tools and standards that we could recommend to people in an attempt to demystify the packaging ecosystem.
No problem, right?
At the same time I noticed that many did not want to talk about Python packaging. And I wondered, why?
I’ve worked on the development of three other Python packages. Each time, my approach to creating a package started with a question:
GeoPandas is a spatial library that supports working with vector data (think points, lines, and polygons). I decided to follow its structure because I greatly respected the GeoPandas maintainers, and I had contributed to the package.
My approach to packaging was: “monkey see, monkey do”. I was the monkey.
I also munched on some bananas. It worked out alright.
Copying a package’s structure is like copying code from Stack Overflow and pasting it into your workflow in hopes that it runs. If it doesn’t run, you don’t know enough to fix it! Frustration sets in.
However, on Stack Overflow you can at least see when a post was published and know that it might be dated. I found it hard to find up-to-date information on Python packaging tools, which was particularly challenging given how many tool options there were. And each tool’s documentation assumed some depth of knowledge about Python packaging.
Where does the authoritative and complete guide to packaging live, and who maintains it? Further, is it helpful enough for a beginner to dig into and get started quickly?
I’ve taught data-intensive science for almost 20 years. If there is one thing I know about teaching those new to technical areas, it is that early wins are critical. Whether the win is creating a simple data plot within the first 20 minutes of a workshop, or using an init command in PDM to create a package structure, early wins can motivate a beginner’s mind.
I struggled to find any resources that provided users of Python packaging tools with early wins. Rather, I found that I needed to increase my technical knowledge of packaging to even understand many of the resources out there.
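As a concrete illustration of an early win, a command like `pdm init` asks a handful of questions and scaffolds a `pyproject.toml` roughly like the fragment below. This is a sketch only: the package name is made up, and the exact fields and backend names depend on your PDM version and your answers to the prompts.

```toml
# Approximate output of `pdm init` for a new pure Python package
[project]
name = "mypackage"            # hypothetical package name
version = "0.1.0"
description = "A small scientific helper package"
requires-python = ">=3.8"

[build-system]
requires = ["pdm-backend"]
build-backend = "pdm.backend"
```

A few prompts and one small file later, you have a buildable package skeleton: exactly the kind of quick win that keeps a beginner motivated.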
To support pyOpenSci’s goals of making packaging easier for scientists while also improving package quality, I knew we needed to create a guide that would help others navigate the packaging ecosystem. At a minimum, helping users understand the tool landscape and how to pick a tool was a good start.
From all of the above, I came to the conclusion that Python packaging is not bad. It’s just not well documented. If people understood what all of the tools did and how to pick one, it might be akin to shopping for a car*.
*But without the annoying salesperson who might assume you know nothing about cars if you are a woman…
In creating this guide, I talked with scientific Python tool maintainers, folks from the PyPA and Scientific Python projects, and maintainers of core packaging tools (such as Flit, Hatch, Poetry, and PDM) to get insights into common workflows, common challenges, and the tools folks are using. This guide has been a true example of community-driven content. If you are curious, you can see the contributor list here.
The packaging chapter alone had over 200 comments to address in round 1 of review, and another 200+ in round 2. All of the chapters in our guide go through community review; however, this particular chapter elicited a LOT of strong responses regarding which tools do what and how they should be described.
Sometimes, the discussions got tense. People have strong opinions about packaging approaches, and not everyone agrees on the best technical approach. Even more interesting: many of those involved knew something about some of the tools, but often that knowledge was based on word of mouth or a quick glance at documentation (largely because the tools are evolving quickly). The people who knew the most were also the most technical, and were often involved in the actual development of the tools.
My takeaway from all of this:
After hundreds of comments and conversations;
After testing each one of the tools in our guide with a start to end workflow;
My takeaway is that Python doesn’t have a packaging problem (if you are a user creating a pure Python package).
Python has a much more human problem: approaches to packaging are simply unclear, not well documented, and often heavily debated.
Further, the standards created for Python packaging, while important, live on a website that is not intended for the broader public to use.
Sure, there are many tricky parts to packaging. And understanding the standards can be even trickier. This is certainly not a perfect system.
However, we can create packages using the existing tools – now! I promise, this is true.
It’s just (extremely) hard to figure out:
The Python ecosystem is evolving rapidly, so it’s hard to know which approaches are the most current. Those who deeply understand the packaging challenges represent a small, technically proficient subset of the community.
In general, users want to use the simplest approach to publish their packages online.
Remember - early wins go a long way.
At the same time, there is no good assessment that I’ve seen of the tools that exist to help users in the ecosystem. I had questions about:
It was clear that people want that guidance.
With all this said, I’ll now set the stage for what’s to come from pyOpenSci in the upcoming months, and what I’ve learned so far.
Right now, Poetry is the most commonly used modern packaging tool. Have a look at its documentation and you’ll see why! PDM, however, has numerous features that are ideal for the scientific ecosystem’s needs.
Specifically, it allows you to use different build backends, which is good news whether you are creating a pure Python package OR a package with some C/C++ extensions.
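To illustrate what swapping a build backend looks like, here is a hedged sketch of the `[build-system]` table in `pyproject.toml`. The backend names below reflect my reading of the PDM and meson-python docs at the time of writing, so double-check current versions before copying:

```toml
# Pure Python package: PDM's own backend
[build-system]
requires = ["pdm-backend"]
build-backend = "pdm.backend"

# ...or, for a package with C/C++ extensions, a compiled-code-aware
# backend such as meson-python:
# [build-system]
# requires = ["meson-python"]
# build-backend = "mesonpy"
```

Because this table is a standard interface, the front-end tool you use day-to-day doesn’t have to be the thing that builds your distribution.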
Poetry can’t (yet) be a single solution to packaging because right now its support for non-pure-Python builds is not documented (and might never be). But it could be a great solution for those creating pure Python packages.
In the next few blog posts, I’m going to present each Python build workflow tool, including:
I’ll break down the pros and cons of using each tool and provide examples of what using it looks like. In the meantime, check out our shiny new packaging chapter here to see an overview of packaging tools and approaches for scientists creating pure Python packages.
In the very near future I’ll also create some packaging tutorials to help you get started creating a new package. Stay tuned for more on that as well!
If you are just getting started with Python packaging OR if you have questions about it, please use our discourse forum to ask questions. We are happy to help!
This blog post is part 3 of a 3-part series on open source package health. The series was inspired by a conversation held on Twitter. This blog post is not a comprehensive perspective on what pyOpenSci plans to track as an organization. Rather, it’s a summary of thoughts collected during the conversation on Twitter that we can use to inform our final metrics.
In this post, I’ll summarize a conversation held on Twitter that gauged what the community thinks about metrics to track the health of scientific Python open source packages.
There are many different ways to think about and evaluate open source Python package health.
Below is what I posted on Twitter to spur a conversation about what makes a package healthy. And more specifically what metrics should we (pyOpenSci) collect to evaluate health.
My goal: to see what the community thought about “what constitutes package health”.
controversial topic: How do we measure the "health" of a #science #python package? GitHub stars? downloads, date of latest commit? # of commits a month / quarter? Spread of commits? Thoughts? #opensource #OpenScience @pyOpenSci
— Leah Wasser 🦉 (@LeahAWasser) October 5, 2022
The Twitter convo made me realize that there are many different perspectives to consider when addressing this question.
More specifically, pyOpenSci is interested in the health of packages that support science. So we may need to build upon existing efforts that have determined what metrics to use to quantify package health and customize them to our needs.
pyOpenSci does not focus on foundational scientific Python packages like Xarray, Dask, or pandas. Those packages are stable and already have a large user base and maintenance team. Rather, we focus on packages that are higher up in the ecosystem. These packages tend to have smaller user bases and smaller maintainer teams (often a single volunteer).
Our package maintainers:
I’d be remiss if I didn’t mention that there are several projects out there that are deeply evaluating open source package health metrics.
Several people, including Nic Weber, Karthik Ram, and Matthew Turk, mentioned the value and thought put into the CHAOSS project.
Is this something that the @CHAOSSproj work could be specialized and applied to scientific software?
— Matthew Turk (@powersoffour) October 5, 2022
CHAOSS (https://t.co/moiMUeDuS3) has been thinking about this more generally, but it's interesting to think about some of the more "science" aspects.
— Neil P Chue Hong (he/him|they/them) (@npch) October 6, 2022
I've wondered about "frequency of API changes" - for use in research is it healthier to be "stable" or "move fast/break things"?
Not that controversial! Have you looked into the rich body of work that the @CHAOSSproj community has done? Each metric has been explored in great detail
— Karthik Ram (@_inundata) October 6, 2022
The Software Sustainability Institute, led by Neil P Chue Hong, has also thought about package health extensively and pulled together some data accordingly. Neil was also a critical guiding member of the early pyOpenSci community meetings held in 2018.
We also did some initial work on this in 2017 (see slide 12 of this presentation): https://t.co/1F0iMwfT5g
— Neil P Chue Hong (he/him|they/them) (@npch) October 6, 2022
One topic that I am not delving into in this post is security. Snyk is definitely a leader in this space and was mentioned at least once in the conversation.
Below are some of the metrics that you can easily access via Snyk’s website.
This might be helpful. This website collects various metrics. And here is the example for numpy. https://t.co/YNsRoMgks4
— Kevin Wang (@KevinWangStats) October 6, 2022
And of course the scientific Python project has also been tracking the larger packages:
So, back to the question at hand: what should pyOpenSci be tracking for packages in our ecosystem? Hao Ye (and a few others) nailed it: health metrics are multi-dimensional.
I think, much like "ecological stability" - https://t.co/lJe2Fa0ycR - , "health" here is multi-dimensional and different metrics will capture different facets, such as growth, transparency in governance, stability / backwards compatibility, etc.
— Hao Ye will haunt you for bad keming (@Hao_and_Y) October 5, 2022
I may be a bit biased here considering I have a degree in ecology BUT… I definitely support the ecological perspective always and forever :)
As Justin Kiggins from Napari and CZI points out, metrics are also perspective-based. We need to think carefully about the organization’s goals, what we need to measure as a marker of success, and what serves as a flag of potential issues.
See insightful thoughts below:
I think that relevant metrics really depend on who is evaluating "health" and what their needs are.
— Justin Kiggins (@neuromusic) October 5, 2022
From UXR work led by @ObusLucy, we found that what users of open source bioimaging plugins are looking for depends on whether they are looking at plugins for general purpose analyses or niche/emerging analyses.
— Justin Kiggins (@neuromusic) October 5, 2022
In the former case, they look for signals of usage (downloads, citations) and in the latter, signals of maintenance and support (commits, comments by dev on issues, etc).
— Justin Kiggins (@neuromusic) October 5, 2022
I suspect this is different from what a funder who is interested in sustainability or a corporation who is interested in their software supply chain would look for to define "health"
— Justin Kiggins (@neuromusic) October 5, 2022
Alas, it is true that metrics designed for the reporting a funder requires for a grant may differ from metrics designed for internal evaluation that informs program development. pyOpenSci has a lot to unpack there over the upcoming months!
Based on all of the Twitter feedback (below), and what I think might be a start at what pyOpenSci needs, I organized the Twitter conversation into three buckets:
These three buckets are all priorities of pyOpenSci.
DEIA is another critical concern for pyOpenSci but I won’t discuss that in this blog post.
So here I start with Python package infrastructure found in a GitHub repository as a preliminary measure of package health. When I think of infrastructure, I think about the files and “things” available in a repository that support its use. I know that no bucket is perfectly isolated from the others, but I’m taking a stab at this here.
The code for many open source software packages can be found on GitHub. GitHub is a free-to-use platform built on top of git, a version control system. Version control allows developers to track historical changes to code and files, while GitHub adds the ability to communicate openly, review new code changes, and update content in a structured way.
Ivan Ogasawara is a long-time advisor, editor, and member of the pyOpenSci community. He’s also generally a great human being who is growing open science efforts such as Open Science Labs, a global community devoted to education efforts and tools that support open science.
Ivan was quick to point out some basic metrics offered by GitHub, which follow their community standards guidebook, available online here.
Maybe not totally related, but github has a section called community standards that could be used as reference, for example: https://t.co/wmu1bDdcQR
— XMN (@xmnlab) October 8, 2022
Actually, it’s totally related, Ivan! Let’s have a look at the pyOpenSci contributing-guide GitHub repository to see how we are doing as an organization.
Note that we are missing some important components:
Um… we’ve got some real work to do on our guides and repos, y’all. We need to set a better example here, don’t we? Help is welcome if you are reading this and wanna contribute. Just sayin’…
The GitHub minimum requirements for what a software repository should contain are a great start towards assessing package health. In fact, I’ve created a TODO to add this URL of checks to our pre-submission and submission templates, as these are things we want to see too, and also to update our repos accordingly.
Health check #1: are all GitHub community checks green?
Looking at these checks more closely, you can begin to think about different categories of checks that broadly look at package usability (README, description), community engagement (code of conduct, templates), etc.
The GitHub list includes:
But these checks don’t look at what’s in that README, or how the issue templates are designed to invite contributions that are useful to the maintainers (and that guide new potential contributors).
In short, GitHub’s checks are excellent but mostly focused on exterior infrastructure. They don’t check the content of those files and items.
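A first-pass presence check of that kind is easy to sketch. Below is a hedged example (not an official pyOpenSci or GitHub tool; the filename list is my own approximation of what GitHub’s community checklist covers) that reports which community files are missing from a local clone of a repository:

```python
# Hypothetical sketch: check a local repository directory for the
# community files GitHub's community-standards checklist looks for.
from pathlib import Path

# Approximation of the files GitHub's checklist covers (assumption:
# filenames live in the repository root, which is the common layout).
COMMUNITY_FILES = [
    "README.md",
    "LICENSE",
    "CODE_OF_CONDUCT.md",
    "CONTRIBUTING.md",
]


def missing_community_files(repo_path):
    """Return the community files absent from a repository directory."""
    repo = Path(repo_path)
    return [name for name in COMMUNITY_FILES if not (repo / name).exists()]
```

A check like this only confirms the files exist; as noted above, it says nothing about whether the README or templates are actually any good.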
As Chris mentions below, having a clearly stated goal and intention, likely articulated in the README file, is a sign of a healthy project. This goal is ideally developed before development begins. Further, if well written, it helps keep the scope of the project manageable.
To this point - i think that an example of a healthy project behavior is that it explicitly states its technical and organizational goals and intentions
— Chris Holdgraf (@choldgraf) October 6, 2022
Another topic that came up in the discussion was testing and test suites. Evan, who has been helping me improve our website navigation, suggested looking at test suites and what versions of Python those suites test against.
My initial reaction is that it should have to do with the presence and quality of automated tests, and the versions of Python those tests are run against.
— Evan (he/him) (@darth_mall) October 5, 2022
I can imagine a small, mature package needs little more than minor updates to run on newer versions of Python.
Test suites are critical not only to ensure the package functionality works as expected (if the tests are designed well); they also make it easier for contributors to check that changes they make in a GitHub pull request don’t break things unexpectedly.
Tests can also run in a Continuous Integration (CI) workflow to ensure code syntax is consistent (e.g., via linting tools such as Black) and to check documentation builds for broken links and other potential errors.
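To make that concrete, here is a hypothetical GitHub Actions workflow (a sketch, not pyOpenSci’s actual CI configuration; the action version tags will drift over time) that runs Black in check mode on every pull request:

```yaml
# .github/workflows/lint.yml -- hypothetical example
name: lint
on: [pull_request]

jobs:
  black:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - run: pip install black
      # --check fails the job if any file would be reformatted
      - run: black --check .
```

The same pattern extends to link checkers and documentation builds: each becomes a job that fails loudly before a change is merged.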
How should pyOpenSci handle Python versions supported in our review process?
In fact, the website that you are on RIGHT NOW has a set of checks that run to test links throughout the site and to check for alt tags in support of accessibility (alt tags support people using screen readers to navigate a website).
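An accessibility check like that can be sketched with Python’s standard library alone. This is a hedged illustration (not the actual check our site runs) that scans HTML for `img` tags with no alt text:

```python
# Hypothetical sketch of an alt-tag accessibility check: find <img>
# tags in an HTML document that lack alt text.
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.missing = []  # src values of images lacking alt text

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            # Flag images with no alt attribute or an empty one.
            if not attrs.get("alt"):
                self.missing.append(attrs.get("src", "<unknown>"))


def images_missing_alt(html):
    """Return the src of every <img> without alt text in an HTML string."""
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing
```

Run against each generated page in CI, a check like this catches missing alt text before it ever reaches a screen-reader user.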
How the package is installed is another critical factor to consider. While these days most packages do seem to be uploaded to PyPI, some still aren’t. And there are other package managers to consider too, such as Conda.
Lots of thoughts on this... 😂
— Kenneth Hoste (@kehoste) October 6, 2022
One aspect is definitely whether or not the package is published through PyPI, whether it follows standard packaging practices, has a test suite or well documented simple examples of how to use it, etc.
The second topic that came up frequently on Twitter was the issue of maintenance.
Jed Brown had some nice overarching insight here on the things they look at as indicators of both maintenance and bus factor (a risk factor, mentioned below, measuring how many people and institutions support maintenance). More people and more institutions mean lower risk; fewer people and fewer institutions supporting the package mean a higher maintenance risk (or a risk of the package becoming a sad orphan with no family to take care of it).
CI (multi-platform, coverage, static analysis), promptness of reviews, number of distinct institutions who have committed in past 6 months, ditto who have reviewed PRs in past 6 months, promptness of reviews, quality of commit messages and PR discussion.
— Jed Brown (@five9a2) October 5, 2022
How many times have you tried to figure out what Python package you should use to process or download data, and you found 4 different packages on PyPI all in varying states of maintenance?
I’ve certainly been there. So has RenéKat, it seems:
I look to see if issues are being resolved. If it’s not being maintained I’m not going to waste my time installing it.
— RenéKat (@renekat14) October 5, 2022
It’s true. For a scientist (or anyone), it’s a waste of time to install something that won’t be fixed as bugs arise. It’s also not a good use of their time to have to dig into a package repository to see whether it’s being maintained.
pyOpenSci does hope to help with this issue through a curated catalog of tools which will be developed over time.
How do we measure degree of maintenance? The number of issues being addressed and closed? Average commits per month, quarter, or year?
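The commit-cadence idea is simple enough to sketch. The function below (a hedged illustration, not an agreed-upon pyOpenSci metric) takes a list of commit dates and returns the average number of commits per calendar month over the span those dates cover, counting quiet months too:

```python
# Hypothetical maintenance metric: average commits per calendar month
# over the full span of a package's commit history.
from datetime import date


def commits_per_month(commit_dates):
    """Average monthly commit count, including months with no commits."""
    if not commit_dates:
        return 0.0
    months = {(d.year, d.month) for d in commit_dates}
    first, last = min(months), max(months)
    # Number of calendar months in the span, inclusive of both ends.
    span = (last[0] - first[0]) * 12 + (last[1] - first[1]) + 1
    return len(commit_dates) / span
```

The input dates could come from the GitHub API or `git log`; either way, the number is only meaningful relative to similar packages, for the reasons discussed next.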
This could be a relative metric, too. Some package maintainers may spend lots of time on issues, or have too many to handle quickly, as Melissa points out in reply to a comment about evaluating maintenance by looking at issues being closed:
Would that apply to large established projects such as NumPy? My guess is it wouldn't 😉
— Ax=13!!! (@melissawm) October 5, 2022
But alas, I think there are ways around that. We can look at commits, pull requests, and other activity just to see whether anything is happening in the repository, or whether it has gone dark (dark meaning no longer being maintained: no answers to issues, no bug fixes, etc.).
Well, one should look beyond the number of open issues. A lot of them get closed very fast, many prs are merged on a short timescale too. So if you go into a well established repo and see larger numbers, those may still be just the leftover corner cases of decades of usage.
— Brigitta Sipőcz (@AstroBrigi) October 5, 2022
Greg, interestingly, suggested that one might be able to model the expected future lifetime of a package based upon current (and past?) GitHub activity.
Would you accept "expected future lifetime of package" (where "lifetime" means "period of active maintenance") as a measure of health? That feels like something a model could plausibly be trained to predict...
— Greg Wilson (@gvwilson) October 5, 2022
Uh oh! But are commits enough, Kurt asks? Is there such a thing as a perfect project?
Could a project with no recent commits be healthy? What if it needed no commits?
— Kurt Schwehr, PhD (@kurtschwehr) October 6, 2022
Koen had a more broadly profound thought that would be ideal to consider when creating a new package, especially a small package that supports specific scientific workflows.
Does it do one thing, well? Really well?
Yes, please.
My experience from R. None apply if packages stick to the Unix philosophy of doing one thing really well. This will lead to packages with considerable uptake but little development. Base code of {snotelr} is mostly unchanged since inception (12K users).https://t.co/3oKwmeBaA8
— Koen Hufkens, PhD (@koen_hufkens) October 5, 2022
While this might be challenging to enforce in peer review, it is a compelling suggestion.
There is a developer perspective to consider here too. Yuvi Panda pointed out a few items that they look for:
Remember, bus factor has nothing to do with buses, but there is some truth to the analogy of what happens when the wheels fall off.
Without being specific to open science, I always look at: 1. how frequently are commits being merged? 2. what does bus factor look like (is it just 1 person?), 3. What is cadence of release
— Yuvi Panda (@yuvipanda) October 6, 2022
One thought I had here was to look at commits from the maintainer relative to total commits, to get a sense of community contribution (if any).
thanks Yuvi! i hadn't heard of the term bus factor before but was thinking that it would be interesting to look at how many commits do NOT belong to the maintainer in a ratio type of form. Since we have the maintainer information from our reviews we could potentially do that.
— Leah Wasser 🦉 (@LeahAWasser) October 6, 2022
The CHAOSS project has an entire working group devoted to risk.
Or perhaps pyOpenSci could ask maintainers what their perceived risk is, i.e., how long they think the package might remain maintained. They will know better than anyone what their funding environment and support are like.
Erik suggested that metrics can be dangerous and somewhat subjective at times. Akin to the whole “maps can lie; data can lie too” idea. OK, it’s our interpretation that is the risk, or the lie, not the data itself, but… you follow me, yeah?
Ask developers how comfortable they would be to depend on the package for a new project. I think "health" is largely subjective and I don't trust metrics without context.
— Erik Welch (@eriknwelch) October 5, 2022
Some, including Pierre, brought up the idea of consistent releases. Not necessarily frequent releases, but some consistency to demonstrate that the package is being updated.
Yes. Regular releases is a sign of good health. But given the fact that many scientific projects are often maintained by few people I would avoid any normalization. I'm usually super happy with 1 or 2 releases per year.
— Pierre Poulain (@pierrepo) October 5, 2022
Other discussions revolved around semantic versioning and release roadmaps.
Community adoption of a scientific Python package was another broad category seen over and over throughout the Twitter conversation.
While we’d love to quantify citations, the reality of this is that most people don’t cite software. But some do, and we hope you are one of them!
Citations, naturally! 😉
— Jacob Deppen (@jacob_deppen) October 6, 2022
The tweet below looks at stars and commit dates as signs of community adoption and maintenance.
Derivative of 🌟 with respect to ⏲️ plus date of last commit!!
— MLinHydro (@MLinHydro) October 6, 2022
As Chris Holdgraf mentions below, a package can reach a point where the same type of activity can have varying impacts on the perceived level of maintenance. Many users opening issues can represent community interest and perhaps even community adoption, yet massive volumes of unaddressed issues might represent unresponsive maintainers.
Or perhaps the maintainers are just overwhelmed by catastrophic success.
I think a steady stream of issues implies a lot of user interest, though I can tell you from first-hand experience that it does not mean a project is healthy :-)
— Chris Holdgraf (@choldgraf) October 7, 2022
I think it misses one of the most stressful anti-patterns for OS projects: Catastrophic Success
h/t @fperez_org :-D
Yup
it is the equivalent of when a small bakery gets written up in the New York Times, has a huge influx of customers, and collapses under the weight of demand. I think it's an outcome we don't think about enough ahead of time
— Chris Holdgraf (@choldgraf) October 7, 2022
But I need at least 5 (thousand) croissants, now. ANDDDD so does my friend.
Juan agrees that a steady stream of issues suggests adoption, especially since opening issues on GitHub suggests that the users have some technical literacy.
As others have said, it’s multidimensional, but this article argues that a steady stream of issues = a community of active and engaged users — often somewhat programming-literate since it’s GH. I find that argument compelling.https://t.co/X2vY2QxRfV
— Juan Nunez-Iglesias (@jnuneziglesias) October 7, 2022
I’d be remiss if I didn’t at least mention that some of the discussion steered towards the community around tools. For instance, Evan brought up community governance as a priority.
Governance was another aspect I was going to suggest. The “benevolent dictator for life” model is… risky
— Evan (he/him) (@darth_mall) October 5, 2022
But the reality of our users was summarized well here by Tania. Most scientists developing tools are trying to simplify workflows with repeated code, workflows that others may be trying to build for the same purpose. They aren’t necessarily focused on community, at least not yet.
Also note - a lot of folks developing scientific software are more interested in the pragmatic side of open source (i.e availability, making the codebase public and accessible) rather than building a community around it.
— ✨Tania Allard 💀🇲🇽 🇬🇧 she/her (@ixek) October 6, 2022
Further, capturing metrics around community is hard, as Melissa points out. Most of the above resources don’t capture these types of items. And how would one quantitatively capture the work of a community manager?
Depends on what is "health". Sustainability? Funding? Maintainability? Culture? I think most metrics are proxies to some other thing we want to measure, but are not representative. For ex looking at github, a bunch of the work done by community managers is not captured at all.
— Ax=13!!! (@melissawm) October 5, 2022
But it’s a great start!
Joel rightfully noted that my original tweet seemed less concerned with package quality and more concerned with community and use. I think they are right. We are hopeful that peer review metrics and recommended packaging guidelines will get at package quality.
I guess that depends on whether you're concerned about the quality of the package or the popularity of the package.
— Joel Bennett (@Jaykul) October 5, 2022
Most of your proposed metrics seem to be about size and activity of the COMMUNITY using the package rather than quality or reliability of the package itself.
There is a lot of work to do in this area, and a lot of existing work to learn from. It’s clear to me that we should start by looking at what’s been done and what people are already collecting, and then customize it to our needs.
A few items that stand out to me, which we could begin collecting now around package maintenance and community adoption, are below. This list will grow, but it’s a start.
I will share a more comprehensive list once we pull that together as an organization in another blog post. Stay tuned for more!
If you have any additional thoughts on this topic or if I missed important parts of the conversation please share in the comment section below.
This blog is the second in a 3-part series. In the previous blog post, I discussed why the health of (Python) open source packages should matter to you as a scientist (and as a person who values and uses free and open source tools in your workflow). In this post, I’ll talk about why collecting metrics is critical both to program development and to the success of open source tools. I’ll wrap up the series with a discussion of what types of package metrics pyOpenSci should collect around the free and open source Python packages that you use.
NOTE: all of this is in the context of a conversation on Twitter. It is not a comprehensive perspective on the final metrics that pyOpenSci plans to collect.
I’ve created a few open-science-focused programs from the ground up: one at NEON and another at CU Boulder. When building a new program, one of the first things I do (after defining the mission and goals) is define the metrics that constitute success.
These metrics are critical to define early because:
If you have evaluation or education in your professional background like I do, you may even create a logic model to map activities to outcomes and goals. This logic model helps you define how to collect the data that you need to track outcome success.
As I am building the pyOpenSci program, I find myself thinking about what metrics around Python scientific open source software we want to track to better understand:
As mentioned above, collecting metrics from the start allows you to hit the ground running with data you can compare to future data. So while it may not be the work you want to do, it will help your future self.
For pyOpenSci, collecting metrics allows us in the future to evaluate our programs and adaptively change things to make sure we are getting the outcomes that we want.
Outcomes such as:
In a previous post, I spoke generally about why open source should matter to you as a scientist and as a developer or package maintainer.
To better understand what data we should be collecting to track our packages’ health over time, I went to Twitter to see what my colleagues around the world had to say. That conversation resulted in some really interesting insights.
In my next blog post, I will summarize the discussion that happened on Twitter.
Most importantly, it allowed me to begin to break down and group metrics in terms of pyOpenSci goals.
We hope that:
We need metrics to understand things like
Based on all of the feedback on Twitter (summarized in the next post), and on what I think pyOpenSci needs to consider, I organized the conversation into four broad categories:
These four categories are not by any means mutually exclusive. They are merely a way to begin organizing an engaging and diverse conversation. All of the categories are priorities of pyOpenSci.
Leave any feedback that you have in the comments section below.