Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

General Concerns #140

Closed
benwilson512 opened this issue Dec 23, 2015 · 6 comments
Closed

General Concerns #140

benwilson512 opened this issue Dec 23, 2015 · 6 comments

Comments

@benwilson512
Copy link
Collaborator

Over the course of working through the PR to fix #128 and #129 I've run into a variety of different problems that I'd like to discuss. These problems are unfortunate, because Exq serves a very important role in the community. Many people are coming from a Ruby / Rails background and Exq helps make integrating Elixir with existing Ruby systems much easier. Nonetheless, the problems I present here individually and in aggregate are serious enough to warrant great caution on the part of any who might consider using this library.

I bring these problems up not only because I want Exq to succeed, but because I only have the available time to fix a handful of the most serious ones. A lot of valuable work has gone into this library, particularly around managing the redis association in such a way as to be compatible with Sidekiq. I'm hopeful that this post will bring about awareness of the situation so that others can address what I cannot.

NOTE: Two of these I've already created separate issues for, but I include them here for the sake of completeness.

* indicates that it will be at least partially addressed in benwilson512#1

Severe Issues

The ambiguities around supervision in general and job supervision in particular mean that I have no idea how this library behaves when I call ``./bin/myapp stop` via exrm for example. While it shuts down is it possible for a job to still get started? What happens to running jobs? Clarity here is paramount as I need to be sure that deploying new code to production will not orphan jobs.

General Issues

These are a hodgepodge of anti-patterns, code smells, and barriers to contribution.

  • *No real top level Supervisor
  • *Every GenServer has its own one_for_one supervisor for no discernable reason.
  • *Hardcoded configuration defaults (not in themselves bad) exist in many disparate locations, often redundantly.
  • Max restarts in supervisors are higher than normal, and in some cases VERY high.
  • Configuration not actually passed through to all children
    • The {:ok, sup} = Exq.Enqueuer.start_link([port: 6379]) example from the readme would seem to indicate that individual components of the system can be started with custom configuration. However, the various hardcoded Code.get calls mean that some parts will still fallback to Exq defaults whereas others will actually use the provided configuration.
  • Queue state is largely duplicated between the manager's process state and an :ets table that is only accessed by that same process.
  • handle_call({:retries}, _from, state) why is :retries wrapped in a tuple? This pattern is ubiquitous.
  • GenServer.start_link(__MODULE__, [opts], []) and then init([opts])
    • Also ubiquitous. The needless wrapping and unwrapping of values adds noise and confusion.
  • handle_info catch all clauses with Logger.error("INVALID MESSAGE #{info}")
    • Unlike handle_call and handle_cast, handle_info is not constrained to only receive pre-defined messages. It is not necessary to go out of the way to Logger.error them.
  • handle_info / terminate / stop clauses included in modules that don't do anything. use GenServer provides these basic callbacks already
  • Every test has Code.require_file "test_helper.exs", __DIR__ at the top. mix test runs test_helper by default
  • api.ex and enqueuer.ex contain a bunch of identical functions that do a GenServer.call or cast to an unspecified pid.
  • Tests are very brittle
    • The seed is hard coded
    • Global state via mix config and process name registration abounds
    • :timer.sleep calls abound.
  • Documentation: none
  • TypeSpec: none

Going Forward

These are the problems I've run into while trying to fix the supervision architecture. I have not yet had time to review in detail job lifecycle management, anything related to redis usage, nor the messaging patterns of the application.

I hope to have time during the next week or so to finish the changes I've started to improve the supervision tree. Notably, this may not include supervising the workers, as that should probably be addressed as part of a general look at job life cycle management. We will see.

Afterward, I believe the most useful next step would be documentation. What does each GenServer do? How does it interact with the other GenServers? What is the lifecycle of a job? Some clarity to these questions would go a long way towards facilitating further contributions that can improve upon the current state of affairs.

Once again I really do appreciate the effort that's been put into Exq, it is a valuable part of the Elixir community.

Thanks!

@akira
Copy link
Owner

akira commented Dec 23, 2015

Hi Ben,

Thanks for the detailed writeup.

I agree that there are definitely issues to be resolved, if we can discuss and prioritize those then we can create issues and start fixing them. Recently I have been focusing on getting the Redis job queueing and execution working with retries and reliable queueing - so that job are retried when failed, and in progress jobs are not lost if a worker dies (or orphan jobs as you mentioned). I wanted to get these in before doing any major architectural changes, and I feel like the system works well now in practice. But as you mentioned, it can definitely can be improved to make it easier to maintain as well as more idiomatic in terms of architecture, syntax, and supervision (I started committing to this in 2013 when there weren't as many great Elixir resources as today)

I think the timing is great now to take a step back and look at some of the patterns used and take a stab at improving the code base. It also looks like you're handling a good amount of these which is great.

I definitely agree with most of the things you mentioned. These are issues to be addressed (besides the ones included in your refactor):

  • documentation would be there at least at a high level for each GenServer and detailing the job lifecycle.
  • Shutdown process - as you mentioned, I think this can be improved to make sure supervision is setup so that shutdown works correctly
  • Improving worker supervision so that workers shut down along with the entire tree. Right now, worker processes are ephemeral and spawned for each job, but this is an issue with shutdown, and I think could be improved overall. Note that if worst case a worker crashes, the job is saved off to a "backup_queue" atomically as it is popped off the job queue, and gets re-enqueued in system start, so the job wouldn't get lost.
  • Queue state - this is an issue only with dynamic subscriptions that get added after the fact - seems like queue state should be in a separate process or shared :ets table started started at the supervisor level so it survives restarts ( issue with dynamically adding / removing queues Apply error kernel pattern to manager state #117). Let me know if you had thoughts on one or the other.
  • Tests - a lot of the tests are integration "black box" tests, so some of them are timing dependent. They work in practice, but it would be great if we can remove the timeouts where possible (for example use receive with expected messages from workers where possible). I think it would also be great to have more unit level tests as well. While the integration tests can be brittle, I have found them valuable in detecting issues and they usually pass consistently. The failure condition / flakey connection tests are definitely brittle, and I don't think are as useful in a normal flow. I left them in there more as a guideline of what needs to be handled.
  • Lower max supervisor restarts as suggested
  • Syntax issues, and things like tuples in messages, unneeded implementation handle_info / terminate / stop should be able to be fixed pretty easily
  • "api.ex and enqueuer.ex contain a bunch of identical functions that do a GenServer.call or cast to an unspecified pid." - the enqueuer is a GenServer used for getting jobs into Redis, and can be fired up just for doing that in the case where you just need to pass messages to another system. Right now it's also doing double duty for accessing some of the stats in the UI, and that should definitely be split out.

Some comments:

  • The "record_process" in Stats is basically just for stats that show up in the dashboard. The job retries are done at the manager level: https://github.com/akira/exq/blob/master/lib/exq/manager/server.ex#L120
  • I'm pretty sure configuration settings for things like Redis connection are passed down correctly and use Keyword with a Config as default. I did notice a few configuration settings like "timeout" that are not passed down correctly.
  • The :ets table was in use until one of the recent commits. I would be in favor of leaving it until most of the refactoring is done, in case it is needed.

After discussion, we can split these out into issues and can be there to be addressed.

Thanks!

@akira
Copy link
Owner

akira commented Dec 25, 2015

@benwilson512 Good suggestion about the documentation, I've started working on a branch, see #141.

I was also mistaken regarding the stats and retrying jobs, good catch, that definitely needs to be fixed. I can do that.

I was also wondering if you had thoughts about using a GenEvent server to broadcast events about stats / jobs vs directly sending to the different components? Seems like one of the complication is that Exq allows custom named processes, so not sure if it will simplify that a bit (I noticed a lot of the libraries don't do this and just do module name).

@akira
Copy link
Owner

akira commented Jan 5, 2016

Addressed by #146 and #147:

  • Job Workers use GenServer.start calls
  • Jobs restarting by worker instead of stats server

Addressed by #148 and #152:

  • Max restart lowered for Manager
  • Removed single argument in GenServer wrapped in tuple
  • Removed wrapping and unwrapping of values in GenServer init
  • Simplified where possible:
    • :timer.sleep calls in tests
    • Global state via mix config and process name registration in tests
  • duplication of api.ex and enqueuer.ex functions
  • removed handle_info use of Logger.error
  • removed catch all handle_call and handle_cast
  • removed callbacks provided by GenServer

Documentation provided by #141

@benwilson512
Copy link
Collaborator Author

Hey! I apologize for taking so long to respond, the holidays required a lot of travel on my part and I wasn't able to make time.

Your comments have been really encouraging, and I'm confident that the project is in good hands. I have not had much time to make progress on my branches, but I hope to this week. I'm really excited to see the progress made so far!

@akira
Copy link
Owner

akira commented Jan 8, 2016

@benwilson512 no worries, hope you had a good holiday, and looking forward to the changes!

@akira
Copy link
Owner

akira commented Jan 29, 2016

I believe most of these have been addressed except these which are listed issues:

Will close this issue unless any objections

@akira akira closed this as completed Jan 30, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants